Closed rrscholarship closed 5 months ago
Hi,
The array of audio is the decoded audio input (it is not calculated). It can be passed as input to the Whisper model to generate the n-best hypotheses, and the audio encoding (the output of the Whisper encoder) is what this setup consumes. You can go through the code in the data_preparation directory to adapt other datasets to this setup. Also refer to this repo (it also contains code for generating n-best hypotheses).
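To illustrate what "decoded audio input" means here: the array is just the raw waveform samples scaled to floats in [-1, 1], which is what `librosa.load()` or the Hugging Face `datasets` Audio feature return automatically. A minimal sketch using only the standard library plus NumPy (the sine-wave WAV built here is synthetic, purely for demonstration):

```python
import io
import wave

import numpy as np

def decode_wav_to_float_array(wav_bytes: bytes) -> np.ndarray:
    """Decode 16-bit PCM WAV bytes into a float array in [-1, 1).

    This mirrors what librosa.load() or the HF `datasets` Audio
    feature do under the hood: the "array" is simply the waveform
    samples rescaled from int16 to floats.
    """
    with wave.open(io.BytesIO(wav_bytes)) as wf:
        pcm = wf.readframes(wf.getnframes())
    samples = np.frombuffer(pcm, dtype=np.int16)
    return samples.astype(np.float32) / 32768.0  # int16 range -> [-1, 1)

# Build a tiny synthetic 16 kHz mono WAV in memory to demonstrate.
sr = 16000
t = np.arange(sr) / sr
tone = (0.1 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)   # mono
    wf.setsampwidth(2)   # 16-bit PCM
    wf.setframerate(sr)
    wf.writeframes(tone.tobytes())

audio_array = decode_wav_to_float_array(buf.getvalue())
print(audio_array.shape, float(np.abs(audio_array).max()))
```

In practice you would not do this by hand: `datasets.Audio(sampling_rate=16000)` resamples and decodes each file into exactly this kind of array for you.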
Also, 1 GPU with 24 GB might not be enough to train LLaMA (7B) with adapters. I used 1 A100 (80 GB VRAM) for training; you can also use multiple smaller GPUs with FSDP.
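A back-of-envelope estimate shows why 24 GB is tight for a 7B model even when only adapters are trained (the adapter size below is an illustrative assumption, not a measured number from this repo):

```python
# Rough VRAM arithmetic for adapter fine-tuning of LLaMA-7B.
# All numbers are assumptions for illustration, not measurements.

PARAMS = 7e9      # frozen base-model parameters
BYTES_FP16 = 2    # bytes per parameter in half precision

weights_gb = PARAMS * BYTES_FP16 / 1e9
print(f"frozen fp16 weights alone: ~{weights_gb:.0f} GB")

# The adapter itself is small, but each trainable parameter carries
# gradients (fp16) and Adam optimizer state (~8 bytes for fp32 m and v).
adapter_params = 50e6  # hypothetical adapter size
adapter_state_gb = adapter_params * (2 + 2 + 8) / 1e9  # weight + grad + Adam
print(f"adapter training state: ~{adapter_state_gb:.1f} GB")

# 14 GB of frozen weights leaves ~10 GB on a 24 GB card for the adapter
# state, activations over long audio-prompt sequences, and CUDA overhead,
# which is why a single 80 GB A100 (or several GPUs with FSDP) is safer.
```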
I did not attach the pretrained adapter weights, as I no longer have access to KAUST compute. I will check with my friends and try to add them.
Do you have the pretrained adapter weights? For some reason I am not able to train the adapter in resource-constrained scenarios. Why does training the adapter need such a large amount of resources?
Another question: if I would like to use another dataset, how do I get the array for that audio, e.g. array([0.0005188 , 0.00085449, 0.00012207, ..., 0.00125122, 0.00076294, 0.00036621])? How is it calculated?
Thanks in advance