Closed egorsmkv closed 2 weeks ago
Hi, thanks a lot for your attention! Your speech data is fine; it's slow because 1) we rewrote the decoding to extract scores and perform utterance-level filtering (which requires decoding each utterance five times), and 2) Whisper large has 1.5B parameters.
To speed things up, we suggest you either 1) use beam search (instead of five repeated decodings) for utterance-level filtering, or 2) use a smaller Whisper model.
Btw, our STAR scores are designed for efficient fine-tuning, which usually requires fewer than 1-2k utterances for training, so speed was not a problem in our experiments. We therefore suggest you evaluate performance versus the number of training samples.
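The beam-search suggestion above can be sketched as follows. This is a minimal, hypothetical illustration (not code from the STAR-Adapt repo): it assumes you already have the sequence log-probabilities of the n-best beam hypotheses (e.g. from Hugging Face `generate(..., num_beams=5, num_return_sequences=5, return_dict_in_generate=True, output_scores=True)`, whose result exposes `sequences_scores`), and uses them for a simple utterance-level confidence filter instead of decoding five separate times. The function names and the 0.5 threshold are illustrative assumptions.

```python
import math

def utterance_confidence(beam_logprobs: list[float]) -> float:
    """Share of probability mass held by the top beam among the n-best.

    A peaked distribution (one hypothesis dominates) suggests the model
    is confident about this utterance; a flat one suggests ambiguity.
    """
    probs = [math.exp(lp) for lp in beam_logprobs]
    return max(probs) / sum(probs)

def keep_utterance(beam_logprobs: list[float], threshold: float = 0.5) -> bool:
    """Hypothetical utterance-level filter: keep confident utterances."""
    return utterance_confidence(beam_logprobs) >= threshold

# One beam clearly dominates -> keep the utterance.
print(keep_utterance([-1.0, -5.0, -6.0]))   # True
# All beams equally likely -> ambiguous, drop it.
print(keep_utterance([-2.0, -2.0, -2.0]))   # False
```

Since beam search produces all n-best hypotheses in a single forward pass, this replaces the 5× repeated decoding with one decoding per utterance.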
Thank you, @YUCHEN005 !
I tried to use a smaller model, but it seems the code only supports the large model.
I used whisper-small, and these lines were a problem:
The number of layers in the smaller model is different.
Hi, this line specifies a hyper-parameter: which layer's and which head's attention matrix to use. Empirically we found the last two layers work best; the head id can vary but is not very important. You can set the layer id according to the number of layers in the smaller model.
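For reference, the advice above can be turned into a small helper. The decoder depths below are the published sizes of the official Whisper checkpoints (tiny: 4, base: 6, small: 12, medium: 24, large: 32); the "last two layers" heuristic follows the comment above. The helper itself is a hypothetical sketch, not part of the STAR-Adapt code.

```python
# Decoder layer counts of the official Whisper model sizes.
WHISPER_DECODER_LAYERS = {
    "tiny": 4,
    "base": 6,
    "small": 12,
    "medium": 24,
    "large": 32,
}

def attention_layer_ids(model_size: str, n_last: int = 2) -> list[int]:
    """Return 0-based indices of the last `n_last` decoder layers,
    i.e. the layers whose attention matrices are used for scoring."""
    n_layers = WHISPER_DECODER_LAYERS[model_size]
    return list(range(n_layers - n_last, n_layers))

print(attention_layer_ids("small"))   # [10, 11]
print(attention_layer_ids("large"))   # [30, 31]
```

So for whisper-small you would point the hard-coded layer id at layers 10-11 instead of the large model's 30-31.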
First of all, thanks for your work!
I have made a fork of this project - https://github.com/egorsmkv/STAR-Adapt-uk
My test has shown that fine-tuning is very slow (specifically the feature extraction / data preparation step).
Dataset (Common Voice 10, Ukrainian subset):
The process may take almost 15 hours...
Is there a way to speed it up?