YUCHEN005 / STAR-Adapt

Code for paper "Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models"

Data preparation is VERY SLOW #1

Closed: egorsmkv closed this issue 2 weeks ago

egorsmkv commented 1 month ago

First of all, thanks for your work!

I have made a fork of this project - https://github.com/egorsmkv/STAR-Adapt-uk

My test has shown that the fine-tuning pipeline is very slow, specifically the feature extraction / data preparation step.

Dataset: Common Voice 10, Ukrainian subset.

The process may take almost 15 hours...


Is there a way to speed it up?

YUCHEN005 commented 1 month ago

Hi, thanks a lot for your attention! The speed you observe is normal. It's slow because 1) we rewrite the decoding to also extract scores and perform utterance-level filtering (which requires decoding each utterance 5 times), and 2) Whisper large has 1.5B parameters.

To speed it up, we suggest you 1) use beam search (instead of 5-time repeated decoding) for utterance-level filtering, or 2) use a smaller Whisper model.
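For example, a rough sketch of option 2) combined with a single beam-search pass might look like the following (assuming the openai-whisper package; the model name and audio path are placeholders, not the exact code in this repo):

```python
# Illustrative sketch only: decode once with beam search on a smaller checkpoint,
# instead of running sampling-based decoding 5 times per utterance.
# Assumes the openai-whisper package; "small" and the wav path are placeholders.
import whisper

model = whisper.load_model("small")  # ~244M params vs. ~1.5B for large

result = model.transcribe(
    "example.wav",     # placeholder audio file
    beam_size=5,       # single beam-search pass
    temperature=0.0,   # deterministic decoding, no repeated sampling
)
print(result["text"])
```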

Btw, our STAR scores are designed for efficient finetuning, which usually requires fewer than 1-2k utterances for training, so in our experiments speed was not an issue. We therefore suggest you evaluate the performance vs. the number of training samples.

egorsmkv commented 1 month ago

Thank you, @YUCHEN005!

egorsmkv commented 1 month ago

I tried to use a smaller model, but it seems the code only supports the large model.

I used whisper-small, and these lines were a problem:

[screenshot of the hard-coded attention layer/head indices]

The number of layers in the smaller model is different.

YUCHEN005 commented 1 month ago

Hi, this line specifies a hyper-parameter: which layer's and which head's attention matrix to use. We empirically found that using the last two layers works better; the choice of head id may vary but is not very important. You can set the layer id according to the number of layers in the smaller model.
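
For example, here is a minimal sketch of deriving the layer id from the model configuration instead of hard-coding it (assuming a HuggingFace Whisper checkpoint; the variable names are illustrative, not the exact code in this repo):

```python
# Illustrative sketch only: pick the attention layer relative to the model's depth
# so the same logic works for whisper-small, -medium, or -large.
# Assumes a HuggingFace Whisper model; names here are placeholders.
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

num_layers = model.config.decoder_layers           # 12 for small, 32 for large
num_heads = model.config.decoder_attention_heads   # 12 for small, 20 for large

attn_layer_id = num_layers - 1   # one of the last two layers, as suggested above
attn_head_id = 0                 # head choice is reported as not critical

print(attn_layer_id, attn_head_id)
```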