NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Conformer Model Training Details. #4183

Closed evilc3 closed 2 years ago

evilc3 commented 2 years ago

Hi, I am training a Conformer model and need to know a few things about the training process.

My training params:

The questions I had were:

  1. What hardware was used to train the ctc_conformer_large and ctc_conformer_medium models? Which GPUs, and how long were they trained (in hours)?
  2. If my learning rate has dropped to 1e-6, will the model still continue learning, or should I stop at this point? I'm asking because with the Noam scheduler the learning rate has already decreased quite a bit in the first 2 epochs, and I have another 98 to go.
  3. Out of curiosity: what happens if my batch size is 4 and I set a really high grad accumulation like 16? Will this work?

@titu1994

thanks

titu1994 commented 2 years ago

  1. V100s (32 GB) or A100s (80 GB) were used for training these models. You don't need that much compute to fine-tune, though.
  2. That LR is too low. Pick your warmup and LR scale so that your LR doesn't drop below 1e-5 at worst. Best to keep it around 5e-5 or slightly higher.
  3. We don't experiment with such high grad accumulation; we normally don't get good results with such high values. Try to get a batch size of at least 128 or 256, which should lead to somewhat stable convergence, but grad accum might also work.
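
For reference, a minimal sketch of how those two knobs are usually set when overriding the example Conformer config from a training script. The field names follow the NeMo example configs; the config path and the concrete values here are assumptions to adapt to your setup:

```python
# Sketch only: raise the scheduler floor and use grad accumulation to reach a
# reasonable effective batch size. Paths and values are placeholders.
from omegaconf import OmegaConf

cfg = OmegaConf.load("conformer_ctc_bpe.yaml")   # assumed local copy of the example config

cfg.model.optim.sched.min_lr = 1e-5              # keep the LR floor at 1e-5 instead of 1e-6
cfg.model.train_ds.batch_size = 8                # per-GPU batch size
cfg.trainer.accumulate_grad_batches = 2          # effective batch = 8 * 2 * num_gpus
```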

evilc3 commented 2 years ago

Thanks for the quick reply @titu1994. This has cleared up a lot of doubts.

I am using a new tokenizer for the same language, so it's not quite fine-tuning, I guess. I don't have access to A100s; how many V100s (32 GB) did you use? Just one?

I checked conformer_ctc_bpe.yaml and min_lr is set to 1e-6. Should I set it to 1e-5?

titu1994 commented 2 years ago

Conformer converges fast, and the longer you train the better it gets (at least before it overfits on the dataset). On a V100 you can use batch size 8 and accumulate grads 2x.

We trained on global batch sizes of 2048, so very large multinode experiments. That's not required for fine tuning.

If you need to change the tokenizer but keep the same language, is there any special reason? If special characters are added, the CTC model's WER will not go down beyond a point.

If you're using Noam with the default LR and warmup, you will never even reach 1e-5 - the limiting LR of that Noam setup is around 2e-5 after millions of steps.
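
To make that concrete, here is a quick back-of-the-envelope check of the Noam schedule; the scale (2.0), d_model (512) and warmup (10k steps) are assumed defaults, so plug in your own values:

```python
# Noam schedule: lr = scale * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
def noam_lr(step, scale=2.0, d_model=512, warmup=10_000):
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

for step in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"step {step:>10,}: lr = {noam_lr(step):.2e}")
# Even at 10M steps the LR has only decayed to ~2.8e-5, so with these defaults
# it never reaches 1e-5 in any practical run.
```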

evilc3 commented 2 years ago

I have added punctuation to my tokenizer along with reserved words. Is there any reason why punctuation was not added to the conformer_large vocab? Can CTC models not learn punctuation?

Also, the size is only 128. I was under the impression that a larger vocab gives better results.

titu1994 commented 2 years ago

Large vocab = better scores applies to NLP. For CTC models you usually have to balance your vocab size so that the alignment task is not too difficult, while still reducing the encoded text length enough to work with 4x or 8x downsampling.
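
As a rough illustration of that balance (not NeMo code, just the length arithmetic behind it; frame shift and subsampling are assumed values):

```python
# CTC needs the encoded label sequence to be no longer than the number of
# acoustic frames left after subsampling; a bigger vocab shortens the labels.
def ctc_length_ok(audio_sec, num_tokens, frame_shift_ms=10, subsampling=4):
    frames = int(audio_sec * 1000 / frame_shift_ms) // subsampling
    return num_tokens <= frames

print(ctc_length_ok(10.0, 120))  # True:  250 frames can cover 120 tokens
print(ctc_length_ok(10.0, 300))  # False: 300 tokens cannot fit in 250 frames
```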

Punctuation is not added for a reason: it doesn't work well in ASR. CTC models don't have an implicit LM in them, unlike RNNT, so it's even harder for them to learn punctuation. They may manage it somewhat, but it won't be done well.

evilc3 commented 2 years ago

Oh, thanks for the info, I understand now. So the best pipeline for CTC ASR is ASR -> language model (N-gram) -> punctuation model.

A few more details: on my validation set the pretrained model got 18% WER (it has punctuation), and the current model I am training gets 10%.

I will leave it training for a couple of epochs as it's nowhere close to overfitting. But since I set the initial LR to 0.1 (using fp16, to avoid NaN loss), my LR is about to go below 1e-5. That's the only thing I'm worried about; maybe setting min_lr = 1e-5 will help.

titu1994 commented 2 years ago

Interestingly, if your new vocab size is the same 128, then you can load the previous checkpoint and converge faster. In our experience it should normally also reach a better final WER.


Another thing which isn't well documented yet, but does have a tutorial is Adapters support for ASR models (https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/asr/asr_adapters/ASR_with_Adapters.ipynb).

One of the prime cases is where you don't change the tokenizer, but for faster convergence in your case, you could try to change the decoder, load up the weights, add adapters, freeze the rest of the model and just train the adapters + the new decoder. Note that you have to manually do model.decoder.unfreeze() in the section detailing training only the adapters, since you want to train on a new vocab.
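
A rough outline of that recipe in code (following the linked tutorial; the exact adapter module paths and signatures can differ between NeMo versions, and the tokenizer path and adapter dimension are placeholders, so treat this as a sketch rather than the canonical API):

```python
import nemo.collections.asr as nemo_asr
from nemo.collections.common.parts import adapter_modules

# Load a pretrained Conformer-CTC checkpoint.
model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_large")

# Swap in the new tokenizer; this rebuilds the decoder for the new vocab.
model.change_vocabulary(new_tokenizer_dir="<new_tokenizer_dir>", new_tokenizer_type="bpe")

# Add a small linear adapter and enable only that adapter.
adapter_cfg = adapter_modules.LinearAdapterConfig(
    in_features=model.cfg.encoder.d_model, dim=32
)
model.add_adapter(name="domain_adapter", cfg=adapter_cfg)
model.set_enabled_adapters(enabled=False)
model.set_enabled_adapters(name="domain_adapter", enabled=True)

# Freeze the whole model, then unfreeze the adapters and the new decoder.
model.freeze()
model.unfreeze_enabled_adapters()
model.decoder.unfreeze()  # the decoder has to learn the new vocab
```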

Well, it is a somewhat new feature, but we usually see very fast convergence, and it takes less memory + compute than full encoder fine-tuning. If you have time, you could try it out after the current run has finished converging to adequate results.

evilc3 commented 2 years ago

My vocab is 256, so I am only using the encoder.

Thanks so much, I will give it a try. The size of the dataset will also matter for fast convergence, right? I have a dataset of 15k hours.

What happens if later I decide to train with a dataset of just 50 hours? Will this affect the WER, since the amount of data has decreased?

titu1994 commented 2 years ago

Your training dataset is 15k hours? Ignore the adapters stuff then, full finetuning will be much more useful on that much data.

When you fine-tune on just 50 hours, your model will mostly degrade quite a bit if you fine-tune everything. You could freeze the encoder and fine-tune just the decoder, or try adapters on such a dataset without needing to fine-tune the decoder.

We've seen adapters work with as little as 30 minutes of data; 50 hours gives you many options during fine-tuning.
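
For the 50-hour case, the decoder-only option is only a couple of lines. A sketch, with trainer and data setup omitted and the checkpoint name used as an example:

```python
import nemo.collections.asr as nemo_asr

# Load a pretrained checkpoint.
model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_large")

model.encoder.freeze()   # keep the acoustic encoder fixed on the small dataset
# model.decoder stays trainable; set up train/val data and a Lightning Trainer
# as usual, then call trainer.fit(model).
```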

titu1994 commented 2 years ago

Also, the vocab size is yours to determine; you can recompute a tokenizer with size 128, and it would probably be worth it for the improved WER from the pretrained model's decoder layer.
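
If it helps, rebuilding a 128-token BPE vocab can be done with the sentencepiece library directly (a sketch; NeMo also ships a tokenizer-building script under scripts/tokenizers/, which is the more usual route, and the input path here is a placeholder):

```python
import sentencepiece as spm

# Train a 128-token BPE model on the training transcripts (one sentence per line).
spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",   # placeholder path
    model_prefix="tokenizer_128",
    vocab_size=128,
    model_type="bpe",
)
```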

evilc3 commented 2 years ago

Got it, thanks for the help.

VahidooX commented 2 years ago

I would suggest first dropping the punctuation and evaluating the current model on it to have a better metric. We run our experiments on the small and medium sizes with an effective batch size of 2K to speed up training with more GPUs. If you have a limited number of GPUs, you may use grad accumulation to get to something around 256. Batch size 128 may also work in most cases.

As @titu1994 suggested, if you have around 15K hours of English, then you may just fine-tune the model with the same default lr. If you use the Noam scheduler, an lr lower than 0.2 is going to be too low even for fine-tuning on small data. For small data, you may also try lr=0.5.

riqiang-dp commented 2 years ago

May I ask why you suggest setting the lr so that it never goes below 1e-5? I'm fine-tuning a model and it goes below that, but I see the model is still effectively learning (loss and WER going down on train/val).

titu1994 commented 2 years ago

At such a low lr, you're not getting efficient training anymore. Sure, your loss will drop, but even after 100k steps you won't improve much, so it's a waste of compute.

mrkito commented 1 year ago

@titu1994 Hi! Do you have learning curves for other learning rates for Conformer (large)? How did you choose the augmentation params (do you have learning curves)?