ARman-AT commented 3 months ago

I am fine-tuning my own Conformer-Transducer model, and everything was fine until epoch 2. Then the validation WER suddenly jumped from under 0.2 to 1, and every validation batch now takes more than 5 seconds, while training itself still looks normal. This is my configuration:

#!/bin/bash

cat fine_tune_rnnt.sh

python speech_to_text_rnnt.py \
    --config-path=/home/fine_tune \
    --config-name=Conformer-Transducer-Char.yaml \
    +model.train_ds.manifest_filepath=/home/train_filimo_agp_commonvoice.json \
    +model.validation_ds.manifest_filepath=/home/val_filimo_agp_commonvoice.json \
    trainer.devices=1 \
    trainer.accelerator='gpu' \
    trainer.max_epochs=10 \
    model.train_ds.batch_size=24 \
    trainer.accumulate_grad_batches=16 \
    model.train_ds.num_workers=8 \
    model.train_ds.pin_memory=True \
    model.validation_ds.pin_memory=True \
    +model.train_ds.parser=base \
    model.validation_ds.batch_size=16 \
    model.validation_ds.num_workers=8 \
    trainer.val_check_interval=0.2 \
    trainer.log_every_n_steps=100 \
    +model.augmentor.transcode_aug.codecs=[g711] \
    ++model.augmentor.transcode_aug.prob=0.1 \
    +model.augmentor.white_noise.min_level=-50 \
    +model.augmentor.white_noise.max_level=-30 \
    +model.augmentor.white_noise.prob=0.1 \
    trainer.precision=16 \
    model.train_ds.shuffle=1 \
    +model.augmentor.rir_noise_aug.rir_manifest_path=data/rir/processed/rir.json \
    ++model.optim.lr=0.1 \
    ++model.optim.sched.warmup_steps=0 \
    +init_from_nemo_model : /home/roshan/Conformer-Transducer-Char-1-23.nemo

I am using NeMo 1.23 with CUDA 12.1 and cuDNN 8902 on an RTX 3090. Could anyone help? Thank you in advance.

nithinraok commented 3 months ago

Your learning rate is very high; start with 3e-4 for fine-tuning. The model might have exploded to NaNs. Could you switch to the FastConformer architecture instead of Conformer? FastConformer is quick to train. Start with fp32, then move to precision 16 once your training setup is fine and the curves look normal.

Fastconformer configs: https://github.com/NVIDIA/NeMo/tree/main/examples/asr/conf/fastconformer
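
For reference, here is a minimal sketch of how the two suggestions above might map onto the command from the original post. It assumes every other override stays exactly as posted, and it only prints the resulting command instead of launching training. The `OVERRIDES` array name is illustrative and is not part of NeMo.

```shell
#!/bin/bash
# Sketch only: the two overrides changed per the advice above. All other
# overrides from the original command are assumed to stay as posted.
OVERRIDES=(
  "--config-path=/home/fine_tune"
  "--config-name=Conformer-Transducer-Char.yaml"
  "++model.optim.lr=3e-4"   # was 0.1
  "trainer.precision=32"    # was 16; revisit once the loss curves look normal
)

# Print the resulting command for inspection instead of starting a run.
echo "python speech_to_text_rnnt.py ${OVERRIDES[*]}"
```

Once the fp32 run is stable, `trainer.precision` could be set back to 16 as suggested.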

ARman-AT commented 3 months ago

Thanks for replying. I actually changed the precision from 16 to 32, and that solved my problem.

NVIDIA / NeMo

slow validation process #9202
