NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Mandarin ASR, predictions stay as BLANK sequences #251

Closed · SweetFov closed this issue 4 years ago

SweetFov commented 4 years ago

Hi, appreciate this great framework and your great work! Your pretrained Mandarin QuartzNet has very good performance on the AISHELL test set, so I want to train the same model architecture from scratch on our own Mandarin reading-style data. The train script is:

python -m torch.distributed.launch --nproc_per_node=2 ./jasper_aishell.py --batch_size=8 --num_epochs=150 --lr=0.00005 --warmup_steps=1000 --weight_decay=0.00001 --train_dataset=./word_4000h/lists/train.json --eval_datasets ./word_4000h/lists/dev_small.json --model_config=./aishell2_quartznet15x5/quartznet15x5.yaml --exp_name=quartznet_train --vocab_file=./word_4000h/am/token_dev_train_4400.txt --checkpoint_dir=$checkpoint_dir --work_dir=$checkpoint_dir

The training data is about 500 hours. At first, the predictions are pretty much random; then, after several thousand iterations (before warmup ends), the predictions stay as BLANK sequences for two epochs, like this:

Step: 4650
2020-01-07 09:53:20,694 - INFO - Loss: 110.91824340820312
2020-01-07 09:53:20,694 - INFO - training_batch_CER: 100.00%
2020-01-07 09:53:20,694 - INFO - Prediction:
2020-01-07 09:53:20,694 - INFO - Reference: 提起华华家的事情村民们声声长叹
Step time: 0.39273500442504883 seconds

I have tried learning rates from 0.1 to 0.00005, warmup steps from 1000 to 8000, batch sizes of 4, 8, 16, and 32, and weight decay from 0.001 to 0.00001, and none of those combinations solves the problem. Have you ever encountered this kind of problem?
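
For context, a minimal sketch of the inputs the command above points at, as far as I understand NeMo's ASR data format: --train_dataset/--eval_datasets take a JSON-lines manifest (one object per line with audio_filepath, duration, and text), and --vocab_file is a plain-text file with one token per line. The entry below is purely illustrative; the path and values are made up.

```python
# Purely illustrative: a made-up manifest entry in the JSON-lines layout that
# NeMo's ASR data loaders expect (one JSON object per line).
import json

manifest_entry = {
    "audio_filepath": "/data/word_4000h/wav/utt_000001.wav",  # hypothetical path
    "duration": 4.32,                                         # length in seconds
    "text": "提起华华家的事情村民们声声长叹",
}

with open("train.json", "a", encoding="utf-8") as f:
    f.write(json.dumps(manifest_entry, ensure_ascii=False) + "\n")
```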

vsl9 commented 4 years ago

Hi @SweetFov, thanks for your interest in NeMo and QuartzNet.

  1. What about the train loss? First of all, I'd suggest trying to keep the train loss going down consistently.
  2. Usually, we train with larger batch sizes (and learning rates). How long are the utterances in your dataset? We keep them no longer than 16.7 seconds so that larger batches fit in GPU memory (with max_duration: 16.7 in the YAML file); a quick duration check is sketched after this list.
  3. I see that you use a custom vocabulary. Is it different from the AISHELL-2 vocab? Can you please double-check that the vocabulary is correct and that normalize_transcripts: False is set in the YAML file?
  4. Have you tried removing warmup completely?
  5. It is possible to do transfer learning from a pre-trained AISHELL-2 model to your custom dataset. In that case, if the vocabularies are different, you can take only the pre-trained encoder, randomly initialize the decoder, and train with a smaller initial learning rate. @Slyne, any other suggestions?
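
For point 2 above, a minimal duration check can be sketched like this (assuming the standard NeMo JSON-lines manifest with a duration field; the manifest path is taken from the command earlier in the thread and the 16.7 s cut-off from the comment):

```python
# Count utterances in the training manifest that exceed the max_duration
# cut-off used for the AISHELL-2 QuartzNet config (16.7 s per the comment above).
import json

MAX_DURATION = 16.7  # seconds

too_long = []
with open("./word_4000h/lists/train.json", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        if entry["duration"] > MAX_DURATION:
            too_long.append((entry["audio_filepath"], entry["duration"]))

print(f"{len(too_long)} utterances exceed {MAX_DURATION} s")
for path, dur in sorted(too_long, key=lambda x: -x[1])[:10]:
    print(f"{dur:7.2f} s  {path}")
```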

Slyne commented 4 years ago

It seems like you use a different vocab.txt... Could you try using the original vocab.txt and see if the loss decreases? If you want to keep your own vocabulary file, you can just refer to point 5 mentioned by @vsl9.
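
If keeping the custom vocabulary, a rough diagnostic along these lines can show whether any transcript characters are missing from it (an assumption here: the vocab file is character-level with one token per line, and the manifest is NeMo-style JSON lines with a text field; both paths come from the command earlier in the thread):

```python
# List characters that occur in the training transcripts but not in the custom
# vocabulary; such characters cannot be produced by the model, so it is worth
# confirming the vocabulary actually covers the transcripts.
import json

with open("./word_4000h/am/token_dev_train_4400.txt", encoding="utf-8") as f:
    vocab = {line.strip() for line in f if line.strip()}

missing = set()
with open("./word_4000h/lists/train.json", encoding="utf-8") as f:
    for line in f:
        text = json.loads(line)["text"]
        missing.update(ch for ch in text if ch not in vocab and not ch.isspace())

print(f"{len(missing)} transcript characters are not in the vocabulary:")
print("".join(sorted(missing)))
```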

okuchaiev commented 4 years ago

Closing, as this is related to the old version.

twmht commented 1 year ago

Same issue with the current version of NeMo.

I kept the encoder fixed (frozen) and only fine-tuned the decoder, with the new vocabulary set.

Why can't we use a new vocabulary set?
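
For what it's worth, with the current NeMo (1.x) Python API, the frozen-encoder, new-vocabulary setup described above can be sketched roughly as follows. The pretrained model name is a placeholder assumption, and the exact data-config keys may differ between NeMo versions:

```python
# Rough sketch (not an official recipe): load a pretrained CTC model, swap in a
# new character vocabulary (which re-initializes the decoder), freeze the
# encoder, and fine-tune on the new manifests.
import nemo.collections.asr as nemo_asr

# "stt_zh_quartznet15x5" is a placeholder; use whatever pretrained checkpoint applies.
model = nemo_asr.models.EncDecCTCModel.from_pretrained("stt_zh_quartznet15x5")

# New character vocabulary, one token per line (path taken from the original command).
with open("./word_4000h/am/token_dev_train_4400.txt", encoding="utf-8") as f:
    new_vocab = [line.strip() for line in f if line.strip()]

# change_vocabulary replaces the CTC decoder with a randomly initialized one
# sized for the new label set.
model.change_vocabulary(new_vocabulary=new_vocab)

# Keep the encoder fixed so only the new decoder is trained at first.
model.encoder.freeze()

# Point the model at the new data; key names follow the usual char-CTC config.
model.setup_training_data(train_data_config={
    "manifest_filepath": "./word_4000h/lists/train.json",
    "sample_rate": 16000,
    "labels": new_vocab,
    "batch_size": 16,
    "shuffle": True,
})
model.setup_validation_data(val_data_config={
    "manifest_filepath": "./word_4000h/lists/dev_small.json",
    "sample_rate": 16000,
    "labels": new_vocab,
    "batch_size": 16,
    "shuffle": False,
})

# From here, attach a PyTorch Lightning Trainer (NeMo models are LightningModules)
# and call trainer.fit(model) with a small initial learning rate.
```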