Wav2vec 2.0 fine-tuning with language model code doesn't runs

amant555 commented 4 years ago

I am running the command to fine tune the model given in readme. But the execution stops automatically without any error. when ran without lm model. WER stays at 100.

using KenLM model (4-gram)
command: python train.py --distributed-world-size 1 --distributed-port 0 /content/ --save-dir /content/wav2vec2exp/finetunning/ontop --fp16 \ --post-process letter --valid-subset valid --no-epoch-checkpoints --best-checkpoint-metric wer --num-workers 32 \ --max-update 80000 --sentence-avg --task audio_pretraining --arch wav2vec_ctc --w2v-path /content/wav2vec2exp/pretraining/ontop/checkpoint_best.pt \ --labels ltr --apply-mask --mask-selection static --mask-other 0 --mask-length 10 --mask-prob 0.5 --layerdrop 0.1 \ --mask-channel-selection static --mask-channel-other 0 --mask-channel-length 64 --mask-channel-prob 0.5 --zero-infinity \ --feature-grad-mult 0.0 --freeze-finetune-updates 10000 --validate-after-updates 10000 --optimizer adam \ --adam-betas '(0.9, 0.98)' --adam-eps 1e-08 --lr 2e-05 --lr-scheduler tri_stage --warmup-steps 8000 --hold-steps 32000 \ --decay-steps 40000 --final-lr-scale 0.05 --final-dropout 0.0 --dropout 0.0 --activation-dropout 0.1 --criterion ctc \ --attention-dropout 0.0 --max-tokens 1280000 --seed 2337 --reset-optimizer --log-format json --log-interval 500 --ddp-backend no_c10d

result:

2020-09-04 12:04:04 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.1, adam_betas='(0.9, 0.98)', adam_eps=1e-08, all_gather_list_size=16384, apply_mask=True, arch='wav2vec_ctc', attention_dropout=0.0, best_checkpoint_metric='wer', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='ctc', curriculum=0, data='/content/', data_buffer_size=10, dataset_impl=None, ddp_backend='no_c10d', decay_steps=40000, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=0, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', dropout=0.0, dropout_input=0, empty_cache_freq=0, enable_padding=False, fast_stat_sync=False, feature_grad_mult=0.0, final_dropout=0.0, final_lr_scale=0.05, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, freeze_finetune_updates=10000, hold_steps=32000, init_lr_scale=0.01, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, labels='ltr', layerdrop=0.1, localsgd_frequency=3, log_format='json', log_interval=500, lr=[2e-05], lr_scheduler='tri_stage', mask_channel_length=64, mask_channel_other=0.0, mask_channel_prob=0.5, mask_channel_selection='static', mask_length=10, mask_other=0.0, mask_prob=0.5, mask_selection='static', max_epoch=0, max_sample_size=None, max_sentences=None, max_sentences_valid=None, max_tokens=1280000, max_tokens_valid=1280000, max_update=80000, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, min_sample_size=None, model_parallel_size=1, no_epoch_checkpoints=True, no_last_checkpoints=False, no_mask_channel_overlap=False, no_mask_overlap=False, no_pretrained_weights=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_seed_provided=False, normalize=False, nprocs_per_node=1, num_workers=32, optimizer='adam', optimizer_overrides='{}', patience=-1, profile=False, quantization_config_path=None, remove_bpe='letter', required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=True, restore_file='checkpoint_last.pt', sample_rate=16000, save_dir='/content/wav2vec2exp/finetunning/ontop', save_interval=1, save_interval_updates=0, scoring='bleu', seed=2337, sentence_avg=True, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, stop_time_hours=0, task='audio_pretraining', tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, tpu=False, train_subset='train', update_freq=[1], use_bmuf=False, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=10000, validate_interval=1, validate_interval_updates=0, w2v_path='/content/wav2vec2exp/pretraining/ontop/checkpoint_best.pt', warmup_steps=8000, weight_decay=0.0, wer_args='("/content/language_model.bin","/content/lexicon.lst",2,-1)', zero_infinity=True, zero_sharding='none')
2020-09-04 12:04:04 | INFO | fairseq.data.audio.raw_audio_dataset | loaded 153, skipped 0 samples

please let me know if you spot any mistakes here.

lematt1991 commented 4 years ago

CC @alexeib

alexeib commented 4 years ago

can you try decoding your previous checkpoint (even if it gets 100 wer) with this 4gram lm using infer.py as in the example and see if it gives you any better error messages?

amant555 commented 4 years ago

This is the output that I got when I ran infer.py and the WER stays at 100 when I tried fine-tuning with 1hr, 2hrs and 3hrs but its doesn't goes down. It moves up and come down to 100 but never goes below that. The behaviour for infer is same as in the train the execution ends without any error. I can share the lexicon and dict file but I made them as instructed in some of the previous issues.

INFO:__main__:Namespace(all_gather_list_size=16384, beam=5, beam_size_token=100, beam_threshold=25.0, bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_suffix='', constraints=None, cpu=False, criterion='ctc', data='/content', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoding_format=None, device_id=0, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=None, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', diverse_beam_groups=-1, diverse_beam_strength=0.5, diversity_rate=-1.0, dump_emissions=None, dump_features=None, empty_cache_freq=0, enable_padding=False, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='valid', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, iter_decode_with_beam=1, iter_decode_with_external_reranker=False, kenlm_model='/content/language_model.bin', kspmodel=None, labels='ltr', lenpen=1, lexicon='/content/lexicon.lst', lm_weight=2.0, load_emissions=None, localsgd_frequency=3, log_format=None, log_interval=100, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sample_size=None, max_sentences=None, max_tokens=4000000, memory_efficient_bf16=False, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, min_sample_size=None, model_overrides='{}', model_parallel_size=1, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, no_seed_provided=True, normalize=False, nprocs_per_node=1, num_shards=1, num_workers=1, optimizer=None, path='/content/wav2vec2exp/finetunning/ontop/checkpoint_best.pt', pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=None, pipeline_devices=None, pipeline_model_parallel=False, prefix_size=0, print_alignment=False, print_step=False, profile=False, quantization_config_path=None, quiet=False, remove_bpe='letter', replace_unk=None, required_batch_size_multiple=8, results_path='/content/test', retain_dropout=False, retain_dropout_modules=None, retain_iter_history=False, rnnt_decoding_type='greedy', rnnt_len_penalty=-0.5, sacrebleu=False, sample_rate=16000, sampling=False, sampling_topk=-1, sampling_topp=-1.0, score_reference=False, scoring='bleu', seed=1, shard_id=0, sil_weight=0.0, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, task='audio_pretraining', temperature=1.0, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, tpu=False, unit_lm=False, unk_weight=-inf, unkpen=0, unnormalized=False, user_dir=None, w2l_decoder='kenlm', warmup_updates=0, wer_args=None, wfstlm=None, word_score=-1.0, zero_infinity=False, zero_sharding='none')
INFO:fairseq.data.audio.raw_audio_dataset:loaded 153, skipped 0 samples
INFO:__main__:| /content valid 153 examples
INFO:__main__:| decoding with criterion ctc
INFO:__main__:| loading model(s) from /content/wav2vec2exp/finetunning/ontop/checkpoint_best.pt

alexeib commented 4 years ago

sorry i am not sure how i can help you here. are you trying to finetune a model that has already been finetuned (i.e. it is not the "no finetuning model")? if so you need to reset a bunch of other things as well, most importantly --reset-lr-scheduler, maybe also --reset-meters

ezerhouni commented 4 years ago

I am facing the same issue. I am trying to fine-tuned a pre-trained model (Wav2Vec 2.0 Base -- No Finetuning) with 1h of Libri-light (i.e trying to replicate the experiment from the paper), but the WER stays at 100.

python3 path/to/fairseq/train.py \ --distributed-world-size 6 /path/to/libri-light/1h \ --save-dir save/dir \ --fp16 \ --wer-args '("path/lm_librispeech_kenlm_word_4g_200kvocab.bin","path/librispeech_lexicon.lst",2,-1)' \ --post-process letter \ --valid-subset valid \ --no-epoch-checkpoints \ --best-checkpoint-metric wer \ --num-workers 4 \ --max-update 13000 \ --sentence-avg \ --task audio_pretraining \ --arch wav2vec_ctc \ --w2v-path path/wav2vec_small.pt \ --labels ltr \ --apply-mask \ --mask-selection static \ --mask-other 0 \ --mask-length 10 \ --mask-prob 0.65 \ --layerdrop 0.1 \ --mask-channel-selection static \ --mask-channel-other 0 \ --mask-channel-length 64 \ --mask-channel-prob 0.256 \ --zero-infinity \ --feature-grad-mult 0.0 \ --freeze-finetune-updates 10000 \ --validate-after-updates 10000 \ --optimizer adam \ --adam-betas '(0.9, 0.98)' \ --adam-eps 1e-08 \ --lr 2e-05 \ --lr-scheduler tri_stage \ --warmup-steps 1300 \ --hold-steps 5200 \ --decay-steps 6500 \ --final-lr-scale 0.05 \ --final-dropout 0.0 \ --dropout 0.0 \ --activation-dropout 0.1 \ --criterion ctc \ --attention-dropout 0.0 \ --max-tokens 1280000 \ --seed 2337 \ --log-format json \ --log-interval 500 \ --ddp-backend no_c10d

Let me know if you see something wrong. Thank you for your help

alexeib commented 4 years ago

how about "raw_wer" or "uer"? if those are not 100 then something is wrong with your lexicon or lm (are they upper cased for example?)

amant555 commented 4 years ago

sorry i am not sure how i can help you here. are you trying to finetune a model that has already been finetuned (i.e. it is not the "no finetuning model")? if so you need to reset a bunch of other things as well, most importantly --reset-lr-scheduler, maybe also --reset-meters

I am fine tuning a model from scratch i.e (model without fine-tuning). The lexicon and everything that I have created were put in the upper case already. But the wer remain at 100.

Is it possible if I am training for a language other than English whose ASCII values are far higher than English and that's causing problems in fine-tuning of model. If that's the case what approach shall I try? But that still don't explain the reson for the failure in fine-tuning with language model.

alexeib commented 4 years ago

so wait, you are finetuning for a language other than english, using a model pretrained on english audio books? if so, then it is not too surprising that this does not work well...

amant555 commented 4 years ago

No I trained my model on my language. I used the already trained model as a base and then started training model again on my language. It gave a boost in learning and when the loss stopped decreasing and accuracy reached around 90's. I started fine-tuning. I trained a model from scratch too. Both have same problem with fine-tuning.

I can share the files of lexicon and the text file I used to create lm.arpa and lm.bin. And I made sure that files like labels are also created correctly. Space seperated and and each word ending with |.

ezerhouni commented 4 years ago

I have open a different issue (https://github.com/pytorch/fairseq/issues/2685) since my issue sems different (I am fine tuning on english)

alexeib commented 4 years ago

accuracy reaching 90% sounds suspicious. we get accuracies around 60-70% with our best and biggest models at the moment (unless you are not using quantized targets?) maybe something is wrong with your pretraining hyperparams

amant555 commented 4 years ago

It did stuck at 60-70 range. At that time didn't had any idea that that's the max it would reach so I changed some params in pre-training after a long time it started increasing till 90.

I am fine tuning on 10hrs of data. Can the reason be the batch size or the less amount of data in fine-tuning. Or any other param in fine-tuning

I tried fine-tuning at points where accuracy were: 50, 60, 68... But wer was always at 100.

alexeib commented 4 years ago

so wait, you modified hyperparams to make the pretraining task easier, and now you are taking earlier checkpoints of that same run where it has not yet converged? that is unlikely to work. can you maybe try pretraining with the original hyper params? also your audio is single channel 16khz right?

amant555 commented 4 years ago

Yes all the files are in single channel and 16khz format with length of 15-30 sec. And I used the original params till it reached 58.9 to be exact. After that for next 100 epochs it remained same then I saved that point and as I didn't know to which point I should train, I started training again with changed params.

Meanwhile I used the 58.9 one that I trained using original params to fine-tuning without LM. It remained at 100. And then I used LM and code ended without any specific error. I might be able to find and correct the failure of fine tuning with lm. But I have no clue why WER is not going down from 100.

alexeib commented 4 years ago

i see. i am not sure why it doesnt work as our experiments in XLSR paper showed that the same models with same params can work well on other languages. i'm afraid i cant help much as this would require debugging with your particular dataset. i would also try to figure out why training with lm doesnt work - which again requires your specific setup.

amant555 commented 4 years ago

Can you help out with 3 questions these may help me to formulate a plan to move forward.

At around what time your team used wav2letter python bindings. I can start by cloning and installing those binding. To start debugging the fine tuning with LM part.
Lexicon are not required in case of kenlm decoder right? If they are, then are they the pair of words in vocab of fine-tuning and each character seperated by space?
To debug my dataset if I train from scratch on a small set of 20 hrs on pre-training and 2hrs on fine-tuning. Will I get some results? Otherwise will use whole dataset.

alexeib commented 4 years ago

i used commit 9e2c69f which was from feb 25. however, it should work with the latest version from master as well.
lexicon in theory is not required but the current fairseq wav2letter integration assumes that you have it. you would have to modify the code and use LexiconFreeDecoder instead
20hr pretraining seems very little. a better strategy is to use the whole dataset but cap the number of updates that you do so you can iterate quickly. i would train at least 100k updates on at least 16 gpus to get meaningful results with 2h of labels (although this may not be enough depending on the dataset etc)

wahyubram82 commented 4 years ago

I am facing the same issue. I am trying to fine-tuned a pre-trained model (Wav2Vec 2.0 Base -- No Finetuning) with 1h of Libri-light (i.e trying to replicate the experiment from the paper), but the WER stays at 100.

python3 path/to/fairseq/train.py \ --distributed-world-size 6 /path/to/libri-light/1h \ --save-dir save/dir \ --fp16 \ --wer-args '("path/lm_librispeech_kenlm_word_4g_200kvocab.bin","path/librispeech_lexicon.lst",2,-1)' \ --post-process letter \ --valid-subset valid \ --no-epoch-checkpoints \ --best-checkpoint-metric wer \ --num-workers 4 \ --max-update 13000 \ --sentence-avg \ --task audio_pretraining \ --arch wav2vec_ctc \ --w2v-path path/wav2vec_small.pt \ --labels ltr \ --apply-mask \ --mask-selection static \ --mask-other 0 \ --mask-length 10 \ --mask-prob 0.65 \ --layerdrop 0.1 \ --mask-channel-selection static \ --mask-channel-other 0 \ --mask-channel-length 64 \ --mask-channel-prob 0.256 \ --zero-infinity \ --feature-grad-mult 0.0 \ --freeze-finetune-updates 10000 \ --validate-after-updates 10000 \ --optimizer adam \ --adam-betas '(0.9, 0.98)' \ --adam-eps 1e-08 \ --lr 2e-05 \ --lr-scheduler tri_stage \ --warmup-steps 1300 \ --hold-steps 5200 \ --decay-steps 6500 \ --final-lr-scale 0.05 \ --final-dropout 0.0 \ --dropout 0.0 \ --activation-dropout 0.1 \ --criterion ctc \ --attention-dropout 0.0 \ --max-tokens 1280000 \ --seed 2337 \ --log-format json \ --log-interval 500 \ --ddp-backend no_c10d

Let me know if you see something wrong. Thank you for your help

how to create the lexicon file (path/librispeech_lexicon.lst). I use my own dataset not the libri...

alexeib commented 4 years ago

go through your lm or dataset and build a similar lexicon as the one for librispeech. e.g. for letter targets you have each word, tab, and then letters separated by space with a word boundary token at the end e.g. HELLO \ H E L L O | ...

facebookresearch / fairseq

Wav2vec 2.0 fine-tuning with language model code doesn't runs #2573