❓ Questions and Help
What is your question?
I hope to fine-tune the xm_transformer_unity pre-trained model so that its mBART decoder can recognize some new words. I followed https://github.com/facebookresearch/fairseq/blob/ust/examples/speech_to_speech/docs/enhanced_direct_s2st_discrete_units.md. But is it normal to have such a high loss value during training? The loss reached 50 and the multitask loss reached 1000. Here are some console outputs:

```
2024-08-07 15:20:51 | INFO | dev | epoch 163 | valid on 'dev' subset | loss 50.401 | nll_loss 18.932 | multitask_target_letter_loss 918.037 | ppl 500238 | wps 0 | wpb 470 | bsz 2 | multitask_target_letter_loss_weight 8 | num_updates 652
2024-08-07 15:20:51 | INFO | fairseq_cli.train | end of epoch 163 (average epoch stats below)
2024-08-07 15:20:51 | INFO | train | epoch 163 | loss 65.175 | nll_loss 19.787 | total None | n_correct None | multitask_target_letter_loss 1303.92 | ppl 904912 | wps 791.8 | ups 0.55 | wpb 1440 | bsz 6.2 | num_updates 652 | multitask_target_letter_loss_weight 8 | lr 7.066e-08 | gnorm 1667.83 | clip 100 | loss_scale None | train_wall 7 | gb_free None | cuda_gb_allocated 16.9 | cuda_gb_reserved 22.1 | cuda_gb_free 22.5 | wall 0
```
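One thing I did notice (plain Python below, nothing fairseq-specific): the logged ppl values follow directly from the nll_loss values, since fairseq reports perplexity as 2 to the per-token nll_loss:

```python
# fairseq reports ppl = 2 ** nll_loss (per-token, base 2), so the huge
# perplexities are just a restatement of nll_loss ~19, not a separate issue.
for nll, logged_ppl in ((18.932, 500238), (19.787, 904912)):
    print(f"nll_loss={nll}: 2**nll_loss = {2 ** nll:,.0f} (logged ppl {logged_ppl:,})")
```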
Additionally, I also encountered some problems like

```
AssertionError: Optimizer does not match; please reset the optimizer (--reset-optimizer). FP16Optimizer vs FairseqAdam
```

and

```
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: The size of tensor a (755724736) must match the size of tensor b (931956160) at non-singleton dimension 0
```

and gradient overflow. I guess there may be some problems in my custom datasets? But when I don't use `--fp16` in the running command, it works. In fact, I'm not sure if my steps are correct, so I hope to get some help. Thank you!
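Here is how I inspected the saved optimizer state (my own snippet; I'm assuming the checkpoint uses fairseq's usual `last_optimizer_state` layout, and that under `--fp16` the FP16Optimizer stores statistics for flattened fp32 buffers rather than per-parameter tensors, which would explain the size mismatch above):

```python
import torch

# Peek at the optimizer state saved in the checkpoint. If it was written
# by FP16Optimizer, the stored exp_avg sizes will not line up with the
# per-parameter tensors that a plain fp32 FairseqAdam run creates.
ckpt = torch.load("/root/autodl-tmp/code/trained_model/checkpoint_last.pt",
                  map_location="cpu")
state = ckpt.get("last_optimizer_state", {}).get("state", {})
for param_id, s in state.items():
    if "exp_avg" in s:
        print(param_id, s["exp_avg"].numel())
```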
Code
This is my command:
```bash
fairseq-train /home/s2ut/FormattingData/DATA_ROOT \
  --config-yaml /home/s2ut/FormattingData/DATA_ROOT/config.yaml \
  --multitask-config-yaml /home/s2ut/FormattingData/DATA_ROOT/multitask_config.yaml \
  --task speech_to_text --arch xm_transformer_t2 \
  --criterion speech_to_unit_translatotron2 --label-smoothing 0.1 \
  --share-decoder-input-output-embed --adaptor-n-layers 1 --normalize \
  --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
  --train-subset train --valid-subset dev \
  --load-pretrained-decoder-from /root/autodl-tmp/code/trained_model/checkpoint_last.pt \
  --w2v-path /root/autodl-tmp/code/trained_model/checkpoint_last.pt \
  --mask-prob 0.3 --mask-channel-length 32 --mask-channel-prob 0.25 \
  --save-dir /root/autodl-tmp/code/trained_model --checkpoint-activations --encoder-proj \
  --lr 0.00000001 --dropout 0.1 --attention-dropout 0.1 --lr-scheduler inverse_sqrt \
  --warmup-init-lr 1e-7 --warmup-updates 2000 \
  --optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 10.0 \
  --max-update 80000 --max-tokens 5000 --max-tokens-valid 5000 --max-source-positions 5000 \
  --max-target-positions 5000 --update-freq 1 \
  --seed 1234 --num-workers 1 \
  --reset-dataloader --reset-optimizer --batch-size 16 --max-epoch 1000 --save-interval 1000
```
What have you tried?
First, I prepare the manifest file with:

```bash
python examples/wav2vec/wav2vec_manifest.py /home/s2ut/TGT_AUDIO/train \
  --dest /home/s2ut/TGT_AUDIO/train --ext wav --valid-percent 0
```
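As a sanity check I verify the resulting train.tsv (my own snippet, assuming soundfile is installed; wav2vec_manifest.py writes the audio root on the first line and then one `<relative path>\t<num samples>` entry per file):

```python
import os
import soundfile as sf

# Verify each manifest entry: the file exists and the stored sample
# count matches the actual wav length.
with open("/home/s2ut/TGT_AUDIO/train/train.tsv") as f:
    root = f.readline().strip()
    for line in f:
        rel_path, n_samples = line.rstrip("\n").split("\t")
        info = sf.info(os.path.join(root, rel_path))
        assert info.frames == int(n_samples), f"mismatch in {rel_path}"
```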
Second, I extract units with the mhubert_base_vp_en_es_fr_it3_L11_km1000 model released in https://github.com/facebookresearch/fairseq/blob/ust/examples/speech_to_speech/docs/textless_s2st_real_data.md:

```bash
python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
  --feature_type hubert \
  --kmeans_model_path /home/s2ut/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin \
  --acoustic_model_path /home/s2ut/mhubert_base_vp_en_es_fr_it3.pt \
  --layer 11 \
  --manifest_path /home/s2ut/TGT_AUDIO/train/train.tsv \
  --out_quantized_file_path /home/s2ut/TGT_AUDIO/train.txt \
  --extension ".wav"
```
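Each line of the resulting train.txt should, if I read the script correctly, look like `<utterance>|<space-separated unit ids>`, so I sanity-check (my own snippet, relying on that assumed layout) that all units fall in the km1000 range:

```python
# Assuming each line of train.txt is "<utt>|<unit ids>", make sure all
# unit ids fall in [0, 1000) for the km1000 quantizer.
with open("/home/s2ut/TGT_AUDIO/train.txt") as f:
    for line in f:
        _, units = line.rstrip("\n").split("|")
        assert all(0 <= int(u) < 1000 for u in units.split()), line
```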
Then, I format the data with:

```bash
python examples/speech_to_speech/preprocessing/prep_s2ut_data.py \
  --source-dir /home/s2ut/SRC_AUDIO --target-dir /home/s2ut/TGT_AUDIO \
  --data-split train dev --output-root /home/s2ut/FormattingData/DATA_ROOT \
  --reduce-unit --vocoder-checkpoint /home/s2ut/g_00500000 \
  --vocoder-cfg /home/s2ut/vocoder_code_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_config.json
```

to get a config.yaml.
My task data format is like the following:

```
id audio n_frames tgt_text tgt_n_frames
26 /home/s2ut/SRC_AUDIO/train/26.wav 547 864 497 248
16 /home/s2ut/SRC_AUDIO/train/16.wav 445 39 6 54 192 232
```
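To double-check what prep_s2ut_data.py wrote into the TSV, I load it with csv (my own snippet; I'm assuming the split file lands in DATA_ROOT as train.tsv):

```python
import csv

# Print each example's columns so I can confirm which numbers belong to
# tgt_text and which to tgt_n_frames.
with open("/home/s2ut/FormattingData/DATA_ROOT/train.tsv") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        print(row["id"], row["n_frames"], row["tgt_text"], row["tgt_n_frames"])
```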
I used BPE to generate subwords, looked the subword IDs up in en_zh_spm.dict, and wrote these tokens into the tgt_text of the multitask data. My multitask data format is like the following:

```
id tgt_text
26 3476765 2692239 80799 68322236
16 36544 38935 372148
```
To recognize the new words, I replaced some of the original words in the dict. May I ask whether this tgt_text should be text or token IDs?
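To check whether the IDs I wrote actually exist in the dictionary, I use this small helper of mine (fairseq dictionaries store one `<symbol> <count>` pair per line; the sample line is taken from my data above):

```python
# Load the fairseq dictionary: one "<symbol> <count>" pair per line.
with open("/home/s2ut/FormattingData/DATA_ROOT/en_zh_spm.dict") as f:
    symbols = {line.split()[0] for line in f if line.strip()}

# Check a multitask tgt_text entry token by token.
sample = "3476765 2692239 80799 68322236"
for tok in sample.split():
    print(tok, "in dict" if tok in symbols else "NOT in dict")
```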
My task file:

```yaml
input_channels: 1
input_feat_per_channel: 80
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
  time_wrap_W: 0
transforms:
  '*':
```
My multitask file:

```yaml
target_letter:
  target_type: text
  decoder_type: transformer
  encoder_layer: 1
  loss_weight: 8.0
  prepend_bos_and_append_tgt_lang_tag: true
  eos_token: "[en_XX]"
  rdrop_alpha: 10.0
  data: /home/s2ut/FormattingData/DATA_ROOT/target_letter
  tgt_lang:
  src_lang:
  dict: /home/s2ut/FormattingData/DATA_ROOT/en_zh_spm.dict
standardize_audio: true
use_audio_input: true
apply_ucmvn: true
```
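Because I wrote both YAML files by hand, I also parse them with PyYAML to make sure the nesting is what I think it is (my own check, nothing fairseq-specific):

```python
import yaml

# Dump the parsed structure of both configs to eyeball the nesting.
for path in ("/home/s2ut/FormattingData/DATA_ROOT/config.yaml",
             "/home/s2ut/FormattingData/DATA_ROOT/multitask_config.yaml"):
    with open(path) as f:
        print(path)
        print(yaml.safe_load(f))
```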
What's your environment?
I use Python 3.8 and fairseq 0.12.0 (ust branch) on Linux.