The loss value is extremely high when fine-tuning the xm_transformer_unity model. Is there a wrong step?

❓ Questions and Help

What is your question?

Additionally, I also encountered some problems like AssertionError: Optimizer does not match; please reset the optimizer (--reset-optimizer). FP16Optimizer vs FairseqAdam and exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1) RuntimeError: The size of tensor a (755724736) must match the size of tensor b (931956160) at non-singleton dimension 0 and gradient overflow. I guess maybe there are some problems in my custom datasets? But when I dont use the--fp16` in the running command it works.

In fact, I'm not if my steps are correct. So I hope to seek some help. Thank you!

Code

This is my command： fairseq-train /home/s2ut/FormattingData/DATA_ROOT \ --config-yaml/home/s2ut/FormattingData/DATA_ROOT/config.yaml \ --multitask-config-yaml /home/s2ut/FormattingData/DATA_ROOT/multitask_config.yaml \ --task speech_to_text --arch xm_transformer_t2 \ --criterion speech_to_unit_translatotron2 --label-smoothing 0.1 \ --share-decoder-input-output-embed --adaptor-n-layers 1 --normalize \ --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \ --train-subset train --valid-subset dev \ --load-pretrained-decoder-from /root/autodl-tmp/code/trained_model/checkpoint_last.pt --w2v-path /root/autodl-tmp/code/trained_model/checkpoint_last.pt \ --mask-prob 0.3 --mask-channel-length 32 --mask-channel-prob 0.25 \ --save-dir /root/autodl-tmp/code/trained_model --checkpoint-activations --encoder-proj \ --lr 0.00000001 --dropout 0.1 --attention-dropout 0.1 --lr-scheduler inverse_sqrt \ --warmup-init-lr 1e-7 --warmup-updates 2000 \ --optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 10.0 \ --max-update 80000 --max-tokens 5000 --max-tokens-valid 5000 --max-source-positions 5000 \ --max-target-positions 5000 --update-freq 1 \ --seed 1234 --num-workers 1 \ --reset-dataloader --reset-optimizer --batch-size 16 --max-epoch 1000 --save-interval 1000

What have you tried?

Firstly, I prepare manifest file by python examples/wav2vec/wav2vec_manifest.py /home/s2ut/TGT_AUDIO/train --dest /home/s2ut/TGT_AUDIO/train --ext wav --valid-percent 0

Secondly, I run the command python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py --feature_type hubert \ --kmeans_model_path /home/s2ut/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin --acoustic_model_path /home/s2ut/mhubert_base_vp_en_es_fr_it3.pt \ --layer 11 --manifest_path /home/s2ut/TGT_AUDIO/train/train.tsv \ --out_quantized_file_path /home/s2ut/TGT_AUDIO/train.txt --extension ".wav" to extract units by mhubert_base_vp_en_es_fr_it3_L11_km1000 released in https://github.com/facebookresearch/fairseq/blob/ust/examples/speech_to_speech/docs/textless_s2st_real_data.md.

Then, I formate data by python examples/speech_to_speech/preprocessing/prep_s2ut_data.py \ --source-dir /home/s2ut/SRC_AUDIO --target-dir /home/s2ut/TGT_AUDIO \ --data-split train dev --output-root /home/s2ut/FormattingData/DATA_ROOT \ --reduce-unit --vocoder-checkpoint /home/s2ut/g_00500000 \ --vocoder-cfg /home/s2ut/vocoder_code_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_config.json to get a config.yaml

My task data format is like the following： id audio n_frames tgt_text tgt_n_frames 26 /home/s2ut/SRC_AUDIO/train/26.wav 547 864 497 248 16 /home/s2ut/SRC_AUDIO/train/16.wav 445 39 6 54 192 232

I used bpe to generate subwords and find the subwords'id in en_zh_spm.dict, and I write these tokens in tgt_text of multitask My multitask data format is like the following：

id tgt_text 26 3476765 2692239 80799 68322236 16 36544 38935 372148

To recongize new words, so I replace some original words in the dict. May I ask if this tgt_text should be texts or tokens?

My task file: input_channels: 1 input_feat_per_channel: 80 specaugment: freq_mask_F: 27 freq_mask_N: 1 time_mask_N: 1 time_mask_T: 100 time_mask_p: 1.0 time_wrap_W: 0 transforms: '*':

utterance_cmvn _train:
utterance_cmvn
specaugment vocoder: checkpoint: /home/s2ut/g_00500000 config: /home/s2ut/vocoder_code_hifigan_hubert_base_100_lj_config.json type: code_hifigan decoder_type: transformer decoder_layer: 2 encoder_layer: 1 loss_weight: 8.0 prepend_bos_and_append_tgt_lang_tag: true eos_token: rdrop_alpha: 10.0 tgt_lang: dict: /home/s2ut/FormattingData/DATA_ROOT/dict.txt standardize_audio: true use_audio_input: false apply_ucmvn: true

My multitask file: target_letter: target_type: text decoder_type: transformer encoder_layer: 1 loss_weight: 8.0 prepend_bos_and_append_tgt_lang_tag: true eos_token: "[en_XX]" rdrop_alpha: 10.0 data: /home/s2ut/FormattingData/DATA_ROOT/target_letter tgt_lang: src_lang: dict: /home/s2ut/FormattingData/DATA_ROOT/en_zh_spm.dict standardize_audio: true use_audio_input: true apply_ucmvn: true

What's your environment?

I use python 3.8, fairseq ust 0.12.0 on Linux

facebookresearch / fairseq