facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

hi #5408

Closed · elisonlau closed this issue 10 months ago

elisonlau commented 11 months ago

❓ Questions and Help

Hi @xuqiantong, @alexeib, @michaelauli, I was inspired by your recent paper "Simple and Effective Zero-shot Cross-lingual Phoneme Recognition" and "transferred" it to my own data to recognize Mandarin phonemes with tone. But the results are very bad: the WER (actually PER) is always over 60, and a lot of <unk>s appear. Below is a sample of the output.

[Screenshot: a sample of the decoding output]

Whether I use Wav2Vec 2.0 Large or wav2vec 2.0 (XLSR) as the pre-trained model, the result is similar.

My question: is there something wrong with my procedure, or is Chinese tone classification simply this hard no matter which wav2vec 2.0 pre-trained model is used?

What have you tried?

I'll try to be pretty detailed about my setup:

I created the .tsv, .phn, and dict.phn.txt files following the steps @alexeib describes in https://github.com/pytorch/fairseq/issues/2922#issuecomment-731308318, using the wav2vec_manifest.py script (the manifest step is sketched below). The picture below shows a piece of dict.phn.txt.

[Screenshot: excerpt of dict.phn.txt]
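
For reference, the manifest step looked roughly like this (a minimal sketch: $AUDIO_DIR, the wav extension, and the 5% validation split stand in for my actual setup):

# Build train/valid .tsv manifests from a directory of audio files.
# $AUDIO_DIR is a placeholder for the directory holding my wav files.
python3 $FAIRSEQ_PATH/examples/wav2vec/wav2vec_manifest.py $AUDIO_DIR \
    --dest $DATASET \
    --ext wav \
    --valid-percent 0.05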

I was able to fine-tune the model using this training command:

python3 $FAIRSEQ_PATH/fairseq_cli/hydra_train.py \
    distributed_training.distributed_port=0 \
    task.labels=phn \
    task.data=$DATASET \
    dataset.valid_subset=$valid_subset \
    distributed_training.distributed_world_size=1 \
    model.w2v_path=$model_path \
    hydra.run.dir=/content/drive/MyDrive/outputs \
    +restore_file=/content/drive/MyDrive/outputs/checkpoints/checkpoint_last.pt \
    --config-dir $config_dir \
    --config-name $config_name

The config file is based on base_10h.yaml, with some simple modifications such as max_tokens: 1000000 and max_update: 50000 (equivalent command-line overrides are sketched below). I have only a single A100 GPU with 40 GB of memory, so I don't use distributed training.
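
The same values can also be set as hydra overrides on the training command instead of editing the YAML (a sketch; I'm assuming the dataset.max_tokens and optimization.max_update keys from the stock base_10h.yaml):

# Hydra overrides equivalent to my config edits; key names assumed
# from the stock base_10h.yaml shipped with fairseq.
python3 $FAIRSEQ_PATH/fairseq_cli/hydra_train.py \
    --config-dir $config_dir \
    --config-name $config_name \
    task.data=$DATASET \
    task.labels=phn \
    model.w2v_path=$model_path \
    dataset.max_tokens=1000000 \
    optimization.max_update=50000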

Here is my command for running evaluation:

python3 $FAIRSEQ_PATH/examples/speech_recognition/infer.py $DATASET \
    --task audio_finetuning \
    --nbest 1 \
    --path /content/drive/MyDrive/outputs/checkpoints/checkpoint_best.pt \
    --gen-subset dev_other \
    --results-path $DATASET \
    --w2l-decoder viterbi \
    --criterion ctc \
    --labels phn \
    --max-tokens 1800000

That is my general procedure. I need your help! Thanks, and looking forward to your reply! Elison

xuqiantong commented 11 months ago

<unk>s in TARGET indicate that the target phonemes you provided are not always covered by the dictionary you used to train the model, i.e. the dict.phn.txt in your model or data dir. You have to make sure the target phonemes are properly mapped to the ones you used to train your model. A quick check is sketched below.
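
For example, you can diff the two phoneme inventories (a sketch; I'm assuming the usual layout with train.phn and dict.phn.txt under $DATASET):

# List target phonemes that are absent from the model dictionary;
# anything printed here will show up as <unk> in the output.
cut -d' ' -f1 $DATASET/dict.phn.txt | sort -u > dict_phones.txt
tr ' ' '\n' < $DATASET/train.phn | sed '/^$/d' | sort -u > target_phones.txt
comm -23 target_phones.txt dict_phones.txt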

elisonlau commented 10 months ago

> <unk>s in TARGET indicate that the target phonemes you provided are not always covered by the dictionary you used to train the model, i.e. the dict.phn.txt in your model or data dir. You have to make sure the target phonemes are properly mapped to the ones you used to train your model.


@xuqiantong thanks so much, that was such a stupid mistake on my part. With your pointer I corrected it and the fine-tuning succeeded. Furthermore, I find examples/speech_recognition/infer.py quite complex; from searching, it seems the transformers library could simplify this procedure, but transferring a model from fairseq to transformers looks difficult. Do you have any suggestions about that? If I must use transformers, I found that config.json is missing: how can I get this file?
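
From searching, the transformers repo seems to ship a conversion script for wav2vec 2.0 checkpoints that also writes out a config.json. Is something like this the right way to call it? (A sketch; the script path and flag names are my reading of the transformers source and may not be exact.)

# Convert the fairseq checkpoint to the transformers layout; the dump
# folder should end up containing the model weights plus config.json.
# (Script path and flag names are my reading of the transformers repo.)
python3 src/transformers/models/wav2vec2/convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py \
    --checkpoint_path /content/drive/MyDrive/outputs/checkpoints/checkpoint_best.pt \
    --dict_path $DATASET/dict.phn.txt \
    --pytorch_dump_folder_path ./wav2vec2-phoneme-hf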

Thanks, and looking forward to your reply!