facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

[Speech to speech translation with discrete units] produces almost the same audio for different test audio files during inference #5003

Open LaHeriody opened 1 year ago

LaHeriody commented 1 year ago

❓ Questions and Help

I followed the doc here to do speech-to-speech translation with discrete units. First, I prepared the target units using

python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
    --feature_type $TYPE \
    --kmeans_model_path $KM_MODEL_PATH \
    --acoustic_model_path $CKPT_PATH \
    --layer $LAYER \
    --manifest_path $MANIFEST \
    --out_quantized_file_path $OUT_QUANTIZED_FILE \
    --extension ".wav"
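
(For reference, $MANIFEST here is the wav2vec-style manifest, built with examples/wav2vec/wav2vec_manifest.py if I remember the GSLM speech2unit docs correctly: the first line is the audio root directory, and each following line is a relative path and the number of samples, tab-separated, roughly like

<audio_root_dir>
rel/path/audio1.wav	<num_samples>
rel/path/audio2.wav	<num_samples>
)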

and I get test.txt/train.txt/valid.txt files like this:

common_voice_zh-CN_19112438.mp3|71 71 93 82 11 45 64 37 37 86 68 68 16 74 27 47 5 5 30 30 70 70 52 25 25 11 45 64 74 27 21 95 95 23 53 53 62 29 28 28 28 87 24 46 30 30 70 70 70 52 52 52 48 48 51 51 19 19 19 19 66 60 27 63 47 76 58 58 58 65 74 27 21 21 95 95 95 45 45 45 45 64 64 64 65 3 3 77 15 15 15 15 15 41 41 84 63 96 20 20

Second, I run the script below:

DATA_ROOT=/home/lbh/re_split_dataset
SRC_AUDIO=${DATA_ROOT}/processed_zh_dataset
TGT_AUDIO=${DATA_ROOT}/processed_en_dataset
SPLIT1=train
SPLIT2=valid
SPLIT3=test
python /home/lbh/fairseq/examples/speech_to_speech/preprocessing/prep_s2ut_data.py \
  --source-dir $SRC_AUDIO --target-dir $TGT_AUDIO --data-split $SPLIT1 $SPLIT2 $SPLIT3 \
  --output-root $DATA_ROOT --reduce-unit \
  --vocoder-checkpoint $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG

and I get test.tsv/train.tsv/valid.tsv as shown below

id  src_audio   src_n_frames    tgt_audio   tgt_n_frames
common_voice_zh-CN_19112438.mp3 /home/lbh/re_split_dataset/processed_zh_dataset/test/common_voice_zh-CN_19112438.mp3.wav    417 71 93 82 11 45 64 37 86 68 16 74 27 47 5 30 70 52 25 11 45 64 74 27 21 95 23 53 62 29 28 87 24 46 30 70 52 48 51 19 66 60 27 63 47 76 58 65 74 27 21 95 45 64 65 3 77 15 41 84 63 96 20 62

I don't prepare multitask data; I follow the script below to train my zh-en model:

fairseq-train $DATA_ROOT \
  --config-yaml config.yaml \
  --task speech_to_speech --target-is-code --target-code-size 100 --vocoder code_hifigan  \
  --criterion speech_to_unit --label-smoothing 0.2 \
  --arch s2ut_transformer_fisher --share-decoder-input-output-embed \
  --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
  --train-subset train --valid-subset dev \
  --save-dir ${MODEL_DIR} \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-7 --warmup-updates 10000 \
  --optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 10.0 \
  --max-update 400000 --max-tokens 20000 --max-target-positions 3000 --update-freq 4 \
  --seed 1 --fp16 --num-workers 8

After that, checkpoint_best.pt exists in $MODEL_DIR.

Inference step: I ran

fairseq-generate $DATA_ROOT \
  --config-yaml config.yaml \
  --task speech_to_speech --target-is-code --target-code-size 100 --vocoder code_hifigan \
  --path $MODEL_DIR/checkpoint_best.pt  --gen-subset $GEN_SUBSET \
  --max-tokens 50000 \
  --beam 10 --max-len-a 1 \
  --results-path ${RESULTS_PATH}

and I get generate-test.txt like this

T-1771  71 89 59 38 44 18 31 59 33 97 51 19 90 35 11 64 81 84 63 96 55 39 67 54 63 93 75 91 9 29 28 92 50 87 9 44 80 85 11 64 66 27 31 53 65 3 77 5 30 44 80 74 2 3 48 46 30 16 18 29 28 23 73 3 77 52 25 13 58 32 1 85 42 88 81 83 96 55 39 67 54 63 86 51 65 6 36 7 97 44 80 26 87 97 44 80 10 37 86 9 62 6 36 92 27 63 89 59 38 44 18 27 31 60 33 48 51 19 90 35 42 11 64 81 83 84 63 96 55 34 56 72 40 72 89 59 53 44 80 18 27 31 59 33 51 19 90 35 11 64 81 83 63 84 96 55 67 54 40 72 40 72 21 95 53 44 80 18 27 31 59 33 51 19 90 35 11 64 81 83 84 96 55 67 54 63 40 93 63 89 87 38 44 80 18 31 59 33 51 19 90 35 11 81 83 63 20
H-1771  -0.6879629492759705 71 72 86 53 44 80 82 62 6 36 7 87 9 16 77 23 44 18 99 82 99 98 0 30 25 73 16 77 66 27 21 95 87 24 61 58 9 1 21 95 23 42 88 81 83 63 96 55 39 67 54 63 82 73 70 14 68 44 80 85 75 33 68 44 18 85 5 1 85 23 44 80 18 6 36 7 87 9 16 77 44 18 85 11 64 65 99 3 82 87 5 30 1 66 63 78 52 25 94 32 1 85 73 16 77 66 27 21 95 23 53 44 80 18 21 95 11 64 81 84 96 55 39 67 54 63 86 53 44 80 82 73 62 99 3 82 87 5 30 1 66 63 78 52 25 94 32 1 85 73 16 77 66 63 21 95 53 44 80 18 21 95 11 64 81 83 20
D-1771  -0.6879629492759705 71 72 86 53 44 80 82 62 6 36 7 87 9 16 77 23 44 18 99 82 99 98 0 30 25 73 16 77 66 27 21 95 87 24 61 58 9 1 21 95 23 42 88 81 83 63 96 55 39 67 54 63 82 73 70 14 68 44 80 85 75 33 68 44 18 85 5 1 85 23 44 80 18 6 36 7 87 9 16 77 44 18 85 11 64 65 99 3 82 87 5 30 1 66 63 78 52 25 94 32 1 85 73 16 77 66 27 21 95 23 53 44 80 18 21 95 11 64 81 84 96 55 39 67 54 63 86 53 44 80 82 73 62 99 3 82 87 5 30 1 66 63 78 52 25 94 32 1 85 73 16 77 66 63 21 95 53 44 80 18 21 95 11 64 81 83 20
P-1771  -0.2670 -2.3002 -1.9321 -1.2817 -1.1338 -0.4314 -0.8351 -1.7154 -0.9658 -0.2577 -0.3137 -0.5839 -1.5242 -1.3634 -0.3085 -1.7460 -0.3079 -0.1976 -1.8827 -1.1614 -0.2611 -0.7000 -0.1434 -0.2306 -0.1416 -0.2650 -0.3604 -0.7388 -0.7650 -1.0839 -0.6327 -0.2492 -0.5388 -0.5828 -0.4113 -3.3100 -0.4407 -0.1338 -0.3206 -0.2418 -0.1354 -0.5031 -0.2432 -0.2077 -1.1272 -1.0915 -0.4812 -0.2259 -0.4384 -0.2544 -0.2873 -0.6774 -3.1711 -1.5131 -1.9929 -0.3056 -0.8888 -0.3013 -0.3748 -1.0500 -0.2637 -0.5862 -0.1202 -0.2354 -0.7606 -0.1075 -0.1449 -0.6325 -0.3661 -0.2559 -0.6310 -0.6219 -0.5600 -3.2046 -0.5241 -0.1095 -1.2170 -0.6692 -0.3835 -0.1399 -1.4203 -0.4034 -0.4851 -0.2057 -0.3796 -0.4688 -1.1671 -1.0333 -0.8851 -0.4676 -0.0921 -0.1708 -0.1908 -0.7281 -1.1082 -0.7786 -0.2655 -0.2442 -0.1250 -0.2622 -1.0461 -0.2859 -1.2976 -0.4142 -0.5645 -0.4539 -1.2490 -0.3750 -0.2678 -1.7468 -0.1950 -0.1964 -0.8918 -0.2242 -0.1082 -0.2253 -0.2370 -0.3357 -1.0039 -1.9815 -1.0564 -0.4477 -0.8445 -0.2601 -0.2475 -0.6835 -1.7813 -0.9894 -0.7588 -0.4364 -0.5355 -0.6941 -2.5266 -0.7698 -0.0898 -0.2869 -0.1010 -0.0818 -0.1783 -0.2157 -1.0999 -2.3911 -0.8705 -0.2044 -0.1794 -0.3713 -0.2659 -0.9162 -0.3322 -0.5379 -0.3697 -0.5761 -0.6804 -1.8392 -0.4565 -0.2952 -0.8720 -0.1807 -1.1145 -0.2379 -0.1910 -0.2198 -0.2330 -0.5720 -1.8214 -1.2252 -1.2773 -0.4156

I ran the scripts below to convert the unit sequences to waveforms:

grep "^D\-" ${RESULTS_PATH}/generate-${GEN_SUBSET}.txt | \
  sed 's/^D-//ig' | sort -nk1 | cut -f3 \
  > ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit

python examples/speech_to_speech/generate_waveform_from_code.py \
  --in-code-file ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit \
  --vocoder $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG \
  --results-path ${RESULTS_PATH} --dur-prediction
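
(Note: --dur-prediction is used here because the target units were reduced with --reduce-unit, so the vocoder has to predict each unit's duration, as far as I understand the docs.)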

Here is my question: why does totally different data in test.tsv produce almost the same audio during inference? test.tsv was generated by prep_s2ut_data.py; generate-test.txt was generated by the fairseq-generate $DATA_ROOT step (I renamed it to cache-generate-test.txt); generate-test.unit was generated by the convert-unit-sequences-to-waveform step (I renamed it to cache-generate-test.unit). These files and some of the .wav files generated during the inference step can be acquired here. Any help is appreciated.
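
To double-check that the decoded unit sequences really are near-identical across samples, I count the distinct D- hypotheses in the fairseq-generate output with a small script like this (just a rough sketch, not part of fairseq; it reads my renamed copy of generate-test.txt):

# rough sketch: count distinct decoded unit sequences in the fairseq-generate output
from collections import Counter

hyps = {}
with open("cache-generate-test.txt") as f:      # renamed generate-test.txt
    for line in f:
        if line.startswith("D-"):
            # D- lines are "D-<id>\t<score>\t<unit sequence>"
            sample_id, _score, units = line.rstrip("\n").split("\t")
            hyps[sample_id] = units

counts = Counter(hyps.values())
print(f"{len(hyps)} samples, {len(counts)} distinct unit sequences")
for units, n in counts.most_common(3):
    print(n, units[:80])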

tarudesu commented 1 year ago

I have the same issue as this. Does anyone have any idea?

tarudesu commented 1 year ago

@LaHeriody Have you fixed this one successfully?

LaHeriody commented 1 year ago

@LaHeriody Have you fixed this one successfully?

Actually no, I have no idea about this issue.

tarudesu commented 1 year ago

@LaHeriody Have you fixed this one successfully?

Actually no, I have no idea about this issue.

Hope someone can explain! Ah, but actually, could I have your config.yaml file for fairseq-generate?

LaHeriody commented 1 year ago

@tarudesu I have uploaded config.yaml here; hope it can help you. By the way, may I have some .wav audio files generated from your inference step?

tarudesu commented 1 year ago

@LaHeriody Thank you so much! Here are some samples from my inference (I tried to train a ja-en translation model). Almost all the outputs are the same (they even have a long silence at the end).

LaHeriody commented 1 year ago

@tarudesu I added multitask data during the training step and then used the trained model in the inference step. I got some audio files; they sound different, but they are still not the correctly translated audio. Hope that helps you.

tarudesu commented 1 year ago

@LaHeriody Ah, could I have your config for multitasking? I'm still trying to fix this kind of thing.

LaHeriody commented 1 year ago

Just the same as the doc says:

source_letter:  # $TASK_NAME
   decoder_type: transformer
   dict: ${DATA_ROOT}/source_letter/dict.txt
   data: ${DATA_ROOT}/source_letter
   encoder_layer: 6
   loss_weight: 8.0
target_letter:
   decoder_type: transformer
   dict: ${DATA_ROOT}/target_letter/dict.txt
   data: ${DATA_ROOT}/target_letter
   encoder_layer: 8
   loss_weight: 8.0
decoder_target_ctc:
   decoder_type: ctc
   dict: ${DATA_ROOT}/decoder_target_ctc/dict.txt
   data: ${DATA_ROOT}/decoder_target_ctc
   decoder_layer: 3
   loss_weight: 1.6
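
(If I read the docs correctly, this block lives in a separate config_multitask.yaml that is passed to fairseq-train with --multitask-config-yaml.)
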
tarudesu commented 1 year ago

Ah, but actually, I'm not sure what $TASK_NAME and dict.txt are.

LaHeriody commented 1 year ago

$TASK_NAME is up to you; you can set $TASK_NAME=my_task, for example. dict.txt under source_letter is the dictionary of your source-language text; under target_letter and decoder_target_ctc, dict.txt is the dictionary of the target-language text. Here is a demo:

token1 frequency
token2 frequency
token3 frequency
...
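
If it helps, such a dict.txt can be generated from the multitask tsv with a quick script (just a sketch; it assumes a tab-separated tsv with "id" and "tgt_text" columns):

# sketch: build a fairseq-style dict.txt ("token count" per line) from the multitask train tsv
import csv
from collections import Counter

counter = Counter()
with open("train.tsv") as f:                   # multitask tsv for one task
    for row in csv.DictReader(f, delimiter="\t"):
        counter.update(row["tgt_text"].split())

with open("dict.txt", "w") as f:
    for token, count in counter.most_common():
        f.write(f"{token} {count}\n")
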
tarudesu commented 1 year ago

Excuse me! It has been a long time; could I ask if you have solved this problem yet? @LaHeriody

PrabhjotKaurGosal commented 1 year ago

Hi @tarudesu - I am also working on the same problem. So far my results are consistent with your findings: I get the same audio prediction for all samples (without any multitask data). I am preparing the multitask data now and still trying to figure out the "how" part.

Haoheya commented 1 year ago

Excuse me! I added multitask data for the training step, and I get .tsv files like this:

id  tgt_text
sample_id_0 token1 token2 token3 ...
sample_id_1 token1 token2 token3 ...
...

and dict.txt like this:

token1 frequency
token2 frequency
token3 frequency
...

but I got an error:

Traceback (most recent call last):
  File "/root/miniconda3/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/tmp/py_project/fairseq/fairseq_cli/train.py", line 574, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/tmp/py_project/fairseq/fairseq/distributed/utils.py", line 404, in call_main
    main(cfg, **kwargs)
  File "/tmp/py_project/fairseq/fairseq_cli/train.py", line 165, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(
  File "/tmp/py_project/fairseq/fairseq/checkpoint_utils.py", line 279, in load_checkpoint
    epoch_itr = trainer.get_train_iterator(
  File "/tmp/py_project/fairseq/fairseq/trainer.py", line 736, in get_train_iterator
    self.reset_dummy_batch(batch_iterator.first_batch)
  File "/tmp/py_project/fairseq/fairseq/data/iterators.py", line 372, in first_batch
    return self.collate_fn([self.dataset[i] for i in self.frozen_batches[0]])
  File "/tmp/py_project/fairseq/fairseq/data/audio/speech_to_speech_dataset.py", line 270, in collater
    task_target = task_dataset.collater(d)
  File "/tmp/py_project/fairseq/fairseq/data/audio/speech_to_text_dataset.py", line 474, in collater
    prev_out = fairseq_data_utils.collate_tokens(
  File "/tmp/py_project/fairseq/fairseq/data/data_utils.py", line 70, in collate_tokens
    copy_tensor(v, res[i][size - len(v) :] if left_pad else res[i][: len(v)])
  File "/tmp/py_project/fairseq/fairseq/data/data_utils.py", line 62, in copy_tensor
    dst[0] = src[-1]
IndexError: index -1 is out of bounds for dimension 0 with size 0

Does anyone have any idea?

PrabhjotKaurGosal commented 1 year ago

@Haoheya - I got the exact same error as you. My .tsv files and dict.txt are formatted the same way as yours too. I am actively trying to debug. I will post here if and when I am able to figure out the answer.

PrabhjotKaurGosal commented 1 year ago

@Haoheya - I was able to fix the error in my case: the sample names under 'id' in the .tsv files for the multitask data did not match exactly with the sample names in the speech-to-speech data in ${DATA_ROOT}/${SPLIT}.tsv.

After I corrected the sample names in the .tsv file for the multitask data, the training started successfully.
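
In case it helps others hitting the same IndexError, a quick way to spot the mismatch is to compare the ids in the two tsv files, roughly like this (sketch only; the paths are placeholders for ${DATA_ROOT}/train.tsv and one of the multitask data dirs):

# sketch: find ids in the speech-to-speech tsv that are missing from a multitask tsv
import csv

def read_ids(path):
    with open(path) as f:
        return {row["id"] for row in csv.DictReader(f, delimiter="\t")}

s2s_ids = read_ids("train.tsv")                     # ${DATA_ROOT}/train.tsv
multitask_ids = read_ids("target_letter/train.tsv") # multitask data for one task

missing = sorted(s2s_ids - multitask_ids)
print(f"{len(missing)} ids from the speech-to-speech tsv are missing in the multitask tsv")
for sample_id in missing[:10]:
    print(sample_id)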

Haoheya commented 1 year ago

Thanks @PrabhjotKaurGosal! It's very much appreciated!

PrabhjotKaurGosal commented 11 months ago

Hello @tarudesu, @LaHeriody - May I know what your training sample size was and how many epochs you had to train the model for? I am not getting good results in my case. I am afraid my sample size may be too small or I am not running enough epochs. My training set is just over 1600 samples, and I ran training for 25 epochs. Thanks!

PrabhjotKaurGosal commented 9 months ago

@9seven - I have not seen this error. You may want to check the config.yaml file; the attribute input_feat_per_channel is defined there (in my case it is set to 80). It is interesting that you are seeing this error only during inference: the training step uses the same config file, so if the problem were with config.yaml, training should give errors as well.

9seven commented 9 months ago

@9seven - I have not seen this error. You may want to check the config.yaml file; the attribute input_feat_per_channel is defined there (in my case it is set to 80). It is interesting that you are seeing this error only during inference: the training step uses the same config file, so if the problem were with config.yaml, training should give errors as well.

It seems that the training goes well. Also, I compared my config.yaml file to the others above, and there's no difference between them, haha. Anyway, thanks for replying, and I look forward to your new video updates!!!