facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

[Speech to speech translation with discrete units] produces almost the same audio for different test audio files during inference #5003

Open LaHeriody opened 1 year ago

LaHeriody commented 1 year ago

❓ Questions and Help

I followed the doc here to do speech-to-speech translation with discrete units. First, I prepared the target units using

python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
    --feature_type $TYPE \
    --kmeans_model_path $KM_MODEL_PATH \
    --acoustic_model_path $CKPT_PATH \
    --layer $LAYER \
    --manifest_path $MANIFEST \
    --out_quantized_file_path $OUT_QUANTIZED_FILE \
    --extension ".wav"
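
(For reference, $MANIFEST here is the wav2vec-style manifest, built with examples/wav2vec/wav2vec_manifest.py if I remember the GSLM speech2unit docs correctly: the first line is the audio root directory, and each following line is a relative path and the number of samples, tab-separated, roughly like

<audio_root_dir>
rel/path/audio1.wav	<num_samples>
rel/path/audio2.wav	<num_samples>
)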

and I get test.txt/train.txt/valid.txt files like this:

common_voice_zh-CN_19112438.mp3|71 71 93 82 11 45 64 37 37 86 68 68 16 74 27 47 5 5 30 30 70 70 52 25 25 11 45 64 74 27 21 95 95 23 53 53 62 29 28 28 28 87 24 46 30 30 70 70 70 52 52 52 48 48 51 51 19 19 19 19 66 60 27 63 47 76 58 58 58 65 74 27 21 21 95 95 95 45 45 45 45 64 64 64 65 3 3 77 15 15 15 15 15 41 41 84 63 96 20 20

Second, I run the script below:

DATA_ROOT=/home/lbh/re_split_dataset
SRC_AUDIO=${DATA_ROOT}/processed_zh_dataset
TGT_AUDIO=${DATA_ROOT}/processed_en_dataset
SPLIT1=train
SPLIT2=valid
SPLIT3=test
python /home/lbh/fairseq/examples/speech_to_speech/preprocessing/prep_s2ut_data.py \
  --source-dir $SRC_AUDIO --target-dir $TGT_AUDIO --data-split $SPLIT1 $SPLIT2 $SPLIT3 \
  --output-root $DATA_ROOT --reduce-unit \
  --vocoder-checkpoint $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG

and I get test.tsv/train.tsv/valid.tsv as shown below

id  src_audio   src_n_frames    tgt_audio   tgt_n_frames
common_voice_zh-CN_19112438.mp3 /home/lbh/re_split_dataset/processed_zh_dataset/test/common_voice_zh-CN_19112438.mp3.wav    417 71 93 82 11 45 64 37 86 68 16 74 27 47 5 30 70 52 25 11 45 64 74 27 21 95 23 53 62 29 28 87 24 46 30 70 52 48 51 19 66 60 27 63 47 76 58 65 74 27 21 95 45 64 65 3 77 15 41 84 63 96 20 62

I don't prepare multitask data; I follow the script below to train my zh-en model:

fairseq-train $DATA_ROOT \
  --config-yaml config.yaml \
  --task speech_to_speech --target-is-code --target-code-size 100 --vocoder code_hifigan  \
  --criterion speech_to_unit --label-smoothing 0.2 \
  --arch s2ut_transformer_fisher --share-decoder-input-output-embed \
  --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
  --train-subset train --valid-subset dev \
  --save-dir ${MODEL_DIR} \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-7 --warmup-updates 10000 \
  --optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 10.0 \
  --max-update 400000 --max-tokens 20000 --max-target-positions 3000 --update-freq 4 \
  --seed 1 --fp16 --num-workers 8

After that, checkpoint_best.pt exists in $MODEL_DIR.

Inference step: I ran

fairseq-generate $DATA_ROOT \
  --config-yaml config.yaml \
  --task speech_to_speech --target-is-code --target-code-size 100 --vocoder code_hifigan \
  --path $MODEL_DIR/checkpoint_best.pt  --gen-subset $GEN_SUBSET \
  --max-tokens 50000 \
  --beam 10 --max-len-a 1 \
  --results-path ${RESULTS_PATH}

and I get generate-test.txt like this

T-1771  71 89 59 38 44 18 31 59 33 97 51 19 90 35 11 64 81 84 63 96 55 39 67 54 63 93 75 91 9 29 28 92 50 87 9 44 80 85 11 64 66 27 31 53 65 3 77 5 30 44 80 74 2 3 48 46 30 16 18 29 28 23 73 3 77 52 25 13 58 32 1 85 42 88 81 83 96 55 39 67 54 63 86 51 65 6 36 7 97 44 80 26 87 97 44 80 10 37 86 9 62 6 36 92 27 63 89 59 38 44 18 27 31 60 33 48 51 19 90 35 42 11 64 81 83 84 63 96 55 34 56 72 40 72 89 59 53 44 80 18 27 31 59 33 51 19 90 35 11 64 81 83 63 84 96 55 67 54 40 72 40 72 21 95 53 44 80 18 27 31 59 33 51 19 90 35 11 64 81 83 84 96 55 67 54 63 40 93 63 89 87 38 44 80 18 31 59 33 51 19 90 35 11 81 83 63 20
H-1771  -0.6879629492759705 71 72 86 53 44 80 82 62 6 36 7 87 9 16 77 23 44 18 99 82 99 98 0 30 25 73 16 77 66 27 21 95 87 24 61 58 9 1 21 95 23 42 88 81 83 63 96 55 39 67 54 63 82 73 70 14 68 44 80 85 75 33 68 44 18 85 5 1 85 23 44 80 18 6 36 7 87 9 16 77 44 18 85 11 64 65 99 3 82 87 5 30 1 66 63 78 52 25 94 32 1 85 73 16 77 66 27 21 95 23 53 44 80 18 21 95 11 64 81 84 96 55 39 67 54 63 86 53 44 80 82 73 62 99 3 82 87 5 30 1 66 63 78 52 25 94 32 1 85 73 16 77 66 63 21 95 53 44 80 18 21 95 11 64 81 83 20
D-1771  -0.6879629492759705 71 72 86 53 44 80 82 62 6 36 7 87 9 16 77 23 44 18 99 82 99 98 0 30 25 73 16 77 66 27 21 95 87 24 61 58 9 1 21 95 23 42 88 81 83 63 96 55 39 67 54 63 82 73 70 14 68 44 80 85 75 33 68 44 18 85 5 1 85 23 44 80 18 6 36 7 87 9 16 77 44 18 85 11 64 65 99 3 82 87 5 30 1 66 63 78 52 25 94 32 1 85 73 16 77 66 27 21 95 23 53 44 80 18 21 95 11 64 81 84 96 55 39 67 54 63 86 53 44 80 82 73 62 99 3 82 87 5 30 1 66 63 78 52 25 94 32 1 85 73 16 77 66 63 21 95 53 44 80 18 21 95 11 64 81 83 20
P-1771  -0.2670 -2.3002 -1.9321 -1.2817 -1.1338 -0.4314 -0.8351 -1.7154 -0.9658 -0.2577 -0.3137 -0.5839 -1.5242 -1.3634 -0.3085 -1.7460 -0.3079 -0.1976 -1.8827 -1.1614 -0.2611 -0.7000 -0.1434 -0.2306 -0.1416 -0.2650 -0.3604 -0.7388 -0.7650 -1.0839 -0.6327 -0.2492 -0.5388 -0.5828 -0.4113 -3.3100 -0.4407 -0.1338 -0.3206 -0.2418 -0.1354 -0.5031 -0.2432 -0.2077 -1.1272 -1.0915 -0.4812 -0.2259 -0.4384 -0.2544 -0.2873 -0.6774 -3.1711 -1.5131 -1.9929 -0.3056 -0.8888 -0.3013 -0.3748 -1.0500 -0.2637 -0.5862 -0.1202 -0.2354 -0.7606 -0.1075 -0.1449 -0.6325 -0.3661 -0.2559 -0.6310 -0.6219 -0.5600 -3.2046 -0.5241 -0.1095 -1.2170 -0.6692 -0.3835 -0.1399 -1.4203 -0.4034 -0.4851 -0.2057 -0.3796 -0.4688 -1.1671 -1.0333 -0.8851 -0.4676 -0.0921 -0.1708 -0.1908 -0.7281 -1.1082 -0.7786 -0.2655 -0.2442 -0.1250 -0.2622 -1.0461 -0.2859 -1.2976 -0.4142 -0.5645 -0.4539 -1.2490 -0.3750 -0.2678 -1.7468 -0.1950 -0.1964 -0.8918 -0.2242 -0.1082 -0.2253 -0.2370 -0.3357 -1.0039 -1.9815 -1.0564 -0.4477 -0.8445 -0.2601 -0.2475 -0.6835 -1.7813 -0.9894 -0.7588 -0.4364 -0.5355 -0.6941 -2.5266 -0.7698 -0.0898 -0.2869 -0.1010 -0.0818 -0.1783 -0.2157 -1.0999 -2.3911 -0.8705 -0.2044 -0.1794 -0.3713 -0.2659 -0.9162 -0.3322 -0.5379 -0.3697 -0.5761 -0.6804 -1.8392 -0.4565 -0.2952 -0.8720 -0.1807 -1.1145 -0.2379 -0.1910 -0.2198 -0.2330 -0.5720 -1.8214 -1.2252 -1.2773 -0.4156

I ran the scripts below to convert the unit sequences to waveforms:

grep "^D\-" ${RESULTS_PATH}/generate-${GEN_SUBSET}.txt | \
  sed 's/^D-//ig' | sort -nk1 | cut -f3 \
  > ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit

python examples/speech_to_speech/generate_waveform_from_code.py \
  --in-code-file ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit \
  --vocoder $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG \
  --results-path ${RESULTS_PATH} --dur-prediction
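
(Note: --dur-prediction is used here because the target units were reduced with --reduce-unit, so the vocoder has to predict each unit's duration, as far as I understand the docs.)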

Here is my question: why does totally different data in test.tsv produce almost the same audio during inference? test.tsv was generated by prep_s2ut_data.py; generate-test.txt was generated by the fairseq-generate $DATA_ROOT step (I renamed it to cache-generate-test.txt); generate-test.unit was generated by the convert-unit-sequences-to-waveform step (I renamed it to cache-generate-test.unit). These files and some of the .wav files generated during the inference step can be acquired here. Any help is appreciated.
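
To double-check that the decoded unit sequences really are near-identical across samples, I count the distinct D- hypotheses in the fairseq-generate output with a small script like this (just a rough sketch, not part of fairseq; it reads my renamed copy of generate-test.txt):

# rough sketch: count distinct decoded unit sequences in the fairseq-generate output
from collections import Counter

hyps = {}
with open("cache-generate-test.txt") as f:      # renamed generate-test.txt
    for line in f:
        if line.startswith("D-"):
            # D- lines are "D-<id>\t<score>\t<unit sequence>"
            sample_id, _score, units = line.rstrip("\n").split("\t")
            hyps[sample_id] = units

counts = Counter(hyps.values())
print(f"{len(hyps)} samples, {len(counts)} distinct unit sequences")
for units, n in counts.most_common(3):
    print(n, units[:80])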

tarudesu commented 1 year ago

I have the same issue as this. Does anyone have any idea?

tarudesu commented 1 year ago

@LaHeriody Have you fixed this one successfully?

LaHeriody commented 1 year ago

@LaHeriody Have you fixed this one successfully?

Actually no, I have no idea about this issue.

tarudesu commented 1 year ago

@LaHeriody Have you fixed this one successfully?

Actually no, I have no idea about this issue.

Hope someone can explain! Ah, but actually, could I have your config.yaml file for fairseq-generate?

LaHeriody commented 1 year ago

@tarudesu I have uploaded config.yaml here; hope it can help you. By the way, may I have some .wav audio files generated from your inference step?

tarudesu commented 1 year ago

@LaHeriody Thank you so much! Here are some samples from my inference (I tried to train a ja-en translation model). Almost all the outputs are the same (they even have a long silence at the end).

LaHeriody commented 1 year ago

@tarudesu I added multitask data during the training step and then used the trained model in the inference step. I got some audio files; they sound different, but they are still not the correctly translated audio. Hope that helps you.

tarudesu commented 1 year ago

@LaHeriody Ah, could I have your config for multitasking? I'm still trying to fix this kind of thing.

LaHeriody commented 1 year ago

Just the same as the doc says:

source_letter:  # $TASK_NAME
   decoder_type: transformer
   dict: ${DATA_ROOT}/source_letter/dict.txt
   data: ${DATA_ROOT}/source_letter
   encoder_layer: 6
   loss_weight: 8.0
target_letter:
   decoder_type: transformer
   dict: ${DATA_ROOT}/target_letter/dict.txt
   data: ${DATA_ROOT}/target_letter
   encoder_layer: 8
   loss_weight: 8.0
decoder_target_ctc:
   decoder_type: ctc
   dict: ${DATA_ROOT}/decoder_target_ctc/dict.txt
   data: ${DATA_ROOT}/decoder_target_ctc
   decoder_layer: 3
   loss_weight: 1.6
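
(If I read the docs correctly, this block lives in a separate config_multitask.yaml that is passed to fairseq-train with --multitask-config-yaml.)
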
tarudesu commented 1 year ago

Ah, but actually, I'm not sure what $TASK_NAME and dict.txt are.

LaHeriody commented 1 year ago

$TASK_NAME is up to you; you can set $TASK_NAME=my_task, for example. dict.txt under source_letter is the dictionary of your source-language text; under target_letter and decoder_target_ctc, dict.txt is the dictionary of the target-language text. Here is a demo:

token1 frequency
token2 frequency
token3 frequency
...
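
If it helps, such a dict.txt can be generated from the multitask tsv with a quick script (just a sketch; it assumes a tab-separated tsv with "id" and "tgt_text" columns):

# sketch: build a fairseq-style dict.txt ("token count" per line) from the multitask train tsv
import csv
from collections import Counter

counter = Counter()
with open("train.tsv") as f:                   # multitask tsv for one task
    for row in csv.DictReader(f, delimiter="\t"):
        counter.update(row["tgt_text"].split())

with open("dict.txt", "w") as f:
    for token, count in counter.most_common():
        f.write(f"{token} {count}\n")
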
tarudesu commented 1 year ago

Excuse me! It has been a long time; could I ask if you have solved this problem yet? @LaHeriody

PrabhjotKaurGosal commented 1 year ago

Hi @tarudesu - I am also working on the same problem. So far my results are consistent with your findings: I get the same audio prediction for all samples (without any multitask data). I am preparing the multitask data now and still trying to figure out the "how" part.

Haoheya commented 1 year ago

Excuse me! I added multitask data for the training step, and I get .tsv files like this:

id  tgt_text
sample_id_0 token1 token2 token3 ...
sample_id_1 token1 token2 token3 ...
...

and dict.txt like this:

token1 frequency
token2 frequency
token3 frequency
...

but I got an error:

Traceback (most recent call last):
  File "/root/miniconda3/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/tmp/py_project/fairseq/fairseq_cli/train.py", line 574, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/tmp/py_project/fairseq/fairseq/distributed/utils.py", line 404, in call_main
    main(cfg, **kwargs)
  File "/tmp/py_project/fairseq/fairseq_cli/train.py", line 165, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(
  File "/tmp/py_project/fairseq/fairseq/checkpoint_utils.py", line 279, in load_checkpoint
    epoch_itr = trainer.get_train_iterator(
  File "/tmp/py_project/fairseq/fairseq/trainer.py", line 736, in get_train_iterator
    self.reset_dummy_batch(batch_iterator.first_batch)
  File "/tmp/py_project/fairseq/fairseq/data/iterators.py", line 372, in first_batch
    return self.collate_fn([self.dataset[i] for i in self.frozen_batches[0]])
  File "/tmp/py_project/fairseq/fairseq/data/audio/speech_to_speech_dataset.py", line 270, in collater
    task_target = task_dataset.collater(d)
  File "/tmp/py_project/fairseq/fairseq/data/audio/speech_to_text_dataset.py", line 474, in collater
    prev_out = fairseq_data_utils.collate_tokens(
  File "/tmp/py_project/fairseq/fairseq/data/data_utils.py", line 70, in collate_tokens
    copy_tensor(v, res[i][size - len(v) :] if left_pad else res[i][: len(v)])
  File "/tmp/py_project/fairseq/fairseq/data/data_utils.py", line 62, in copy_tensor
    dst[0] = src[-1]
IndexError: index -1 is out of bounds for dimension 0 with size 0

Does anyone have any idea?

PrabhjotKaurGosal commented 1 year ago

@Haoheya - I got the exact same error as you. My .tsv files and dict.txt are formatted the same way as yours too. I am actively trying to debug. I will post here if and when I am able to figure out the answer.

PrabhjotKaurGosal commented 1 year ago

@Haoheya - I was able to fix the error in my case: the sample names under 'id' in the .tsv files for the multitask data did not match exactly with the sample names in the speech-to-speech data in ${DATA_ROOT}/${SPLIT}.tsv.

After I corrected the sample names in the .tsv file for the multitask data, the training started successfully.
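
In case it helps others hitting the same IndexError, a quick way to spot the mismatch is to compare the ids in the two tsv files, roughly like this (sketch only; the paths are placeholders for ${DATA_ROOT}/train.tsv and one of the multitask data dirs):

# sketch: find ids in the speech-to-speech tsv that are missing from a multitask tsv
import csv

def read_ids(path):
    with open(path) as f:
        return {row["id"] for row in csv.DictReader(f, delimiter="\t")}

s2s_ids = read_ids("train.tsv")                     # ${DATA_ROOT}/train.tsv
multitask_ids = read_ids("target_letter/train.tsv") # multitask data for one task

missing = sorted(s2s_ids - multitask_ids)
print(f"{len(missing)} ids from the speech-to-speech tsv are missing in the multitask tsv")
for sample_id in missing[:10]:
    print(sample_id)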

Haoheya commented 1 year ago

Thanks @PrabhjotKaurGosal! It's very much appreciated!

PrabhjotKaurGosal commented 11 months ago

Hello @tarudesu, @LaHeriody - May I know what your training sample size was and how many epochs you had to train the model for? I am not getting good results in my case. I am afraid my sample size may be too small or I am not running enough epochs. My training set is just over 1600 samples, and I ran training for 25 epochs. Thanks!

PrabhjotKaurGosal commented 9 months ago

@9seven - I have not seen this error. You may want to check the config.yaml file; the attribute input_feat_per_channel is defined there (in my case it is set to 80). It is interesting that you are seeing this error only during inference: the training step uses the same config file, so if the problem were with config.yaml, training should give errors as well.

9seven commented 9 months ago

@9seven - I have not seen this error. You may want to check the config.yaml file; the attribute input_feat_per_channel is defined there (in my case it is set to 80). It is interesting that you are seeing this error only during inference: the training step uses the same config file, so if the problem were with config.yaml, training should give errors as well.

It seems that the training goes well. Also, I compared my config.yaml file to the others above, and there's no difference between them, haha. Anyway, thanks for replying, and I look forward to your new video updates!!!