[Open] LaHeriody opened this issue 1 year ago
I have the same issue as this. Does anyone have any idea?
@LaHeriody Have you fixed this one successfully?
Actually no, I have no idea about this issue.
Hope someone can explain! Ah, but actually, could I have your config.yaml file for fairseq-generate?
@tarudesu I have uploaded config.yaml here; hope that helps you. By the way, may I have some .wav audio files generated from your inference step?
@LaHeriody Thank you so much! Here are some samples from my inference (I tried to train a ja-en translation). Almost all the outputs are the same (they even have a long silence at the end).
@tarudesu I added multitask data during the training step and then used the trained model in the inference step. I got some audios; they sound different, but they are still not exactly the translated audio. Hope that helps you.
@LaHeriody Ah, could I have your config for multitasking? I'm still trying to fix this kind of thing.
It's just the same as the doc says:
source_letter:  # $TASK_NAME
  decoder_type: transformer
  dict: ${DATA_ROOT}/source_letter/dict.txt
  data: ${DATA_ROOT}/source_letter
  encoder_layer: 6
  loss_weight: 8.0
target_letter:
  decoder_type: transformer
  dict: ${DATA_ROOT}/target_letter/dict.txt
  data: ${DATA_ROOT}/target_letter
  encoder_layer: 8
  loss_weight: 8.0
decoder_target_ctc:
  decoder_type: ctc
  dict: ${DATA_ROOT}/decoder_target_ctc/dict.txt
  data: ${DATA_ROOT}/decoder_target_ctc
  decoder_layer: 3
  loss_weight: 1.6
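Not from the doc, but a quick sanity check you could run before training: a minimal Python sketch where DATA_ROOT, the task names, and the split names are assumptions matching the config above.

```python
import os

# Minimal sanity check (not part of fairseq): verify the multitask data
# layout referenced by the config above. DATA_ROOT, the task names, and
# the split names below are assumptions; adjust them to your setup.
DATA_ROOT = "/path/to/data"
TASKS = ["source_letter", "target_letter", "decoder_target_ctc"]
SPLITS = ["train", "valid", "test"]

for task in TASKS:
    for name in ["dict.txt"] + [f"{split}.tsv" for split in SPLITS]:
        path = os.path.join(DATA_ROOT, task, name)
        status = "OK" if os.path.isfile(path) else "MISSING"
        print(f"{status}: {path}")
```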
Ah, but actually, I'm not sure what $TASK_NAME and dict.txt are.
$TASK_NAME is up to you; you can set $TASK_NAME=my_task, for example. dict.txt in source_letter represents the dictionary of your source-language text; in target_letter and decoder_target_ctc, dict.txt represents the target-language text. Here is a demo:
token1 frequency
token2 frequency
token3 frequency
...
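If it helps, here is a rough sketch of how such a dict.txt could be built from a multitask .tsv. It is not from the fairseq doc; the paths and the tgt_text column name are assumptions.

```python
import csv
from collections import Counter

# Rough sketch (not from the fairseq doc): build a "token frequency"
# dict.txt from a multitask .tsv. The paths and the tgt_text column name
# are assumptions; adapt them to your data.
counter = Counter()
with open("/path/to/target_letter/train.tsv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        counter.update(row["tgt_text"].split())

with open("/path/to/target_letter/dict.txt", "w", encoding="utf-8") as f:
    for token, freq in counter.most_common():
        f.write(f"{token} {freq}\n")
```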
Excuse me! It has been a long time; may I ask whether you have solved this problem yet? @LaHeriody
Hi @tarudesu - I am also working on the same problem. So far, my results are consistent with your finding: I get the same audio prediction for all samples (without doing any multitask). I am preparing the multitask data now and still trying to figure out the "how" part.
Excuse me! I added multitask data during the training step, and I get .tsv files like this:
id tgt_text
sample_id_0 token1 token2 token3 ...
sample_id_1 token1 token2 token3 ...
...
and dict.txt looks like this:
token1 frequency
token2 frequency
token3 frequency
...
but I got an error:
Traceback (most recent call last):
  File "/root/miniconda3/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/tmp/py_project/fairseq/fairseq_cli/train.py", line 574, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/tmp/py_project/fairseq/fairseq/distributed/utils.py", line 404, in call_main
    main(cfg, **kwargs)
  File "/tmp/py_project/fairseq/fairseq_cli/train.py", line 165, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(
  File "/tmp/py_project/fairseq/fairseq/checkpoint_utils.py", line 279, in load_checkpoint
    epoch_itr = trainer.get_train_iterator(
  File "/tmp/py_project/fairseq/fairseq/trainer.py", line 736, in get_train_iterator
    self.reset_dummy_batch(batch_iterator.first_batch)
  File "/tmp/py_project/fairseq/fairseq/data/iterators.py", line 372, in first_batch
    return self.collate_fn([self.dataset[i] for i in self.frozen_batches[0]])
  File "/tmp/py_project/fairseq/fairseq/data/audio/speech_to_speech_dataset.py", line 270, in collater
    task_target = task_dataset.collater(d)
  File "/tmp/py_project/fairseq/fairseq/data/audio/speech_to_text_dataset.py", line 474, in collater
    prev_out = fairseq_data_utils.collate_tokens(
  File "/tmp/py_project/fairseq/fairseq/data/data_utils.py", line 70, in collate_tokens
    copy_tensor(v, res[i][size - len(v) :] if left_pad else res[i][: len(v)])
  File "/tmp/py_project/fairseq/fairseq/data/data_utils.py", line 62, in copy_tensor
    dst[0] = src[-1]
IndexError: index -1 is out of bounds for dimension 0 with size 0
Does anyone have any idea?
@Haoheya - I got the exact same error as yours. My .tsv files and dict.txt are formatted the same way as yours too. I am actively trying to debug, and I will post here if and when I am able to figure out the answer.
@Haoheya - I was able to fix the error in my case. The sample names under 'id' in the .tsv files for the multitask data were not matching exactly with the sample names in the speech-to-speech data in ${DATA_ROOT}/${SPLIT}.tsv.
After I corrected the sample names in the .tsv files for the multitask data, the training started successfully.
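If anyone hits the same IndexError, here is a quick, hypothetical check (not part of fairseq; the paths are placeholders) that compares the ids before training:

```python
import csv

# Hypothetical helper (not part of fairseq): compare the "id" column of a
# multitask .tsv with the ids in the main speech-to-speech .tsv. Any
# mismatch leaves empty targets, which can trigger the IndexError above.
def read_ids(tsv_path):
    with open(tsv_path, newline="", encoding="utf-8") as f:
        return {row["id"] for row in csv.DictReader(f, delimiter="\t")}

s2s_ids = read_ids("/path/to/DATA_ROOT/train.tsv")
task_ids = read_ids("/path/to/DATA_ROOT/target_letter/train.tsv")

print("ids missing from the multitask tsv:", sorted(s2s_ids - task_ids)[:10])
print("extra ids in the multitask tsv:", sorted(task_ids - s2s_ids)[:10])
```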
Thanks @PrabhjotKaurGosal! It's very much appreciated!
Hello @tarudesu, @LaHeriody - May I know what your sample size for training was and how many epochs you had to train the model for? I am not getting good results in my case, and I am afraid the sample size may be too small or I am not running enough epochs. My training set is just over 1600 samples, and I ran training for 25 epochs. Thanks!
@9seven - I have not seen this error. You may want to check the config.yaml file; the attribute input_feat_per_channel is defined there, and in my case it is set to 80. It is interesting that you are seeing this error only during inference. The training step uses the same config file, so if the problem were with config.yaml, training should give errors as well.
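If it helps, here is a small sanity-check sketch one could try; it is not a fairseq command, and the paths plus the assumption that features are stored as per-utterance .npy arrays are assumptions of mine.

```python
import numpy as np
import yaml

# Sanity-check sketch (not a fairseq command): compare input_feat_per_channel
# in config.yaml with the last dimension of one extracted feature file.
# The paths, and the assumption that features are per-utterance .npy arrays
# of shape (frames, feat_dim), are placeholders.
with open("/path/to/config.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)
print("input_feat_per_channel:", cfg.get("input_feat_per_channel"))

feats = np.load("/path/to/features/sample_id_0.npy")
print("feature dim of one sample:", feats.shape[-1])
```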
It seems that the training goes on well. Also, I compared my config.yaml file with the others above, and there's no difference between them, haha. Anyway, thanks for replying, and I look forward to your new video updates!
❓ Questions and Help
I followed the doc here to do speech-to-speech translation with discrete units. First, I prepared target units with the script from the doc and got test.txt, train.txt, and valid.txt. Second, I ran prep_s2ut_data.py and got test.tsv, train.tsv, and valid.tsv. I did not prepare multitask data. I followed the training script from the doc to train my zh-en model; after that, checkpoint_best.pt exists in $MODEL_DIR. For the inference step, I ran fairseq-generate and got generate-test.txt, then ran the script from the doc to convert unit sequences to waveform.

Here is my question: why does completely different data in test.tsv produce almost the same audio during inference?
Attached files:
- test.tsv, generated by prep_s2ut_data.py
- generate-test.txt, generated in the fairseq-generate $DATA_ROOT step (I renamed it to cache-generate-test.txt)
- generate-test.unit, generated in the "Convert unit sequences to waveform" step (I renamed it to cache-generate-test.unit)
- some .wav files generated during the inference step, which can be acquired here

Any help is appreciated.
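One rough way to quantify the symptom is to count the distinct hypotheses in generate-test.txt. The sketch below assumes the usual fairseq-generate output format, where the detokenized hypothesis sits on lines starting with "D-"; the path is a placeholder.

```python
from collections import Counter

# Rough sketch: count how many distinct unit sequences fairseq-generate
# produced. Assumes the usual output format where "D-<id>\t<score>\t<hypo>"
# lines carry the detokenized hypothesis; the file path is a placeholder.
hypos = []
with open("/path/to/generate-test.txt", encoding="utf-8") as f:
    for line in f:
        if line.startswith("D-"):
            hypos.append(line.rstrip("\n").split("\t")[-1])

print(f"{len(hypos)} hypotheses, {len(set(hypos))} unique")
for seq, count in Counter(hypos).most_common(3):
    print(count, seq[:80])
```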