[Open] keiouok opened this issue 2 years ago
Do you use SimulEval? Why not share how you use it in detail? There are many differences in agent.py.
@keiouok
@duj12 Thank you for your empirical advice. Training with s2t_transformer may be more efficient; I'll try implementing simultaneous ST with s2t_transformer.
@1190301804 Yes, I followed the SimulEval scripts in simul_mustc_example.md.
Also, it turned out that my extracted fbank features were not suitable for the default simultaneous ST setup.
In this issue, I got that poor result by using the same fbank features extracted with the offline MuST-C ST docs settings (vocab 8000, no global CMVN) on the main branch at that time.
After that, I succeeded with simulST (BLEU 11-12) using global-CMVN features, following simul_mustc_example.md at commit 436166a00c2ecd1215df258f022608947cca2aa8
(for both preprocessing and training).
However, it failed with the non-global-CMVN features (almost all outputs were "(Applaus)").
I can't believe the difference between global-CMVN and utterance-CMVN features causes such different results... I'll check the details when I have time. Anyway, we could reproduce the result. Thank you.
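To make the train/inference mismatch discussed above concrete, here is a minimal sketch of the two normalizations (my own illustration with NumPy; the function names are mine, not fairseq's, and fairseq applies the stored global statistics inside its feature-transform pipeline):

```python
import numpy as np

def utterance_cmvn(feats):
    # Per-utterance CMVN: each utterance is normalized by its own
    # mean and standard deviation, computed over its frames.
    mean = feats.mean(axis=0)
    std = feats.std(axis=0)
    return (feats - mean) / (std + 1e-8)

def global_cmvn(feats, global_mean, global_std):
    # Global CMVN: every utterance is normalized by corpus-level
    # statistics computed once during preprocessing (what the
    # global-CMVN setup in simul_mustc_example.md does).
    return (feats - global_mean) / (global_std + 1e-8)
```

A model trained on globally normalized features sees a very different input distribution if utterance CMVN is applied at inference instead, which would be consistent with the degenerate "(Applaus)" outputs above.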
Sorry, by the way, have you used the MMA model (text-to-text)? Do you know how to write agent.py for the En-De dataset?
@1190301804 Sorry, I have not used the MMA model...
Hi, @keiouok. I also misused the "non-global CMVN" config at first.
I trained with global CMVN but ran inference with utterance CMVN, and the performance was poor.
Then I used global CMVN for evaluation and averaged the 5 best checkpoints on the development set, but the result remained poor (BLEU = 0.05; the training set is 80 hours of Chinese-English BSTC + CommonVoice data, with a pretrained ASR whose CER is 13% and an offline ST BLEU of 8.6). The translations in instance.log are totally irrelevant to the audio.
I thought maybe the learning rate was too small, so I increased it to 1e-3 / 1e-2 while keeping the other parameters unchanged, but the result got worse; lr = 1e-4 is more suitable for this task. I also noticed that the --task in your script differs from simul_mustc_example.md (simul_speech_to_text vs. speech_to_text), but that doesn't matter, actually.
Now I am confused about how to reproduce the simultaneous ST result. Is the dataset I used too small?
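As a side note on the checkpoint averaging mentioned above: fairseq ships scripts/average_checkpoints.py for this, and the core operation is just an element-wise mean of the model parameters. A toy sketch (plain lists of floats instead of real tensors, purely for illustration):

```python
def average_state_dicts(states):
    # states: list of dicts mapping parameter name -> list of floats.
    # Returns the element-wise average across all state dicts,
    # which is the essence of checkpoint averaging.
    n = len(states)
    return {
        key: [sum(vals) / n for vals in zip(*(s[key] for s in states))]
        for key in states[0]
    }
```

Averaging the N best dev checkpoints usually smooths out noise between updates, but it cannot rescue a model whose individual checkpoints are already near-random, as in the BLEU = 0.05 case here.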
Thank you, @duj12. I have only succeeded on the En-De dataset. Now I'm trying the MuST-C En-Ja dataset from https://iwslt.org/2022/offline#allowed-training-data. However, the result was poor even for offline ST (pretrained ASR WER was about 14). The offline ST BLEU was about 0.1 (the dev-st accuracy was about 30 and never exceeded 40; the early-stopping patience was 16).
Now I'm tuning the learning rate (the convtransformer default simulST lr is 0.0005), but it hasn't improved. With a larger lr like 0.02 / 0.002, the accuracy got worse; with a smaller lr like 0.0001, the loss and accuracy converged more slowly, but the final dev-st accuracy still didn't exceed 30. I also tried s2t_transformer, but the result was the same. Needless to say, simulST also scored terribly, even with wait-k (k = 100).
I'm also confused by these results... I'm sorry I can't help you more right now. If I find anything, I'll let you know.
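For context on the learning rates being compared above: in the fairseq speech-to-text recipes, the configured lr is the peak of an inverse-square-root schedule with linear warmup, so the effective lr also depends on the warmup setting. A rough sketch of that schedule (parameter values are illustrative defaults, not necessarily this recipe's exact configuration):

```python
import math

def inverse_sqrt_lr(step, peak_lr=5e-4, warmup=4000, init_lr=1e-7):
    # Linear warmup from init_lr to peak_lr over `warmup` updates,
    # then decay proportional to 1/sqrt(step), in the style of
    # fairseq's --lr-scheduler inverse_sqrt.
    if step < warmup:
        return init_lr + (peak_lr - init_lr) * step / warmup
    return peak_lr * math.sqrt(warmup / step)
```

With this shape, quadrupling the number of updates after warmup halves the lr, so comparing raw peak values (0.0005 vs. 0.0001) without also considering warmup can be misleading.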
It gets even worse. Following simul_mustc_example.md for preprocessing and training on the full MuST-C En-De dataset (69 GB), I trained the ASR model for about 120 hours on 8 GPUs for 900 epochs (I shut it down because it was too slow), and the ST model for about 70 epochs (I also stopped it early because it was too slow). For evaluation, I used seg_mustc_data.py to split the dataset and evaluated on 100 of the sentences (SimulEval raised "Connection refused" when the test set was large). The result is very poor...
Does anyone have any suggestions? Thank you!
2022-03-09 20:34:51 | INFO | simuleval.cli | Evaluation results:
{
"Quality": {
"BLEU": 0.2027780041409297
},
"Latency": {
"AL": 1248.1308325195312,
"AL_CA": 15497.652229003907,
"AP": 0.3861502431333065,
"AP_CA": 39.99524466373026,
"DAL": 1411.803270072937,
"DAL_CA": 19719.095799560546
}
}
After debugging the training code, I found the reason. Here is an example: encoder_state has 14 frames, text_sequence has 4 tokens, and batch_size = 1. In wait-k (k=3) mode, p_choose (in p_choose_strategy.py) is formulated as:

[[[0 0 1 0 0 0 0 0 0 0 0 0 0 0]
  [0 0 0 1 0 0 0 0 0 0 0 0 0 0]
  [0 0 0 0 1 0 0 0 0 0 0 0 0 0]
  [0 0 0 0 0 1 0 0 0 0 0 0 0 0]]]

alpha is the same as p_choose, and beta is something like:

[[[0.3 0.3 0.4 0   0   0   0 0 0 0 0 0 0 0]
  [0.2 0.3 0.3 0.2 0   0   0 0 0 0 0 0 0 0]
  [0.2 0.2 0.1 0.2 0.3 0   0 0 0 0 0 0 0 0]
  [0.1 0.2 0.2 0.1 0.2 0.2 0 0 0 0 0 0 0 0]]]

We can see that only the context within frames [0, k + text_seq_len) = [0, 7) is included in the weighted sum, while the total context length equals the length of encoder_state.
So in this implementation, too much context near the tail is ignored, which may be the reason for the poor performance.
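The hard wait-k alignment described above can be reproduced with a small sketch (my own toy re-implementation, not the p_choose_strategy.py code): target step i reads exactly one new source frame, starting from frame k.

```python
import numpy as np

def waitk_p_choose(src_len, tgt_len, k):
    # Hard wait-k read/write schedule: target token i (0-indexed)
    # attends to source frame (k - 1) + i, clipped to the last frame.
    p = np.zeros((1, tgt_len, src_len))  # batch_size = 1
    for i in range(tgt_len):
        p[0, i, min(k - 1 + i, src_len - 1)] = 1.0
    return p

p = waitk_p_choose(src_len=14, tgt_len=4, k=3)
# Reproduces the matrix above: ones at columns 2, 3, 4, 5, so any soft
# attention mass (beta) derived from it stays within the first
# k + tgt_len frames, leaving the tail of encoder_state unused.
```

This illustrates the point above: during training, the attended context is bounded by k + text_seq_len regardless of how long encoder_state actually is.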
I tried to train simultaneous speech translation following simul_mustc_example.md. I trained simulST on the
bc3bd55ec98c39af45ff7323ae49bcbdf93acc36
branch (because on the main branch at 1ef3d6a1a2cb7fa9937233c8bf796957871bfc94,
a "Not found arch" error occurred; preprocessing and ASR pretraining were done on the main branch). However, SimulEval's BLEU result was terribly low (the documentation reports a BLEU of about 13).
And most of the predictions in instance.log were like "(Applause)" or "(Musik)".
If there are any solutions, please let me know. Thank you.
Code
Pre-trained ASR: checkpoint_best.pt with this code
Simultaneous speech translation
What's your environment?
[figure: dev_st loss curve]