danliu2 / caat

MIT License

Cannot achieve your performance on speech-to-text #10

Closed sarapapi closed 2 years ago

sarapapi commented 2 years ago

Hi again, I tried to train your model without ASR pre-training and KD, and as expected it did not perform very well. So I pre-trained the encoder with the script you provided and performed the KD with a model of mine, trained on the OPUS dataset, which scores more than 35 BLEU on MuST-C En-De tst-COMMON. I then trained CAAT loading the pre-trained encoder and using the MuST-C data with KD, as prescribed in your README instructions. I always used exactly your code for launching the experiments and the evaluation. The differences are: 1) the filterbanks, which I already had and did not recompute, to stay comparable with the standard Fairseq models; 2) the vocabulary: again, I used the Fairseq vocabularies, which are still SentencePiece vocabularies but of size 8k for both source and target; and 3) the KD, which was performed with another model that is stronger than an MT model trained only on MuST-C. Note that the filterbanks and the vocabularies used for training CAAT are the same ones I used for training the Fairseq models, which score results comparable to the Facebook paper. However, the results I got from SimulEval (which still takes ages to output anything, and I have not discovered why) are plausible but 3 or 4 BLEU points lower than the ones presented in your paper. What do you think the problem is? Are the changes I made very impactful on the results?

Thanks again

EDIT: I have noticed that the perplexity of the CAAT model is ppl=15 at the end of training, which is very high. In contrast, the pre-training phase reached a very low perplexity (under 2), so I am wondering if there is something wrong with my CAAT training.

danliu2 commented 2 years ago

Hi, sorry for the lack of some extra scripts. The perplexity of my CAAT model is about 6 at the end of training, so there may be some problem in your training. Here are some notes about model training; I hope they help you find the cause:

  1. CAAT should achieve performance comparable to a standard speech Transformer in offline inference, or in simultaneous inference with very high latency. Can you check the performance of the offline model with the same pre-trained encoder and distillation?
  2. The total batch size (batch size per GPU × update_freq × number of GPUs) is important for training stability. Is it smaller in your experiments than in mine (2,816,000 frames)?
  3. What is the AL in your experiments, and what is the decision step size in your training and inference? The lower the AL, the worse the BLEU.
  4. Sequence-level knowledge distillation is important in my experiments, and I used both the golden translations and the pseudo-labels generated by the MT model (so the training data size doubled). This may be different from some previous work on sequence-level distillation, but it is more effective.
  5. The offline auxiliary loss function is very important for model training.

I hope these are helpful for you. Feel free to let me know if you have further questions; you can also contact me directly by mail at danliu@mail.ustc.edu.cn.
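The data-doubling idea in point 4 could be sketched as follows. This is only an illustration of the described recipe, not code from the repository; `build_kd_training_set` and all names in it are hypothetical.

```python
# Sketch of the sequence-level KD setup described above: every source is
# paired with BOTH its golden (human) translation and the pseudo-label
# generated by the MT teacher, so the training set doubles in size.

def build_kd_training_set(sources, golden_targets, pseudo_targets):
    """Return (source, target) pairs with golden and distilled targets."""
    assert len(sources) == len(golden_targets) == len(pseudo_targets)
    pairs = list(zip(sources, golden_targets))    # original data
    pairs += list(zip(sources, pseudo_targets))   # KD pseudo-labels
    return pairs

if __name__ == "__main__":
    src = ["audio_0001", "audio_0002"]
    gold = ["Hallo Welt", "Guten Morgen"]
    pseudo = ["Hallo, Welt", "Guten Morgen!"]
    data = build_kd_training_set(src, gold, pseudo)
    print(len(data))  # twice the original size
```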
sarapapi commented 2 years ago

Hi, thanks for your quick reply. I will evaluate the offline performance as soon as I resolve the problem in issue #11 and will let you know, but the ppl clearly shows that something went wrong in my training. I used only the distilled data for training; I think this really made the difference. So I will do as you did and use both the distilled and the original data for training. Just to be sure: your max tokens were 16000, the number of GPUs was 2, and the update frequency was 8, right? Since I do not have 32GB V100 GPUs, I will use max tokens of 8000, 4 GPUs, and still an update frequency of 8, right?

Thank you again!

danliu2 commented 2 years ago

8000 × 4 × 8 is identical to my experiments. Is it OK now?
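The equivalence of the two configurations can be checked with a line of arithmetic: the effective batch size is max tokens per GPU times the number of GPUs times the update frequency, and both settings discussed above give the same total.

```python
# Effective batch size in fairseq-style training:
# max_tokens_per_gpu * n_gpus * update_freq.
def effective_batch(max_tokens, n_gpus, update_freq):
    return max_tokens * n_gpus * update_freq

original = effective_batch(16000, 2, 8)  # 2x V100 32GB setup
adjusted = effective_batch(8000, 4, 8)   # 4 smaller GPUs, halved max tokens
assert original == adjusted == 256000
```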

sarapapi commented 2 years ago

Hi, the performance of the model is still not good enough. I am running inference with the En-Es model, which reached a ppl of 6.37, so I do not have that result yet, and I am experiencing issues with offline generation (described in #11). But I have partial results for En-De, which reached a ppl of 8.32: for example, at 1988 AL I got 16.64 BLEU, which is about 6 BLEU points less than the results in your paper. Note that when training with the KD data only I scored about 13 BLEU, so the last training improved after your suggestions, but apparently this is not enough. What can I do? Are you going to release the checkpoint of your final model?

sarapapi commented 2 years ago

I also noticed overgeneration in the predictions, including in the En-Es generation; have you noticed something similar? The ppl of the training is good but the online performance is even worse: I got 10 BLEU at 980 AL on Spanish. I think there are some problems in the inference part, since 1) the model overgenerates a lot, and 2) it takes ages to generate, which cannot depend only on my machines, since the Fairseq model takes around 1 hour while CAAT takes 4 or 5 days...

danliu2 commented 2 years ago

Sorry to reply so late, due to the Chinese Spring Festival holiday. Here is my training log with --transducer-downsample 16 on En-Es MuST-C: valid | epoch 078 | loss 7.801 | delay loss 0.064 | nll_loss 3.013 | ppl 8.07 | prob_loss 3.03, and the inference result on tst-COMMON is AL: 623.148, AP: 0.674, DAL: 1359.39, BLEU: 25.84.

Your ppl seems to be much smaller than mine, and my inference does not produce overgeneration. It seems your model was just trained for offline use. So I wonder: have you trained your model with my loss function ("--criterion fake_loss --arch audio_cat")? And what is the scale for the latency loss ("--delay-scale")? Can you send your full training log to my mailbox? Maybe I can help you find the reason.
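For reference, the flags quoted in this thread would combine into a training invocation roughly like the one below. This is only a sketch assembled from the options mentioned above; the data directory, the --delay-scale value, and every omitted hyperparameter are placeholders, not the author's actual command.

```shell
# Sketch only: combines the flags quoted in this thread.
# <data-dir> and the --delay-scale value are placeholders.
fairseq-train <data-dir> \
    --arch audio_cat \
    --criterion fake_loss \
    --delay-scale 1.0 \
    --transducer-downsample 16 \
    --max-tokens 8000 \
    --update-freq 8
```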

The offline inference of the CAAT model seems to be buggy due to my and Fairseq's modifications; you may skip those bugs by bypassing some code in Fairseq's generate.py (https://github.com/danliu2/caat/issues/11#issuecomment-1032153850). That result should be helpful for our analysis.

danliu2 commented 2 years ago

solved