Is there any mismatch between training and inference?
Hi @ddlBoJack
Thanks a lot for replying!
One thing I do observe is that we have a lot of no-speech audio, as well as very short audio segments, in our testing dataset.
A first step I explored is adding a `<NO_SPEECH>` tag to the prompt, while adding a few randomly sampled no-speech music/noise clips to our training data with that tag as the label. This seems to work for no-speech audio.
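For reference, a minimal sketch of how such `<NO_SPEECH>` augmentation entries could be generated (the manifest schema, field names, and `MUSAN_DIR` path are illustrative assumptions, not the recipe's actual format):

```python
import json
import random
from pathlib import Path

# All paths and field names below are assumptions for illustration.
MUSAN_DIR = Path("/data/musan")             # assumed location of the Musan noise/music clips
OUT_MANIFEST = Path("train_nospeech.jsonl")
NO_SPEECH_TAG = "<NO_SPEECH>"
NUM_SAMPLES = 2000                          # keep this a small fraction of the speech data

clips = list(MUSAN_DIR.glob("noise/**/*.wav")) + list(MUSAN_DIR.glob("music/**/*.wav"))
sampled = random.sample(clips, k=min(NUM_SAMPLES, len(clips)))

with OUT_MANIFEST.open("w") as f:
    for wav in sampled:
        # Each no-speech clip gets the tag as its transcription target, so the model
        # learns to emit the tag instead of hallucinating text for silence/noise.
        entry = {"audio": str(wav), "target": NO_SPEECH_TAG}
        f.write(json.dumps(entry) + "\n")
```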
But another thing I notice is that the model tends to mark short utterances with `<NO_SPEECH>` (after adding this tag), or to be repetitive / refuse to transcribe them (before adding the no-speech tag).
The typical characteristics of this kind of audio: very short (0.5-1 seconds), with only a few words in the ground truth (e.g. "hi", "okay", "right", "thank you", "wow", "that's helpful").
I am currently not sure whether this is because we do not have enough such audio in our training dataset, whether it is related to a training parameter (might this be related to the "projector downsampling rate"?), or whether it is a bug in the code base. Do you have any suggestions?
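As a rough back-of-envelope check on the short-audio point (assuming a HuBERT-style encoder at ~50 frames/s and a frame-stacking downsample factor of 5; adjust to your actual config):

```python
ENCODER_FPS = 50         # HuBERT-style encoders emit roughly one frame every 20 ms
DOWNSAMPLE_RATE = 5      # projector stacks this many frames into one LLM-facing embedding

def num_llm_tokens(duration_s: float) -> int:
    """Approximate number of audio embeddings the LLM sees for a clip."""
    frames = int(duration_s * ENCODER_FPS)
    return frames // DOWNSAMPLE_RATE

for dur in (0.5, 1.0, 2.0, 5.0):
    print(f"{dur:>4.1f} s -> ~{num_llm_tokens(dur)} audio tokens")
# A 0.5 s clip yields only ~5 audio embeddings, which gives the LLM very little
# acoustic context and may contribute to the <NO_SPEECH>/refusal behaviour.
```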
Any suggestions would be really appreciated!
Thanks!
Hi @billweasley,
I also encountered this dilemma: I trained the ASR recipe on the wenet dataset, but things don't work well. The decoding results show that the model just tries to say sorry or repeats some word, just like yours. What I find is that the model seems to fall into a local optimum, with accuracy stuck at about 45%. I think it might be a problem with the prompt, since I found the influence of the prompt is really large, but I still have no idea how to fix this. I also see that you are trying to use the `<NO_SPEECH>` tag in the prompt.
Hi @PigeonDan1
It should converge really quickly - at least in my setup, at roughly 4000~5000 steps with batch size 4, I can see an "emerging" behaviour in the loss/accuracy - but longer training seems to help further improve the results. Due to the issue above, though, I have not gotten production-ready results yet.
Hi @billweasley, thank you for your reply. For point 1, I will try to apply it in my experiments; my dataset is also in a short format (only 1 s-4 s, sometimes <1 s), so I think it might help. For point 2, I tried it and found the same result as you, but I am running on a Chinese dataset, so it does not seem to help in my case. For point 3, I actually reach roughly 45% accuracy at 4000-5000 steps, and it stays there for a long time (at least 3 epochs) and only improves a little. This is mentioned in the article, but there the local optimum lasts for a shorter time, roughly 4k steps. I am now waiting for the results of training for more epochs, and I will give you feedback if I make some progress.
Hi @PigeonDan1
Adding the `<NO_SPEECH>` prompt + the Musan dataset with such a label seems helpful in controlling the output format for audio without speech, or audio with very little speech. We can check whether that tag exists in the hypothesis and remove unexpected outputs during post-processing (see the sketch at the end of this comment).
- For point 1, I tried the checkpoint from the authors - they seem to use ALL UPPERCASE labels, which makes it possible to distinguish good results from random LLM outputs. That might also work for telling good output from malformed output.
- The model performance still seems to suffer on very short audio (e.g. 0.5-1 s), and in some cases performance drops at the start of the audio (i.e. deletion errors). I suspect a bug somewhere - or maybe it is a characteristic of LLaVA-style models. Not really sure about this, though; it needs further investigation.
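A sketch of the post-processing mentioned above (the tag and the ALL-UPPERCASE heuristic come from this thread; the helper itself is illustrative, not from the repo):

```python
NO_SPEECH_TAG = "<NO_SPEECH>"

def postprocess_hypo(hypo: str) -> str:
    """Clean one decoded hypothesis before scoring or serving."""
    # If the no-speech tag appears anywhere, treat the whole segment as silence.
    if NO_SPEECH_TAG in hypo:
        return ""
    # Heuristic from the authors' checkpoints: well-formed ASR output is ALL UPPERCASE,
    # so anything containing lowercase is likely LLM chatter and can be dropped.
    if hypo != hypo.upper():
        return ""
    return hypo.strip()

print(repr(postprocess_hypo("<NO_SPEECH> <NO_SPEECH>")))       # ''
print(repr(postprocess_hypo("Sorry, I can not transcribe.")))  # ''  (not uppercase -> malformed)
print(repr(postprocess_hypo("THANK YOU")))                     # 'THANK YOU'
```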
Hello @billweasley, I have a question about point 1. Since the repetition problem is caused by the history tokens generated by the LLM, is it possible that the model will generate many `<NO_SPEECH>` tags in a row?
@fclearner Thanks for your question.
It seems that it can happen that the output contains multiple `<NO_SPEECH>` tags; those can be removed during post-processing.
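If repeated tags do show up, collapsing them is a small post-processing step, e.g. (illustrative):

```python
import re

def collapse_no_speech(hypo: str, tag: str = "<NO_SPEECH>") -> str:
    """Merge a run of repeated no-speech tags (with optional whitespace between) into one."""
    return re.sub(rf"(?:{re.escape(tag)}\s*)+", tag, hypo).strip()

print(collapse_no_speech("<NO_SPEECH> <NO_SPEECH> <NO_SPEECH>"))  # -> "<NO_SPEECH>"
```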
System Info
PyTorch 2.3.1+cu121
CUDA 12.2
GPU: Nvidia H100, 2 machines x 8 GPUs, DDP only, FP16
Information
🐛 Describe the bug
Not really a bug... I tried to follow the instructions to fine-tune the model on my company's in-house data (~24k hours of English data, with mostly the config mentioned in https://arxiv.org/abs/2402.08846).
When decoding, I see output with the following three types of errors:
These three issues make the WER pretty high, so I am here seeking advice: did the authors come across the same issues? Does anyone have any suggestions?
Encoder: Hubert xtlarge
LLM: Vicuna 7B v1.5
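For context on the "projector downsampling rate" discussed above, a minimal sketch of a LLaVA-style projector that stacks k encoder frames per LLM embedding (dimensions and the exact module layout are assumptions; the repo's implementation may differ):

```python
import torch
import torch.nn as nn

class FrameStackProjector(nn.Module):
    """Downsample encoder output by stacking k consecutive frames, then project
    into the LLM embedding space (dimensions here are placeholders)."""
    def __init__(self, enc_dim: int = 1280, llm_dim: int = 4096, k: int = 5):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(enc_dim * k, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, enc_dim) from the speech encoder
        b, t, d = feats.shape
        t = t - (t % self.k)                          # drop frames that do not fill a group
        feats = feats[:, :t, :].reshape(b, t // self.k, d * self.k)
        return self.proj(feats)                       # (batch, T // k, llm_dim)

# A 0.5 s clip at ~50 frames/s gives 25 encoder frames -> only 5 LLM-facing embeddings with k=5.
proj = FrameStackProjector()
print(proj(torch.randn(1, 25, 1280)).shape)           # torch.Size([1, 5, 4096])
```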
Error logs
N/A
Expected behavior
N/A