X-LANCE / SLAM-LLM

Speech, Language, Audio, Music Processing with Large Language Model

Training suggestions? How to reduce LLM outputs like "I am sorry, I'm an AI language model and I don't have the ability to transcribe speech to text" #113

Closed: billweasley closed this issue 1 week ago

billweasley commented 4 months ago

System Info

PyTorch 2.3.1+cu121, CUDA 12.2, NVIDIA H100 GPUs (2 machines × 8 GPUs), DDP only, FP16

🐛 Describe the bug

Not really a bug... I tried to follow the instructions to fine-tune the model on my company's in-house data (~24k hours of English data), using mostly the config mentioned in https://arxiv.org/abs/2402.08846.

When decoding, I find three types of errors in the output:

  1. The LLM refuses to do the decoding and outputs something like:
    • "I'm sorry, I'm not sure what you mean by "transcribe speech to text." Could you please provide more context or clarify your request?"
    • "I'm sorry, I'm an AI language model and I don't have the ability to transcribe speech to text. However, there are many speech-to-text software and apps available that can help you with that. You can search for "speech-to-text software" or "speech-to-text app" to find some options."
  2. Loopy output, something like: "okay okay okay okay okay okay okay okay okay okay okay okay okay okay okay okay" (the same word repeated until the output ends)
  3. Empty output

These three issues make the WER pretty high. I am here seeking advice: did the authors come across the same issues? Does anyone have any suggestions?

Encoder: HuBERT xtralarge
LLM: Vicuna 7B v1.5

Error logs

N/A

Expected behavior

N/A

ddlBoJack commented 4 months ago

Is there any mismatch between training and inference?

billweasley commented 4 months ago

Hi @ddlBoJack

Thanks a lot for replying!

One thing I do observe: our test dataset contains many no-speech audios and very short audio segments. As a first step, I added a <NO_SPEECH> tag to the prompt and mixed a few randomly sampled no-speech music/noise clips into our training data, labeled with that tag. This seems to work for no-speech audio.
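
For reference, here is a minimal sketch of how such no-speech clips could be mixed into a training manifest. The JSONL field names (`audio_path`, `text`), the `noise_ratio`, and the `<NO_SPEECH>` tag are my own choices rather than anything prescribed by SLAM-LLM, so adapt them to whatever manifest format your recipe actually uses:

```python
import json
import random
from pathlib import Path

NO_SPEECH_TAG = "<NO_SPEECH>"  # hypothetical label; use whatever tag your prompt mentions

def build_manifest(speech_entries, noise_dir, out_path, noise_ratio=0.02, seed=0):
    """Write a JSONL manifest that mixes a small fraction of no-speech clips
    (e.g. MUSAN music/noise) labeled with NO_SPEECH_TAG into the speech data."""
    rng = random.Random(seed)
    noise_files = sorted(Path(noise_dir).rglob("*.wav"))
    n_noise = int(len(speech_entries) * noise_ratio)
    noise_entries = [
        {"audio_path": str(p), "text": NO_SPEECH_TAG}
        for p in rng.sample(noise_files, min(n_noise, len(noise_files)))
    ]
    entries = speech_entries + noise_entries
    rng.shuffle(entries)
    with open(out_path, "w", encoding="utf-8") as f:
        for e in entries:
            f.write(json.dumps(e, ensure_ascii=False) + "\n")

# Usage: speech_entries is a list of {"audio_path": ..., "text": ...} dicts.
# build_manifest(speech_entries, "musan/noise", "train_with_no_speech.jsonl")
```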

But another thing I notice: the model tends to mark short utterances with <NO_SPEECH> (after adding this tag), or to be repetitive / refuse to transcribe them (before the no-speech tag was added). The typical characteristics of this kind of audio: very short (0.5 - 1 seconds), with only a few words in the ground truth (e.g. "hi", "okay", "right", "thank you", "wow", "that's helpful").

I am currently not sure whether this is because we do not have enough such audio in our training dataset, whether it is related to a training parameter (could it be related to the "projector downsampling rate"?), or whether it is a bug in the code base. Do you have any suggestions, though?
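
For anyone following along, the projector downsampling mentioned above is the step that decides how many speech "tokens" the LLM actually sees. The SLAM-ASR paper describes stacking k consecutive encoder frames before a linear projection; the sketch below is my own illustration of that idea, and the module name and dimensions are assumptions rather than the repo's exact configuration:

```python
import torch
import torch.nn as nn

class StackingProjector(nn.Module):
    """Illustrative projector: stack k consecutive encoder frames, then apply a
    linear projection to the LLM embedding size. Dimensions are assumptions
    (HuBERT xtralarge hidden size 1280, Vicuna 7B embedding size 4096)."""

    def __init__(self, enc_dim=1280, llm_dim=4096, k=5):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(enc_dim * k, llm_dim)

    def forward(self, feats):            # feats: (batch, T, enc_dim)
        b, t, d = feats.shape
        t = t - t % self.k               # drop trailing frames that do not fill a full stack
        feats = feats[:, :t, :].reshape(b, t // self.k, d * self.k)
        return self.proj(feats)          # (batch, T // k, llm_dim)

# A 0.5 s clip at HuBERT's ~50 Hz frame rate gives ~25 frames; with k=5 the LLM
# only sees about 5 speech embeddings, which may be part of why very short
# utterances misbehave.
feats = torch.randn(1, 25, 1280)
print(StackingProjector()(feats).shape)  # torch.Size([1, 5, 4096])
```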

Any suggestions would be really appreciated!

Thanks!

PigeonDan1 commented 3 months ago

Hi @billweasley, I also encountered this problem. I trained the ASR recipe on the wenet dataset, but things don't work well: the decoding results show that the model just tries to say sorry or repeats some word, just like yours. What I find is that the model seems to fall into a local optimum, with its accuracy stuck at around 45%. I think it might be a problem with the prompt, since I found the influence of the prompt to be really large, but I still have no idea how to fix this. I see you tried to use the tag to solve this; does it work well?

billweasley commented 3 months ago

Hi @PigeonDan1

  1. The prompt + MUSAN data labeled with such a tag seems helpful for controlling the output format for audio without speech, or with very little speech. We can then check whether that particular tag appears in the hypothesis and remove unexpected outputs during post-processing (a rough sketch of this is at the end of this comment).
  2. Related to point 1, I tried the checkpoint from the authors; they seem to use ALL-UPPERCASE labels to distinguish good results from random LLM outputs. That might also work for telling good outputs from malformed ones.
  3. Model performance still seems to suffer on very short audio (e.g. 0.5 - 1 s), and in some cases performance drops at the start of the audio (i.e. deletion errors). I suspect a bug somewhere, or maybe a characteristic of LLaVA-style models. Not really sure, though; it needs further investigation.

It should converge really quickly, at least in my setup: after roughly 4000 ~ 5000 steps with batch size 4 I can see an "emerging" behaviour in the loss/accuracy, although longer training seems helpful for further improving the results. Due to the issues above, I have not gotten production-ready results yet.
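
To make point 1 concrete, here is a rough post-processing sketch. The tag, the refusal patterns, and the repetition heuristic are just examples of what I mean, not the exact rules I run:

```python
import re

NO_SPEECH_TAG = "<NO_SPEECH>"
# Hypothetical refusal patterns; extend with whatever your model actually emits.
REFUSAL_PATTERNS = [
    re.compile(r"i'?m sorry", re.IGNORECASE),
    re.compile(r"ai language model", re.IGNORECASE),
]

def clean_hypothesis(hypo: str, max_repeats: int = 4) -> str:
    """Map no-speech / refusal / loopy hypotheses to an empty or collapsed transcription."""
    text = hypo.strip()
    if NO_SPEECH_TAG in text:
        return ""                                   # model flagged non-speech audio
    if any(p.search(text) for p in REFUSAL_PATTERNS):
        return ""                                   # LLM refused to transcribe
    # Collapse pathological repetition such as "okay okay okay ..." into one token.
    words = text.split()
    if len(words) > max_repeats and len(set(words)) == 1:
        return words[0]
    return text

assert clean_hypothesis("<NO_SPEECH>") == ""
assert clean_hypothesis("I'm sorry, I'm an AI language model ...") == ""
assert clean_hypothesis("okay " * 50) == "okay"
assert clean_hypothesis("thank you") == "thank you"
```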

PigeonDan1 commented 3 months ago

Hi @billweasley, thank you for your reply. For point 1, I will try to apply it in my experiments; my dataset is also short-form, only 1 - 4 s and sometimes <1 s, so I think it might help. For point 2, I tried it as well and found the same result as you, but I am running on a Chinese dataset, so it seems to be of no help to me. For point 3, I actually reach roughly 45% accuracy at 4000 - 5000 steps, and it stays there for a long time (at least 3 epochs) and only improves a little. This is also mentioned in the article, but there the local optimum lasts for a shorter time, roughly 4k steps. I am now waiting for the results of training for more epochs, and I will give you feedback if I make some progress.

fclearner commented 3 months ago

Hello @billweasley, I have a question about point 1 in your reply to @PigeonDan1 above. Since the repetition problem is caused by the history tokens generated by the LLM, is it possible that the model will still generate many repetitions when point 1 is used? I haven't tried this yet, I just have a few doubts.

billweasley commented 3 months ago

@fclearner Thanks for your question. It seems that multiple decoded outputs can happen, depending on the LLM's instruction-following capability. But the point is that we will have a fixed pattern for no-speech audio, so we can easily post-process it afterwards.