Open · SoshyHayami opened this issue 1 month ago
@SoshyHayami Recently, I have also been trying to train multilingual ASR with Llama3-8B. I tested LLaMA-1 and LLaMA-2 for multilingual ASR and found that they are not a good choice for multilingual tasks. If you have any good results, please share them with me. Thank you!
@ucasyouzhao1987
LLaMA-1 and LLaMA-2 aren't good out of the box. Unless you're willing to go out of your way with hacks like expanding their tokenizer's vocabulary, continuing pre-training on your target language, etc., you probably won't get good results on these downstream tasks. LLaMA-3-70B is currently the best pre-trained model with multilingual capability.
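For anyone trying the vocabulary-expansion route: with Hugging Face transformers it looks roughly like the sketch below. The helper is plain Python; the commented calls (including the model name and the `my_target_language_tokens` variable) are illustrative assumptions, not from the SALMONN codebase.

```python
# Sketch of tokenizer vocabulary expansion before continued pre-training.
# Only the helper below actually runs; the transformers calls in the
# comment show where it would plug in.

def tokens_to_add(existing_vocab, candidate_tokens):
    """Return the candidate tokens not already in the tokenizer vocab."""
    return [t for t in candidate_tokens if t not in existing_vocab]

# With Hugging Face transformers, roughly (names are examples only):
#   from transformers import AutoTokenizer, AutoModelForCausalLM
#   tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
#   model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
#   new_tokens = tokens_to_add(tok.get_vocab(), my_target_language_tokens)
#   tok.add_tokens(new_tokens)
#   model.resize_token_embeddings(len(tok))  # then continue pre-training
```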
If all you want is ASR, I think you'll get much better results by simply using a dedicated ASR model and feeding its output to another LLM.
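That cascade can be sketched in a few lines. Here the ASR and LLM calls are injected as plain callables so the glue runs without any weights; in practice `transcribe` could wrap Whisper and `generate` any text LLM (both are my assumptions, not a tested pipeline).

```python
# Minimal ASR-then-LLM cascade: transcribe first, then hand the
# transcript to a text-only LLM inside a prompt.

def asr_then_llm(audio_path, question, transcribe, generate):
    """transcribe: audio_path -> text; generate: prompt -> text."""
    transcript = transcribe(audio_path)
    prompt = (
        "Speech transcript:\n"
        f"{transcript}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)
```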
As for me, I need SALMONN mainly for audio captioning and Q&A. Unfortunately, I don't have the compute necessary to train SALMONN; I have 2x V100S (64 GB of VRAM overall), which I guess isn't enough.
Thank you for your attention to our work!
1- The reason we use Vicuna as the LLM is that Vicuna was probably the best (at least better than LLaMA-1) open-source LLM at the time we developed SALMONN. As you said, there are other LLMs with a similar architecture now, so it's definitely a good idea to try them. But you may need to make some modifications to our released code. For example, for LLaMA-3 you need a higher version of torch and to load the model in bf16, etc.
2- I agree with you that the LLM is the bottleneck for multilingual capability, and we didn't train SALMONN with as many computational resources as Qwen-Audio.
3- ICL is definitely a very interesting direction. Unfortunately, the current SALMONN is not designed for ICL, so I think some further fine-tuning is necessary to boost its ICL ability.
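On the ICL point, one lightweight way to experiment without touching the training code is to thread previous turns (or few-shot examples) back into the text prompt yourself. A minimal sketch, independent of the SALMONN inference code; the role labels and layout here are assumptions, not the model's actual prompt format:

```python
# Minimal few-shot / history prompt builder for probing in-context
# learning: earlier turns are re-serialized into a single text prompt.

def build_prompt(history, user_message, system="You are a helpful assistant."):
    """history: list of (user, assistant) turn pairs; returns one prompt."""
    parts = [system]
    for user, assistant in history:
        parts.append(f"USER: {user}")
        parts.append(f"ASSISTANT: {assistant}")
    parts.append(f"USER: {user_message}")
    parts.append("ASSISTANT:")
    return "\n".join(parts)
```

After each model reply, append the `(user_message, reply)` pair to `history` and rebuild the prompt for the next turn.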
Hi, thanks for bringing this awesome work to us.
I'll jump straight to the questions I have:
1- Is there any particular reason you chose Vicuna? Is the code compatible with Mistral or LLaMA-3, since they both seem to use a similar architecture? I'm really interested in LLaMA-3 as it has strong multilingual capability out of the box.
2- The model appears to have relatively weak performance on languages other than English (I haven't tested Chinese) compared to Qwen-Audio. I assume the bottleneck is the LLM and the distribution of the dataset, since Whisper's encoder should be more than capable of handling that. If that's the case, then a simple LoRA, as used to instruction-tune the model, wouldn't work here, because LLaMA-1 and LLaMA-2 are just not good base models for multilingual capability. Is that right?
3- What do you think about in-context learning? The current inference code doesn't keep track of the history; do you think few-shotting could boost performance? Right now it takes quite a bit of prompting to make sure the model doesn't hallucinate.
Again, thank you very much for this work. Apart from the limitations above, I found SALMONN to be far superior to any audio LLM I've tried so far.