bytedance / SALMONN

SALMONN: Speech Audio Language Music Open Neural Network
https://bytedance.github.io/SALMONN/
Apache License 2.0
908 stars 63 forks

A few questions... #46

Open SoshyHayami opened 1 month ago

SoshyHayami commented 1 month ago

Hi, thanks for bringing this awesome work to us.

I'll jump straight to the questions I have:

1- Is there any particular reason you chose Vicuna? Is the code compatible with Mistral or Llama 3, since they both seem to use a similar architecture? I'm really interested in Llama 3, as it has strong multilingual capability out of the box.

2- The model appears to have relatively weak performance on languages other than English (I haven't tested Chinese) compared to Qwen-Audio. I assume the bottleneck is the LLM and the distribution of the dataset, since Whisper's encoder should be more than capable of handling that. If that's the case, then a simple LoRA, as used to instruction-tune the model, wouldn't work here, because LLaMA-1 and LLaMA-2 are just not good base models for multilingual capabilities. Is that right?

3- What do you think about in-context learning? The current inference code doesn't keep track of the history; do you think few-shotting could boost performance? Right now it requires quite a bit of prompting to make sure the model doesn't hallucinate.

Again, thank you very much for this work. Aside from the limitations above, I found SALMONN to be far superior to any audio LLM I've tried so far.

ucasyouzhao1987 commented 1 month ago

@SoshyHayami Recently, I have also been trying to train multilingual ASR with Llama3-8B. I tested LLaMA-1 and LLaMA-2 for multilingual ASR and found that they are not a good choice for multilingual tasks. If you get any good results, please share them with me. Thank you!

SoshyHayami commented 1 month ago

@ucasyouzhao1987

LLaMA-1 and LLaMA-2 aren't good out of the box. Unless you're willing to go out of your way with hacks like expanding the tokenizer's vocabulary, continuing pre-training on your target language, etc., you probably won't get good results on these downstream tasks. Llama 3-70B is currently the best pre-trained model with multilingual capacity.

If all you want is ASR, I think you'll get much better results by simply using a dedicated ASR model and feeding its output to another LLM.
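That cascade idea can be sketched roughly as follows; `transcribe` and `generate` here are placeholder stubs standing in for a real ASR model (e.g. Whisper) and a real LLM, not actual APIs:

```python
# Sketch of an ASR -> LLM cascade. Both model calls are placeholder
# stubs; in practice they would wrap e.g. Whisper and an
# instruction-tuned LLM.

def transcribe(audio_path: str) -> str:
    """Stub ASR model: returns a transcript for the audio file."""
    return "hello world"  # a real system would run speech recognition here

def generate(prompt: str) -> str:
    """Stub LLM: returns a completion for the prompt."""
    return f"Answer based on: {prompt}"

def asr_then_llm(audio_path: str, question: str) -> str:
    """Cascade: transcribe the audio, then ask the LLM about the transcript."""
    transcript = transcribe(audio_path)
    prompt = f"Transcript: {transcript}\nQuestion: {question}"
    return generate(prompt)

print(asr_then_llm("clip.wav", "What was said?"))
```

The advantage of this split is that each stage can be swapped independently, at the cost of losing non-lexical audio information (speaker, emotion, background sounds) before the LLM ever sees it.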

As for me, I need SALMONN mainly for audio captioning and Q&A. Unfortunately, I don't have the compute needed to train SALMONN: I have 2x V100S (64 GB of VRAM overall), which I guess isn't enough.

Yu-Doit commented 1 month ago

Thank you for your attention to our work!

1- The reason we used Vicuna as the LLM is that Vicuna was probably the best (at least better than LLaMA-1) open-source LLM at the time we developed SALMONN. As you said, there are now other LLMs with a similar architecture, so trying them is definitely worthwhile, but you might need to make some modifications to our released code. For example, for LLaMA-3 you need a newer version of torch and you need to load the model in bf16, etc.
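For reference, loading a LLaMA-3 checkpoint in bf16 with Hugging Face transformers typically looks like the sketch below; the model id and `device_map` value are illustrative, and the download requires accepting the model's license on the Hub:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model id; assumes a recent transformers and torch build
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # load weights in bf16 as noted above
    device_map="auto",           # spread layers across available GPUs
)
```

Note that bf16 requires hardware support (Ampere-class or newer NVIDIA GPUs); on older cards such as V100s you would have to fall back to fp16 or fp32.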

2- I agree that the LLM is the bottleneck for multilingual capability, and we didn't train SALMONN with as many computational resources as Qwen-Audio.

3- ICL is definitely a very interesting point. Unfortunately, the current SALMONN was not trained for ICL, so I think some further fine-tuning is necessary to boost its ICL ability.
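As a rough illustration of the few-shot idea discussed above, exemplars could be packed into the text prompt ahead of the real query. The `Audio:`/`Answer:` template below is hypothetical, not SALMONN's actual prompt format:

```python
def build_few_shot_prompt(examples, query):
    """Concatenate (audio_description, answer) exemplars ahead of the real query.

    `examples` is a list of (audio_description, expected_answer) pairs;
    the template itself is purely illustrative.
    """
    parts = []
    for description, answer in examples:
        parts.append(f"Audio: {description}\nAnswer: {answer}")
    # The final item is the actual query, with the answer left blank
    # for the model to complete.
    parts.append(f"Audio: {query}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    [("a dog barking twice", "dog barking"),
     ("rain on a tin roof", "rainfall")],
    "a siren passing by",
)
print(prompt)
```

Whether such exemplars actually help would depend on the model having seen similarly structured prompts during training, which is exactly the fine-tuning gap noted above.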