bytedance / SALMONN

SALMONN: Speech Audio Language Music Open Neural Network
https://bytedance.github.io/SALMONN/
Apache License 2.0

the instruction tuning stage #32

Closed: yttas closed this issue 8 months ago

yttas commented 8 months ago

I'm a little confused about the paper.

  1. The paper attributes task over-fitting to the instruction tuning stage. Why can't we directly use the model for zero-shot tasks after the pre-training stage? In other words, what benefit does the instruction tuning stage bring to the model's zero-shot ability?
  2. In Figure 3, the accuracy and F1 score on SQQA are basically the same at lora scaling=0 and lora scaling=2. Does this phenomenon show that the Q-Former's cross-modal ability learned in the first stage can already solve this task?
TCL606 commented 8 months ago

In fact, the model that has only gone through the pre-training stage can only perform ASR and AAC tasks and is completely incapable of performing any zero-shot tasks. In other words, the problem of task over-fitting is even more serious at that point. As you can see in the paper, we introduce a large amount of QA data in the instruction tuning stage so that the model sees more varied prompts, which alleviates the problem of the model not following instructions. However, the model still struggles with more difficult tasks unless the LoRA scaling factor is reduced, i.e., unless its cross-modal abilities are activated.

For your second question, I don't quite understand it. In both the pre-training stage and the instruction tuning stage, the Q-Former and LoRA are updated, and we used the model after instruction tuning to plot Figure 3. I think the phenomenon you mentioned only demonstrates that reducing the LoRA scaling to 2.0 is sufficient to activate the model's capacity; it does not directly indicate that the pre-trained model can solve these tasks.
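For context, "LoRA scaling" here is the multiplier applied to the low-rank update before it is added to the frozen LLM weights. The sketch below is a minimal, assumed PyTorch illustration of that mechanism (it is not SALMONN's actual implementation, and the `LoRALinear` class and the usage values are illustrative only); it shows what "reducing the LoRA scaling factor at test time" means mechanically.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Minimal LoRA-augmented linear layer (illustrative sketch only).

    The low-rank update B(A(x)) is added to the frozen base projection,
    multiplied by a scaling factor (commonly alpha / r). Lowering `scaling`
    at inference time discounts the LoRA adaptation, which is what
    "reducing the LoRA scaling factor" refers to in the discussion above.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 32.0):
        super().__init__()
        self.base = base                    # frozen pre-trained projection
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # update starts as a no-op
        self.scaling = alpha / r            # the LoRA scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))


# Hypothetical usage: after training, the scaling can be overridden to a
# smaller value (e.g. 2.0, as in the Figure 3 discussion) to weaken the
# LoRA influence without changing any weights.
layer = LoRALinear(nn.Linear(4096, 4096))
layer.scaling = 2.0
out = layer(torch.randn(1, 4096))
```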

TCL606 commented 8 months ago

I will close this issue. If you have any further questions, feel free to reopen it.