Hi Yuan,
Thank you for your excellent work! I ran some inference experiments with LTU and found that the model answers closed-ended questions correctly (for example, identifying instruments or determining the speaker's gender). However, when I rephrase the questions, such as “What is the instrument in the audio?” or “What is the gender of the person speaking in the audio?”, the model generally responds that there is not enough information to make a judgment. Does this indicate that forgetting occurred during the training of Llama, or could it be an issue with the alignment between the modalities?
Thank you for your help!