Was the released music model trained on both the audio and text branches, or only on the audio branch?
I followed the finetune_script and found that it only fine-tunes the audio branch for classification. Is that right? If I want to fine-tune both the text and audio branches, should I unfreeze the text branch?
Yes. If you want to fine-tune both the text and audio branches, you should use the training script rather than the fine-tune script. Sorry for the potential confusion in the naming: the fine-tune script is for fine-tuning the audio encoder for a downstream task.
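For anyone who wants to see what "unfreezing a branch" means in practice, here is a minimal sketch. It is not the repository's code: the helper `set_trainable` and the parameter-name prefixes `audio_branch.` and `text_branch.` are assumptions made for illustration, and the stand-in objects only mimic the `requires_grad` flag that PyTorch parameters carry. With a real PyTorch model you would presumably pass `model.named_parameters()` instead of the dictionary below, after checking the actual parameter names.

```python
from types import SimpleNamespace


def set_trainable(named_params, prefixes, trainable):
    """Set requires_grad on every parameter whose name starts with one of `prefixes`.

    `named_params` is any iterable of (name, param) pairs, for example the
    result of `model.named_parameters()` on a PyTorch module.
    Returns the list of parameter names that were touched.
    """
    touched = []
    for name, param in named_params:
        if name.startswith(tuple(prefixes)):
            param.requires_grad = trainable
            touched.append(name)
    return touched


# Stand-in parameters (hypothetical names): the audio branch is trainable,
# the text branch starts out frozen.
params = {
    "audio_branch.layer.weight": SimpleNamespace(requires_grad=True),
    "text_branch.layer.weight": SimpleNamespace(requires_grad=False),
}

# Unfreeze only the parameters under the assumed text-branch prefix.
unfrozen = set_trainable(params.items(), ["text_branch."], True)
```

Matching on name prefixes keeps the helper independent of how the model is built, but it only works if the prefixes match the real parameter names, so it is worth printing the returned list before training.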
I have 2 questions: