X-LANCE / SLAM-LLM

Speech, Language, Audio, Music Processing with Large Language Model
MIT License

question about q-former #127

Closed: peggyxpxu closed this 1 week ago

peggyxpxu commented 3 months ago

Hi, if I want to use the q-former as the projector for the AAC AudioCaps recipe, should the length of the audio encoder placeholder be set to 64?

cwx-worst-one commented 3 months ago

Yes. The q-former query length follows the configuration used in BLIP, which is 64. You can also customize the query length by modifying this line.
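
For reference, here is a minimal sketch of what a Q-Former-style projector looks like (a simplification, not the repo's actual implementation): a fixed set of learnable queries cross-attends to the encoder output, so the projector always emits `query_length` tokens regardless of the input audio length.

```python
# Minimal Q-Former-style projector sketch (not SLAM-LLM's actual code).
import torch
import torch.nn as nn

class QFormerProjectorSketch(nn.Module):
    def __init__(self, encoder_dim: int, llm_dim: int,
                 query_length: int = 64, num_heads: int = 8):
        super().__init__()
        # 64 queries follows the BLIP default mentioned above.
        self.queries = nn.Parameter(torch.randn(1, query_length, encoder_dim))
        self.cross_attn = nn.MultiheadAttention(encoder_dim, num_heads,
                                                batch_first=True)
        self.proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, encoder_out: torch.Tensor) -> torch.Tensor:
        # encoder_out: (batch, time, encoder_dim)
        q = self.queries.expand(encoder_out.size(0), -1, -1)
        attended, _ = self.cross_attn(q, encoder_out, encoder_out)
        return self.proj(attended)  # (batch, query_length, llm_dim)
```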

With this setup, you can refer to this code and set fix_length_audio=query_length (e.g., 64) in the config, so that the placeholder length matches the number of queries.
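
Concretely, the point is that the two settings must stay in sync. A hypothetical Hydra-style override list illustrates this; the exact key names and config sections are assumptions, so check your own config files for the real ones:

```python
# Hypothetical Hydra-style overrides; key names/sections besides
# fix_length_audio are assumptions, not the repo's verified API.
query_length = 64
overrides = [
    "++model_config.encoder_projector=q-former",        # assumed projector name
    f"++model_config.query_len={query_length}",         # assumed key for the query count
    f"++train_config.fix_length_audio={query_length}",  # placeholder length = number of queries
]
```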

peggyxpxu commented 3 months ago

Thanks. I trained the q-former with the Whisper encoder, but the loss is much larger than when I use a linear projector with the Whisper encoder. Have you ever done similar experiments?
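
For context on the comparison, the linear projector alternative is roughly frame stacking plus a linear map. A minimal sketch follows; the downsampling rate of 5 and the single linear layer are simplifications, not the repo's exact implementation:

```python
# Minimal linear-projector sketch: stack k encoder frames per output token,
# then map to the LLM dimension (a simplification of the linear option).
import torch
import torch.nn as nn

class LinearProjectorSketch(nn.Module):
    def __init__(self, encoder_dim: int, llm_dim: int, k: int = 5):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(encoder_dim * k, llm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, encoder_dim)
        b, t, d = x.shape
        x = x[:, : t - t % self.k]        # drop remainder frames
        x = x.reshape(b, -1, d * self.k)  # stack k frames per token
        return self.proj(x)               # (batch, time // k, llm_dim)
```

Note the difference from the q-former: here the output length scales with the audio length, while the q-former always emits a fixed number of query tokens.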

peggyxpxu commented 3 months ago

I carefully analyzed the test results and found that the model hallucinates severely when using the q-former.

cwx-worst-one commented 3 months ago

Yes, I have experimented with the q-former and found that the train/validation loss is higher compared to using linear layers, which may be why the model hallucinates. However, I haven't done extensive hyperparameter tuning yet. You could try adjusting the query length and other parameters for further experimentation (see the sketch below).
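
One simple way to organize such a sweep is to train briefly at several query lengths and compare validation loss. In the sketch below, train_one_config is a hypothetical stand-in for your actual training entry point, not a real SLAM-LLM API:

```python
# Hypothetical sweep over q-former query lengths.
def train_one_config(query_length: int, fix_length_audio: int) -> float:
    """Hypothetical stand-in: wire this to your actual training script."""
    raise NotImplementedError

for query_length in (32, 64, 96, 128):
    # Keep the placeholder length equal to the number of queries.
    val_loss = train_one_config(query_length=query_length,
                                fix_length_audio=query_length)
    print(f"query_length={query_length}: val_loss={val_loss:.4f}")
```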