LAION-AI / CLAP

Contrastive Language-Audio Pretraining
https://arxiv.org/abs/2211.06687
Creative Commons Zero v1.0 Universal

size mismatch between model and ckpt #165


kayleeliyx commented 1 month ago

I am having exactly the same issue as this one: https://github.com/LAION-AI/CLAP/issues/162. The loaded model should be 630k-audioset-best.pt. I am trying to fine-tune the CLAP model on ESC50; later I want to fine-tune it on my own dataset.

Thanks a lot for your help!

kayleeliyx commented 1 month ago

I am able to train the model with parameter settings like `--amodel=HTSAT-tiny`, but I don't know what the difference is from HTSAT-base, or why the model here https://huggingface.co/lukewys/laion_clap/tree/main doesn't work. Also, how can I change the dimension of the 'audio_projection.0.weight' layer in the audio encoder's projection head, where it is expected to be [512, 2048]?

tbrouns commented 3 weeks ago

The shape is only expected to be [512, 2048] if you use the PANN-14 architecture, i.e. `--amodel PANN-14`.

However, as far as I know, the authors haven't released any pretrained weights based on the PANN-14 architecture.

Both HTSAT-tiny and HTSAT-base use a 768-dimensional output; the former is just a smaller version of the latter. HTSAT-tiny requires significantly less GPU memory and might be able to support larger batch sizes, but personally I found its performance to be worse than HTSAT-base's.
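For reference, here is a minimal sketch of where the [512, 2048] vs. [512, 768] mismatch comes from. The dimensions follow what's stated in this thread (PANN-14 → 2048-d audio features, HTSAT-tiny/base → 768-d, projected to a 512-d joint space); the mapping below is an illustration, not the repo's actual config code:

```python
# Embedding dim produced by each audio encoder (assumed from this thread:
# PANN-14 outputs 2048-d features, HTSAT-tiny/base output 768-d features).
AUDIO_EMBED_DIM = {
    "PANN-14": 2048,
    "HTSAT-tiny": 768,
    "HTSAT-base": 768,
}

PROJECTION_DIM = 512  # joint text-audio embedding size


def expected_projection_shape(amodel: str) -> tuple:
    """Shape of 'audio_projection.0.weight' for a given --amodel choice."""
    return (PROJECTION_DIM, AUDIO_EMBED_DIM[amodel])


# A checkpoint trained with HTSAT-tiny stores a [512, 768] projection,
# so loading it into a PANN-14 model (which expects [512, 2048]) fails
# with a size mismatch.
print(expected_projection_shape("HTSAT-tiny"))  # (512, 768)
print(expected_projection_shape("PANN-14"))     # (512, 2048)
```

So rather than changing the projection layer's dimension, pick the `--amodel` value that matches the checkpoint you load.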

These weights use the HTSAT-tiny architecture:

- 630k-audioset-best.pt
- 630k-audioset-fusion-best.pt
- 630k-best.pt
- 630k-fusion-best.pt

And these use HTSAT-base:

- music_audioset_epoch_15_esc_90.14.pt
- music_speech_audioset_epoch_15_esc_89.98.pt
- music_speech_epoch_15_esc_89.25.pt
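If you hit another size mismatch, one way to narrow it down is to diff the parameter shapes in the checkpoint against a freshly built model before loading. A minimal sketch of the idea, using plain dicts of shapes as stand-ins for `{k: tuple(v.shape) for k, v in ...}` over a real `torch.load(...)` state dict and `model.state_dict()` (the helper itself is my own, not part of the CLAP repo):

```python
def find_shape_mismatches(ckpt_shapes, model_shapes):
    """Return {param_name: (ckpt_shape, model_shape)} for keys present in
    both dicts whose shapes disagree."""
    return {
        name: (ckpt_shapes[name], model_shapes[name])
        for name in ckpt_shapes.keys() & model_shapes.keys()
        if ckpt_shapes[name] != model_shapes[name]
    }


# Stand-in shape dicts; in practice build these from torch.load(path) and
# model.state_dict(). Values here mirror the mismatch from this thread.
ckpt = {"audio_projection.0.weight": (512, 768), "logit_scale_a": ()}
model = {"audio_projection.0.weight": (512, 2048), "logit_scale_a": ()}

print(find_shape_mismatches(ckpt, model))
# {'audio_projection.0.weight': ((512, 768), (512, 2048))}
```

Any key that shows up here means the `--amodel` (or other architecture flags) used to build the model doesn't match the one the checkpoint was trained with.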

Hope this helps