LAION-AI / CLAP

Contrastive Language-Audio Pretraining
https://arxiv.org/abs/2211.06687
Creative Commons Zero v1.0 Universal
1.43k stars · 138 forks

Loading check point #113

Closed · Neptune-S-777 closed this issue 1 year ago

Neptune-S-777 commented 1 year ago

I have tried to load music_audioset_epoch_15_esc_90.14.pt with the example code. Both my machine and Colab raise the following error: RuntimeError: Error(s) in loading state_dict for CLAP: Unexpected key(s) in state_dict: "text_branch.embeddings.position_ids". Could this be caused by a mismatched audio encoder? Could you tell me which amodel I should choose? Regards

Neptune-S-777 commented 1 year ago

Hi all, I was trying to fine-tune the model with the training script. I get an error whether I use roberta or bert, and I can't load the checkpoint either. I would be glad if you could tell me which tmodel I should use. Regards

csteinmetz1 commented 1 year ago

Hit the same issue today.

waldleitner commented 1 year ago

It seems the transformers library introduced changes in version 4.31.0 that affect loading of the text branch (the RoBERTa base model), resulting in the state_dict error.

As a workaround, you can pin the transformers library to version 4.30.2 in your requirements (tested with music_audioset_epoch_15_esc_90.14.pt). @Neptune-S-777 @csteinmetz1
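If pinning the library isn't an option, another workaround is to drop the offending key from the checkpoint before loading, since transformers >= 4.31 no longer registers `position_ids` as a persistent buffer in the embeddings module. A minimal sketch (the helper name is my own, not from the repo):

```python
def strip_position_ids(state_dict):
    """Remove the buffer key that transformers >= 4.31 no longer expects.

    Safe no-op if the key is absent; returns the same dict for chaining.
    """
    state_dict.pop("text_branch.embeddings.position_ids", None)
    return state_dict

# Usage (assuming the checkpoint stores its weights under "state_dict"):
# ckpt = torch.load("music_audioset_epoch_15_esc_90.14.pt", map_location="cpu")
# model.load_state_dict(strip_position_ids(ckpt["state_dict"]))
```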

PabloPeso commented 1 year ago

Thanks for the workaround. I've tried it and it worked for the default model; however, for the model you mentioned (which I downloaded from https://huggingface.co/lukewys/laion_clap/resolve/main/music_audioset_epoch_15_esc_90.14.pt) I am getting the following errors (size mismatch and missing keys):

Missing key(s) in state_dict: "audio_branch.patch_embed.mel_conv2d.weight", "audio_branch.patch_embed.mel_conv2d.bias", "audio_branch.patch_embed.fusion_model.local_att.0.weight", "audio_branch.patch_embed.fusion_model.local_att.0.bias", "audio_branch.patch_embed.fusion_model.local_att.1.weight", "audio_branch.patch_embed.fusion_model.local_att.1.bias", "audio_branch.patch_embed.fusion_model.local_att.1.running_mean", "audio_branch.patch_embed.fusion_model.local_att.1.running_var", "audio_branch.patch_embed.fusion_model.local_att.3.weight", "audio_branch.patch_embed.fusion_model.local_att.3.bias", "audio_branch.patch_embed.fusion_model.local_att.4.weight", "audio_branch.patch_embed.fusion_model.local_att.4.bias", "audio_branch.patch_embed.fusion_model.local_att.4.running_mean", "audio_branch.patch_embed.fusion_model.local_att.4.running_var", "audio_branch.patch_embed.fusion_model.global_att.1.weight", "audio_branch.patch_embed.fusion_model.global_att.1.bias", "audio_branch.patch_embed.fusion_model.global_att.2.weight", "audio_branch.patch_embed.fusion_model.global_att.2.bias", "audio_branch.patch_embed.fusion_model.global_att.2.running_mean", "audio_branch.patch_embed.fusion_model.global_att.2.running_var", "audio_branch.patch_embed.fusion_model.global_att.4.weight", "audio_branch.patch_embed.fusion_model.global_att.4.bias", "audio_branch.patch_embed.fusion_model.global_att.5.weight", "audio_branch.patch_embed.fusion_model.global_att.5.bias", "audio_branch.patch_embed.fusion_model.global_att.5.running_mean", "audio_branch.patch_embed.fusion_model.global_att.5.running_var". 
        Unexpected key(s) in state_dict: "audio_branch.layers.2.blocks.6.norm1.weight", "audio_branch.layers.2.blocks.6.norm1.bias", "audio_branch.layers.2.blocks.6.attn.relative_position_bias_table", "audio_branch.layers.2.blocks.6.attn.relative_position_index", "audio_branch.layers.2.blocks.6.attn.qkv.weight", "audio_branch.layers.2.blocks.6.attn.qkv.bias", "audio_branch.layers.2.blocks.6.attn.proj.weight", "audio_branch.layers.2.blocks.6.attn.proj.bias", "audio_branch.layers.2.blocks.6.norm2.weight", "audio_branch.layers.2.blocks.6.norm2.bias", "audio_branch.layers.2.blocks.6.mlp.fc1.weight", "audio_branch.layers.2.blocks.6.mlp.fc1.bias", "audio_branch.layers.2.blocks.6.mlp.fc2.weight", "audio_branch.layers.2.blocks.6.mlp.fc2.bias", "audio_branch.layers.2.blocks.7.attn_mask", "audio_branch.layers.2.blocks.7.norm1.weight", "audio_branch.layers.2.blocks.7.norm1.bias", "audio_branch.layers.2.blocks.7.attn.relative_position_bias_table", "audio_branch.layers.2.blocks.7.attn.relative_position_index", "audio_branch.layers.2.blocks.7.attn.qkv.weight", "audio_branch.layers.2.blocks.7.attn.qkv.bias", "audio_branch.layers.2.blocks.7.attn.proj.weight", "audio_branch.layers.2.blocks.7.attn.proj.bias", "audio_branch.layers.2.blocks.7.norm2.weight", "audio_branch.layers.2.blocks.7.norm2.bias", "audio_branch.layers.2.blocks.7.mlp.fc1.weight", "audio_branch.layers.2.blocks.7.mlp.fc1.bias", "audio_branch.layers.2.blocks.7.mlp.fc2.weight", "audio_branch.layers.2.blocks.7.mlp.fc2.bias", "audio_branch.layers.2.blocks.8.norm1.weight", "audio_branch.layers.2.blocks.8.norm1.bias", "audio_branch.layers.2.blocks.8.attn.relative_position_bias_table", "audio_branch.layers.2.blocks.8.attn.relative_position_index", "audio_branch.layers.2.blocks.8.attn.qkv.weight", "audio_branch.layers.2.blocks.8.attn.qkv.bias", "audio_branch.layers.2.blocks.8.attn.proj.weight", "audio_branch.layers.2.blocks.8.attn.proj.bias", "audio_branch.layers.2.blocks.8.norm2.weight", 
"audio_branch.layers.2.blocks.8.norm2.bias", "audio_branch.layers.2.blocks.8.mlp.fc1.weight", "audio_branch.layers.2.blocks.8.mlp.fc1.bias", "audio_branch.layers.2.blocks.8.mlp.fc2.weight", "audio_branch.layers.2.blocks.8.mlp.fc2.bias", "audio_branch.layers.2.blocks.9.attn_mask", "audio_branch.layers.2.blocks.9.norm1.weight", "audio_branch.layers.2.blocks.9.norm1.bias", "audio_branch.layers.2.blocks.9.attn.relative_position_bias_table", "audio_branch.layers.2.blocks.9.attn.relative_position_index", "audio_branch.layers.2.blocks.9.attn.qkv.weight", "audio_branch.layers.2.blocks.9.attn.qkv.bias", "audio_branch.layers.2.blocks.9.attn.proj.weight", "audio_branch.layers.2.blocks.9.attn.proj.bias", "audio_branch.layers.2.blocks.9.norm2.weight", "audio_branch.layers.2.blocks.9.norm2.bias", "audio_branch.layers.2.blocks.9.mlp.fc1.weight", "audio_branch.layers.2.blocks.9.mlp.fc1.bias", "audio_branch.layers.2.blocks.9.mlp.fc2.weight", "audio_branch.layers.2.blocks.9.mlp.fc2.bias", "audio_branch.layers.2.blocks.10.norm1.weight", "audio_branch.layers.2.blocks.10.norm1.bias", "audio_branch.layers.2.blocks.10.attn.relative_position_bias_table", "audio_branch.layers.2.blocks.10.attn.relative_position_index", "audio_branch.layers.2.blocks.10.attn.qkv.weight", "audio_branch.layers.2.blocks.10.attn.qkv.bias", "audio_branch.layers.2.blocks.10.attn.proj.weight", "audio_branch.layers.2.blocks.10.attn.proj.bias", "audio_branch.layers.2.blocks.10.norm2.weight", "audio_branch.layers.2.blocks.10.norm2.bias", "audio_branch.layers.2.blocks.10.mlp.fc1.weight", "audio_branch.layers.2.blocks.10.mlp.fc1.bias", "audio_branch.layers.2.blocks.10.mlp.fc2.weight", "audio_branch.layers.2.blocks.10.mlp.fc2.bias", "audio_branch.layers.2.blocks.11.attn_mask", "audio_branch.layers.2.blocks.11.norm1.weight", "audio_branch.layers.2.blocks.11.norm1.bias", "audio_branch.layers.2.blocks.11.attn.relative_position_bias_table", "audio_branch.layers.2.blocks.11.attn.relative_position_index", 
"audio_branch.layers.2.blocks.11.attn.qkv.weight", "audio_branch.layers.2.blocks.11.attn.qkv.bias", "audio_branch.layers.2.blocks.11.attn.proj.weight", "audio_branch.layers.2.blocks.11.attn.proj.bias", "audio_branch.layers.2.blocks.11.norm2.weight", "audio_branch.layers.2.blocks.11.norm2.bias", "audio_branch.layers.2.blocks.11.mlp.fc1.weight", "audio_branch.layers.2.blocks.11.mlp.fc1.bias", "audio_branch.layers.2.blocks.11.mlp.fc2.weight", "audio_branch.layers.2.blocks.11.mlp.fc2.bias". 
        size mismatch for audio_branch.patch_embed.proj.weight: copying a param with shape torch.Size([128, 1, 4, 4]) from checkpoint, the shape in current model is torch.Size([96, 1, 4, 4]).
        size mismatch for audio_branch.patch_embed.proj.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
        size mismatch for audio_branch.patch_embed.norm.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
        size mismatch for audio_branch.patch_embed.norm.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
        size mismatch for audio_branch.layers.0.blocks.0.norm1.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
        size mismatch for audio_branch.layers.0.blocks.0.norm1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
        size mismatch for audio_branch.layers.0.blocks.0.attn.qkv.weight: copying a param with shape torch.Size([384, 128]) from checkpoint, the shape in current model is torch.Size([288, 96]).
        size mismatch for audio_branch.layers.0.blocks.0.attn.qkv.bias: copying a param with shape torch.Size([384]) from checkpoint, the shape in current model is torch.Size([288]).
        size mismatch for audio_branch.layers.0.blocks.0.attn.proj.weight: copying a param with shape torch.Size([128, 128]) from checkpoint, the shape in current model is torch.Size([96, 96]).
        size mismatch for audio_branch.layers.0.blocks.0.attn.proj.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
        size mismatch for audio_branch.layers.0.blocks.0.norm2.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
        size mismatch for audio_branch.layers.0.blocks.0.norm2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
        size mismatch for audio_branch.layers.0.blocks.0.mlp.fc1.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([384, 96]).
        size mismatch for audio_branch.layers.0.blocks.0.mlp.fc1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.0.blocks.0.mlp.fc2.weight: copying a param with shape torch.Size([128, 512]) from checkpoint, the shape in current model is torch.Size([96, 384]).
        size mismatch for audio_branch.layers.0.blocks.0.mlp.fc2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
        size mismatch for audio_branch.layers.0.blocks.1.norm1.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
        size mismatch for audio_branch.layers.0.blocks.1.norm1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
        size mismatch for audio_branch.layers.0.blocks.1.attn.qkv.weight: copying a param with shape torch.Size([384, 128]) from checkpoint, the shape in current model is torch.Size([288, 96]).
        size mismatch for audio_branch.layers.0.blocks.1.attn.qkv.bias: copying a param with shape torch.Size([384]) from checkpoint, the shape in current model is torch.Size([288]).
        size mismatch for audio_branch.layers.0.blocks.1.attn.proj.weight: copying a param with shape torch.Size([128, 128]) from checkpoint, the shape in current model is torch.Size([96, 96]).
        size mismatch for audio_branch.layers.0.blocks.1.attn.proj.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
        size mismatch for audio_branch.layers.0.blocks.1.norm2.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
        size mismatch for audio_branch.layers.0.blocks.1.norm2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
        size mismatch for audio_branch.layers.0.blocks.1.mlp.fc1.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([384, 96]).
        size mismatch for audio_branch.layers.0.blocks.1.mlp.fc1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.0.blocks.1.mlp.fc2.weight: copying a param with shape torch.Size([128, 512]) from checkpoint, the shape in current model is torch.Size([96, 384]).
        size mismatch for audio_branch.layers.0.blocks.1.mlp.fc2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
        size mismatch for audio_branch.layers.0.downsample.reduction.weight: copying a param with shape torch.Size([256, 512]) from checkpoint, the shape in current model is torch.Size([192, 384]).
        size mismatch for audio_branch.layers.0.downsample.norm.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.0.downsample.norm.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.1.blocks.0.norm1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
        size mismatch for audio_branch.layers.1.blocks.0.norm1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
        size mismatch for audio_branch.layers.1.blocks.0.attn.qkv.weight: copying a param with shape torch.Size([768, 256]) from checkpoint, the shape in current model is torch.Size([576, 192]).
        size mismatch for audio_branch.layers.1.blocks.0.attn.qkv.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([576]).
        size mismatch for audio_branch.layers.1.blocks.0.attn.proj.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([192, 192]).
        size mismatch for audio_branch.layers.1.blocks.0.attn.proj.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
        size mismatch for audio_branch.layers.1.blocks.0.norm2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
        size mismatch for audio_branch.layers.1.blocks.0.norm2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
        size mismatch for audio_branch.layers.1.blocks.0.mlp.fc1.weight: copying a param with shape torch.Size([1024, 256]) from checkpoint, the shape in current model is torch.Size([768, 192]).
        size mismatch for audio_branch.layers.1.blocks.0.mlp.fc1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for audio_branch.layers.1.blocks.0.mlp.fc2.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([192, 768]).
        size mismatch for audio_branch.layers.1.blocks.0.mlp.fc2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
        size mismatch for audio_branch.layers.1.blocks.1.norm1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
        size mismatch for audio_branch.layers.1.blocks.1.norm1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
        size mismatch for audio_branch.layers.1.blocks.1.attn.qkv.weight: copying a param with shape torch.Size([768, 256]) from checkpoint, the shape in current model is torch.Size([576, 192]).
        size mismatch for audio_branch.layers.1.blocks.1.attn.qkv.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([576]).
        size mismatch for audio_branch.layers.1.blocks.1.attn.proj.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([192, 192]).
        size mismatch for audio_branch.layers.1.blocks.1.attn.proj.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
        size mismatch for audio_branch.layers.1.blocks.1.norm2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
        size mismatch for audio_branch.layers.1.blocks.1.norm2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
        size mismatch for audio_branch.layers.1.blocks.1.mlp.fc1.weight: copying a param with shape torch.Size([1024, 256]) from checkpoint, the shape in current model is torch.Size([768, 192]).
        size mismatch for audio_branch.layers.1.blocks.1.mlp.fc1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for audio_branch.layers.1.blocks.1.mlp.fc2.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([192, 768]).
        size mismatch for audio_branch.layers.1.blocks.1.mlp.fc2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
        size mismatch for audio_branch.layers.1.downsample.reduction.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([384, 768]).
        size mismatch for audio_branch.layers.1.downsample.norm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for audio_branch.layers.1.downsample.norm.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for audio_branch.layers.2.blocks.0.norm1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.0.norm1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.0.attn.qkv.weight: copying a param with shape torch.Size([1536, 512]) from checkpoint, the shape in current model is torch.Size([1152, 384]).
        size mismatch for audio_branch.layers.2.blocks.0.attn.qkv.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([1152]).
        size mismatch for audio_branch.layers.2.blocks.0.attn.proj.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([384, 384]).
        size mismatch for audio_branch.layers.2.blocks.0.attn.proj.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.0.norm2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.0.norm2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.0.mlp.fc1.weight: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([1536, 384]).
        size mismatch for audio_branch.layers.2.blocks.0.mlp.fc1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([1536]).
        size mismatch for audio_branch.layers.2.blocks.0.mlp.fc2.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([384, 1536]).
        size mismatch for audio_branch.layers.2.blocks.0.mlp.fc2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.1.norm1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.1.norm1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.1.attn.qkv.weight: copying a param with shape torch.Size([1536, 512]) from checkpoint, the shape in current model is torch.Size([1152, 384]).
        size mismatch for audio_branch.layers.2.blocks.1.attn.qkv.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([1152]).
        size mismatch for audio_branch.layers.2.blocks.1.attn.proj.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([384, 384]).
        size mismatch for audio_branch.layers.2.blocks.1.attn.proj.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.1.norm2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.1.norm2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.1.mlp.fc1.weight: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([1536, 384]).
        size mismatch for audio_branch.layers.2.blocks.1.mlp.fc1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([1536]).
        size mismatch for audio_branch.layers.2.blocks.1.mlp.fc2.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([384, 1536]).
        size mismatch for audio_branch.layers.2.blocks.1.mlp.fc2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.2.norm1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.2.norm1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.2.attn.qkv.weight: copying a param with shape torch.Size([1536, 512]) from checkpoint, the shape in current model is torch.Size([1152, 384]).
        size mismatch for audio_branch.layers.2.blocks.2.attn.qkv.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([1152]).
        size mismatch for audio_branch.layers.2.blocks.2.attn.proj.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([384, 384]).
        size mismatch for audio_branch.layers.2.blocks.2.attn.proj.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.2.norm2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.2.norm2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.2.mlp.fc1.weight: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([1536, 384]).
        size mismatch for audio_branch.layers.2.blocks.2.mlp.fc1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([1536]).
        size mismatch for audio_branch.layers.2.blocks.2.mlp.fc2.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([384, 1536]).
        size mismatch for audio_branch.layers.2.blocks.2.mlp.fc2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.3.norm1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.3.norm1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.3.attn.qkv.weight: copying a param with shape torch.Size([1536, 512]) from checkpoint, the shape in current model is torch.Size([1152, 384]).
        size mismatch for audio_branch.layers.2.blocks.3.attn.qkv.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([1152]).
        size mismatch for audio_branch.layers.2.blocks.3.attn.proj.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([384, 384]).
        size mismatch for audio_branch.layers.2.blocks.3.attn.proj.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.3.norm2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.3.norm2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.3.mlp.fc1.weight: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([1536, 384]).
        size mismatch for audio_branch.layers.2.blocks.3.mlp.fc1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([1536]).
        size mismatch for audio_branch.layers.2.blocks.3.mlp.fc2.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([384, 1536]).
        size mismatch for audio_branch.layers.2.blocks.3.mlp.fc2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.4.norm1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.4.norm1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.4.attn.qkv.weight: copying a param with shape torch.Size([1536, 512]) from checkpoint, the shape in current model is torch.Size([1152, 384]).
        size mismatch for audio_branch.layers.2.blocks.4.attn.qkv.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([1152]).
        size mismatch for audio_branch.layers.2.blocks.4.attn.proj.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([384, 384]).
        size mismatch for audio_branch.layers.2.blocks.4.attn.proj.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.4.norm2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.4.norm2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.4.mlp.fc1.weight: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([1536, 384]).
        size mismatch for audio_branch.layers.2.blocks.4.mlp.fc1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([1536]).
        size mismatch for audio_branch.layers.2.blocks.4.mlp.fc2.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([384, 1536]).
        size mismatch for audio_branch.layers.2.blocks.4.mlp.fc2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.5.norm1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.5.norm1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.5.attn.qkv.weight: copying a param with shape torch.Size([1536, 512]) from checkpoint, the shape in current model is torch.Size([1152, 384]).
        size mismatch for audio_branch.layers.2.blocks.5.attn.qkv.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([1152]).
        size mismatch for audio_branch.layers.2.blocks.5.attn.proj.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([384, 384]).
        size mismatch for audio_branch.layers.2.blocks.5.attn.proj.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.5.norm2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.5.norm2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.blocks.5.mlp.fc1.weight: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([1536, 384]).
        size mismatch for audio_branch.layers.2.blocks.5.mlp.fc1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([1536]).
        size mismatch for audio_branch.layers.2.blocks.5.mlp.fc2.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([384, 1536]).
        size mismatch for audio_branch.layers.2.blocks.5.mlp.fc2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
        size mismatch for audio_branch.layers.2.downsample.reduction.weight: copying a param with shape torch.Size([1024, 2048]) from checkpoint, the shape in current model is torch.Size([768, 1536]).
        size mismatch for audio_branch.layers.2.downsample.norm.weight: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([1536]).
        size mismatch for audio_branch.layers.2.downsample.norm.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([1536]).
        size mismatch for audio_branch.layers.3.blocks.0.norm1.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for audio_branch.layers.3.blocks.0.norm1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for audio_branch.layers.3.blocks.0.attn.qkv.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([2304, 768]).
        size mismatch for audio_branch.layers.3.blocks.0.attn.qkv.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([2304]).
        size mismatch for audio_branch.layers.3.blocks.0.attn.proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
        size mismatch for audio_branch.layers.3.blocks.0.attn.proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for audio_branch.layers.3.blocks.0.norm2.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for audio_branch.layers.3.blocks.0.norm2.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for audio_branch.layers.3.blocks.0.mlp.fc1.weight: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
        size mismatch for audio_branch.layers.3.blocks.0.mlp.fc1.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3072]).
        size mismatch for audio_branch.layers.3.blocks.0.mlp.fc2.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
        size mismatch for audio_branch.layers.3.blocks.0.mlp.fc2.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for audio_branch.layers.3.blocks.1.norm1.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for audio_branch.layers.3.blocks.1.norm1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for audio_branch.layers.3.blocks.1.attn.qkv.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([2304, 768]).
        size mismatch for audio_branch.layers.3.blocks.1.attn.qkv.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([2304]).
        size mismatch for audio_branch.layers.3.blocks.1.attn.proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
        size mismatch for audio_branch.layers.3.blocks.1.attn.proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for audio_branch.layers.3.blocks.1.norm2.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for audio_branch.layers.3.blocks.1.norm2.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for audio_branch.layers.3.blocks.1.mlp.fc1.weight: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
        size mismatch for audio_branch.layers.3.blocks.1.mlp.fc1.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3072]).
        size mismatch for audio_branch.layers.3.blocks.1.mlp.fc2.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
        size mismatch for audio_branch.layers.3.blocks.1.mlp.fc2.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for audio_branch.norm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for audio_branch.norm.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for audio_branch.tscam_conv.weight: copying a param with shape torch.Size([527, 1024, 2, 3]) from checkpoint, the shape in current model is torch.Size([527, 768, 2, 3]).
        size mismatch for audio_projection.0.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 768]).

My code is simply:

```python
from src import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=True)
model.load_ckpt('music_audioset_epoch_15_esc_90.14.pt')
```

Could you share how you tested, @waldleitner?

Thanks

Neptune-S-777 commented 1 year ago

@waldleitner Thanks so much for your answer, the problem is solved. @PabloPeso Try defining the audio encoder with `amodel='HTSAT-base'`.

lukewys commented 1 year ago

Thanks all for the report! I just updated `requirements.txt`.
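For anyone who wants to catch this before hitting the `state_dict` error, a small sanity check along these lines might help. This is just a sketch: the helper name is made up, and `4.31.0` is the version waldleitner identified above as introducing the change.

```python
def transformers_version_is_affected(tf_version: str) -> bool:
    """Return True if the given transformers version is >= 4.31.0,
    the release reported in this thread to break loading of the
    RoBERTa text branch."""
    # Compare only the numeric major.minor.patch components.
    parts = tuple(int(p) for p in tf_version.split(".")[:3])
    return parts >= (4, 31, 0)

# e.g. pass transformers.__version__ before calling model.load_ckpt(...)
print(transformers_version_is_affected("4.30.2"))  # False: pinned version, safe
print(transformers_version_is_affected("4.31.0"))  # True: expect the state_dict error
```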

PabloPeso commented 1 year ago

Thanks @Neptune-S-777

In case others face the same issue: I also had to set `enable_fusion=False`, so the working call looks like this:

```python
model = laion_clap.CLAP_Module(enable_fusion=False, amodel='HTSAT-base')
model.load_ckpt('music_audioset_epoch_15_esc_90.14.pt')
```

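If pinning transformers is not an option, another possible workaround (a sketch, not part of the repo; the function name and the checkpoint layout in the comments are assumptions) is to drop the unexpected `text_branch.embeddings.position_ids` entry from the checkpoint before loading, since `position_ids` is a registered buffer rather than a learned weight:

```python
def strip_unexpected_keys(state_dict,
                          unexpected=("text_branch.embeddings.position_ids",)):
    """Remove keys the current model no longer expects.
    position_ids is a non-learnable buffer, so dropping it loses nothing."""
    return {k: v for k, v in state_dict.items() if k not in unexpected}

# Hypothetical usage, assuming the .pt file holds the weights directly
# (or under a 'state_dict' key, as some checkpoints do):
# ckpt = torch.load('music_audioset_epoch_15_esc_90.14.pt', map_location='cpu')
# model.model.load_state_dict(strip_unexpected_keys(ckpt), strict=False)
```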
FeminaS17 commented 3 months ago

https://huggingface.co/lukewys/laion_clap/resolve/main/music_audioset_epoch_15_esc_90.14.pt When trying to fine-tune this model with the training script, I'm still getting the error `AssertionError: bert/roberta/bart text encoder does not support pretrained models.`

This happens with both transformers 4.30.0 and 4.30.2. Please suggest a workaround. @waldleitner @lukewys @Neptune-S-777