Closed. Neptune-S-777 closed this issue 1 year ago.
Hi all, I was trying to fine-tune the model with the training script. I get an error when I use roberta or bert, and I can't load the checkpoint either. I would be glad if you could tell me which tmodel I should use. Regards
Hit the same issue today. It seems the transformers library introduced some changes in version 4.31.0 that affect loading the text branch (the RoBERTa base model), resulting in the state_dict error. As a workaround, you can pin the transformers library to version 4.30.2 in your requirements (tested with music_audioset_epoch_15_esc_90.14.pt). @Neptune-S-777 @csteinmetz1
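To apply the pin, add transformers==4.30.2 to your requirements.txt (or run pip install transformers==4.30.2). As a minimal sketch (not part of the original workaround), you could also assert the version at runtime before loading, since 4.31.0 changes the RoBERTa state_dict layout the text branch expects:

```python
# Sketch: guard against an incompatible transformers release before loading.
import transformers

assert transformers.__version__ == "4.30.2", (
    f"Found transformers {transformers.__version__}; "
    "pin to 4.30.2 (pip install transformers==4.30.2) to load this checkpoint."
)
```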
Thanks for the workaround. I've tried it and it worked for the default model; however, for the model you mentioned (which I downloaded from https://huggingface.co/lukewys/laion_clap/resolve/main/music_audioset_epoch_15_esc_90.14.pt) I am getting the following errors (size mismatch and missing keys):
Missing key(s) in state_dict: "audio_branch.patch_embed.mel_conv2d.weight", "audio_branch.patch_embed.mel_conv2d.bias", "audio_branch.patch_embed.fusion_model.local_att.0.weight", "audio_branch.patch_embed.fusion_model.local_att.0.bias", "audio_branch.patch_embed.fusion_model.local_att.1.weight", "audio_branch.patch_embed.fusion_model.local_att.1.bias", "audio_branch.patch_embed.fusion_model.local_att.1.running_mean", "audio_branch.patch_embed.fusion_model.local_att.1.running_var", "audio_branch.patch_embed.fusion_model.local_att.3.weight", "audio_branch.patch_embed.fusion_model.local_att.3.bias", "audio_branch.patch_embed.fusion_model.local_att.4.weight", "audio_branch.patch_embed.fusion_model.local_att.4.bias", "audio_branch.patch_embed.fusion_model.local_att.4.running_mean", "audio_branch.patch_embed.fusion_model.local_att.4.running_var", "audio_branch.patch_embed.fusion_model.global_att.1.weight", "audio_branch.patch_embed.fusion_model.global_att.1.bias", "audio_branch.patch_embed.fusion_model.global_att.2.weight", "audio_branch.patch_embed.fusion_model.global_att.2.bias", "audio_branch.patch_embed.fusion_model.global_att.2.running_mean", "audio_branch.patch_embed.fusion_model.global_att.2.running_var", "audio_branch.patch_embed.fusion_model.global_att.4.weight", "audio_branch.patch_embed.fusion_model.global_att.4.bias", "audio_branch.patch_embed.fusion_model.global_att.5.weight", "audio_branch.patch_embed.fusion_model.global_att.5.bias", "audio_branch.patch_embed.fusion_model.global_att.5.running_mean", "audio_branch.patch_embed.fusion_model.global_att.5.running_var".
Unexpected key(s) in state_dict: "audio_branch.layers.2.blocks.6.norm1.weight", "audio_branch.layers.2.blocks.6.norm1.bias", "audio_branch.layers.2.blocks.6.attn.relative_position_bias_table", "audio_branch.layers.2.blocks.6.attn.relative_position_index", "audio_branch.layers.2.blocks.6.attn.qkv.weight", "audio_branch.layers.2.blocks.6.attn.qkv.bias", "audio_branch.layers.2.blocks.6.attn.proj.weight", "audio_branch.layers.2.blocks.6.attn.proj.bias", "audio_branch.layers.2.blocks.6.norm2.weight", "audio_branch.layers.2.blocks.6.norm2.bias", "audio_branch.layers.2.blocks.6.mlp.fc1.weight", "audio_branch.layers.2.blocks.6.mlp.fc1.bias", "audio_branch.layers.2.blocks.6.mlp.fc2.weight", "audio_branch.layers.2.blocks.6.mlp.fc2.bias", "audio_branch.layers.2.blocks.7.attn_mask", "audio_branch.layers.2.blocks.7.norm1.weight", "audio_branch.layers.2.blocks.7.norm1.bias", "audio_branch.layers.2.blocks.7.attn.relative_position_bias_table", "audio_branch.layers.2.blocks.7.attn.relative_position_index", "audio_branch.layers.2.blocks.7.attn.qkv.weight", "audio_branch.layers.2.blocks.7.attn.qkv.bias", "audio_branch.layers.2.blocks.7.attn.proj.weight", "audio_branch.layers.2.blocks.7.attn.proj.bias", "audio_branch.layers.2.blocks.7.norm2.weight", "audio_branch.layers.2.blocks.7.norm2.bias", "audio_branch.layers.2.blocks.7.mlp.fc1.weight", "audio_branch.layers.2.blocks.7.mlp.fc1.bias", "audio_branch.layers.2.blocks.7.mlp.fc2.weight", "audio_branch.layers.2.blocks.7.mlp.fc2.bias", "audio_branch.layers.2.blocks.8.norm1.weight", "audio_branch.layers.2.blocks.8.norm1.bias", "audio_branch.layers.2.blocks.8.attn.relative_position_bias_table", "audio_branch.layers.2.blocks.8.attn.relative_position_index", "audio_branch.layers.2.blocks.8.attn.qkv.weight", "audio_branch.layers.2.blocks.8.attn.qkv.bias", "audio_branch.layers.2.blocks.8.attn.proj.weight", "audio_branch.layers.2.blocks.8.attn.proj.bias", "audio_branch.layers.2.blocks.8.norm2.weight", "audio_branch.layers.2.blocks.8.norm2.bias", "audio_branch.layers.2.blocks.8.mlp.fc1.weight", "audio_branch.layers.2.blocks.8.mlp.fc1.bias", "audio_branch.layers.2.blocks.8.mlp.fc2.weight", "audio_branch.layers.2.blocks.8.mlp.fc2.bias", "audio_branch.layers.2.blocks.9.attn_mask", "audio_branch.layers.2.blocks.9.norm1.weight", "audio_branch.layers.2.blocks.9.norm1.bias", "audio_branch.layers.2.blocks.9.attn.relative_position_bias_table", "audio_branch.layers.2.blocks.9.attn.relative_position_index", "audio_branch.layers.2.blocks.9.attn.qkv.weight", "audio_branch.layers.2.blocks.9.attn.qkv.bias", "audio_branch.layers.2.blocks.9.attn.proj.weight", "audio_branch.layers.2.blocks.9.attn.proj.bias", "audio_branch.layers.2.blocks.9.norm2.weight", "audio_branch.layers.2.blocks.9.norm2.bias", "audio_branch.layers.2.blocks.9.mlp.fc1.weight", "audio_branch.layers.2.blocks.9.mlp.fc1.bias", "audio_branch.layers.2.blocks.9.mlp.fc2.weight", "audio_branch.layers.2.blocks.9.mlp.fc2.bias", "audio_branch.layers.2.blocks.10.norm1.weight", "audio_branch.layers.2.blocks.10.norm1.bias", "audio_branch.layers.2.blocks.10.attn.relative_position_bias_table", "audio_branch.layers.2.blocks.10.attn.relative_position_index", "audio_branch.layers.2.blocks.10.attn.qkv.weight", "audio_branch.layers.2.blocks.10.attn.qkv.bias", "audio_branch.layers.2.blocks.10.attn.proj.weight", "audio_branch.layers.2.blocks.10.attn.proj.bias", "audio_branch.layers.2.blocks.10.norm2.weight", "audio_branch.layers.2.blocks.10.norm2.bias", "audio_branch.layers.2.blocks.10.mlp.fc1.weight", 
"audio_branch.layers.2.blocks.10.mlp.fc1.bias", "audio_branch.layers.2.blocks.10.mlp.fc2.weight", "audio_branch.layers.2.blocks.10.mlp.fc2.bias", "audio_branch.layers.2.blocks.11.attn_mask", "audio_branch.layers.2.blocks.11.norm1.weight", "audio_branch.layers.2.blocks.11.norm1.bias", "audio_branch.layers.2.blocks.11.attn.relative_position_bias_table", "audio_branch.layers.2.blocks.11.attn.relative_position_index", "audio_branch.layers.2.blocks.11.attn.qkv.weight", "audio_branch.layers.2.blocks.11.attn.qkv.bias", "audio_branch.layers.2.blocks.11.attn.proj.weight", "audio_branch.layers.2.blocks.11.attn.proj.bias", "audio_branch.layers.2.blocks.11.norm2.weight", "audio_branch.layers.2.blocks.11.norm2.bias", "audio_branch.layers.2.blocks.11.mlp.fc1.weight", "audio_branch.layers.2.blocks.11.mlp.fc1.bias", "audio_branch.layers.2.blocks.11.mlp.fc2.weight", "audio_branch.layers.2.blocks.11.mlp.fc2.bias".
size mismatch for audio_branch.patch_embed.proj.weight: copying a param with shape torch.Size([128, 1, 4, 4]) from checkpoint, the shape in current model is torch.Size([96, 1, 4, 4]).
size mismatch for audio_branch.patch_embed.proj.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for audio_branch.patch_embed.norm.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for audio_branch.patch_embed.norm.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for audio_branch.layers.0.blocks.0.norm1.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for audio_branch.layers.0.blocks.0.norm1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for audio_branch.layers.0.blocks.0.attn.qkv.weight: copying a param with shape torch.Size([384, 128]) from checkpoint, the shape in current model is torch.Size([288, 96]).
size mismatch for audio_branch.layers.0.blocks.0.attn.qkv.bias: copying a param with shape torch.Size([384]) from checkpoint, the shape in current model is torch.Size([288]).
size mismatch for audio_branch.layers.0.blocks.0.attn.proj.weight: copying a param with shape torch.Size([128, 128]) from checkpoint, the shape in current model is torch.Size([96, 96]).
size mismatch for audio_branch.layers.0.blocks.0.attn.proj.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for audio_branch.layers.0.blocks.0.norm2.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for audio_branch.layers.0.blocks.0.norm2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for audio_branch.layers.0.blocks.0.mlp.fc1.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([384, 96]).
size mismatch for audio_branch.layers.0.blocks.0.mlp.fc1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.0.blocks.0.mlp.fc2.weight: copying a param with shape torch.Size([128, 512]) from checkpoint, the shape in current model is torch.Size([96, 384]).
size mismatch for audio_branch.layers.0.blocks.0.mlp.fc2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for audio_branch.layers.0.blocks.1.norm1.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for audio_branch.layers.0.blocks.1.norm1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for audio_branch.layers.0.blocks.1.attn.qkv.weight: copying a param with shape torch.Size([384, 128]) from checkpoint, the shape in current model is torch.Size([288, 96]).
size mismatch for audio_branch.layers.0.blocks.1.attn.qkv.bias: copying a param with shape torch.Size([384]) from checkpoint, the shape in current model is torch.Size([288]).
size mismatch for audio_branch.layers.0.blocks.1.attn.proj.weight: copying a param with shape torch.Size([128, 128]) from checkpoint, the shape in current model is torch.Size([96, 96]).
size mismatch for audio_branch.layers.0.blocks.1.attn.proj.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for audio_branch.layers.0.blocks.1.norm2.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for audio_branch.layers.0.blocks.1.norm2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for audio_branch.layers.0.blocks.1.mlp.fc1.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([384, 96]).
size mismatch for audio_branch.layers.0.blocks.1.mlp.fc1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.0.blocks.1.mlp.fc2.weight: copying a param with shape torch.Size([128, 512]) from checkpoint, the shape in current model is torch.Size([96, 384]).
size mismatch for audio_branch.layers.0.blocks.1.mlp.fc2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for audio_branch.layers.0.downsample.reduction.weight: copying a param with shape torch.Size([256, 512]) from checkpoint, the shape in current model is torch.Size([192, 384]).
size mismatch for audio_branch.layers.0.downsample.norm.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.0.downsample.norm.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.1.blocks.0.norm1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for audio_branch.layers.1.blocks.0.norm1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for audio_branch.layers.1.blocks.0.attn.qkv.weight: copying a param with shape torch.Size([768, 256]) from checkpoint, the shape in current model is torch.Size([576, 192]).
size mismatch for audio_branch.layers.1.blocks.0.attn.qkv.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([576]).
size mismatch for audio_branch.layers.1.blocks.0.attn.proj.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([192, 192]).
size mismatch for audio_branch.layers.1.blocks.0.attn.proj.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for audio_branch.layers.1.blocks.0.norm2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for audio_branch.layers.1.blocks.0.norm2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for audio_branch.layers.1.blocks.0.mlp.fc1.weight: copying a param with shape torch.Size([1024, 256]) from checkpoint, the shape in current model is torch.Size([768, 192]).
size mismatch for audio_branch.layers.1.blocks.0.mlp.fc1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for audio_branch.layers.1.blocks.0.mlp.fc2.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([192, 768]).
size mismatch for audio_branch.layers.1.blocks.0.mlp.fc2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for audio_branch.layers.1.blocks.1.norm1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for audio_branch.layers.1.blocks.1.norm1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for audio_branch.layers.1.blocks.1.attn.qkv.weight: copying a param with shape torch.Size([768, 256]) from checkpoint, the shape in current model is torch.Size([576, 192]).
size mismatch for audio_branch.layers.1.blocks.1.attn.qkv.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([576]).
size mismatch for audio_branch.layers.1.blocks.1.attn.proj.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([192, 192]).
size mismatch for audio_branch.layers.1.blocks.1.attn.proj.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for audio_branch.layers.1.blocks.1.norm2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for audio_branch.layers.1.blocks.1.norm2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for audio_branch.layers.1.blocks.1.mlp.fc1.weight: copying a param with shape torch.Size([1024, 256]) from checkpoint, the shape in current model is torch.Size([768, 192]).
size mismatch for audio_branch.layers.1.blocks.1.mlp.fc1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for audio_branch.layers.1.blocks.1.mlp.fc2.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([192, 768]).
size mismatch for audio_branch.layers.1.blocks.1.mlp.fc2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([192]).
size mismatch for audio_branch.layers.1.downsample.reduction.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([384, 768]).
size mismatch for audio_branch.layers.1.downsample.norm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for audio_branch.layers.1.downsample.norm.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for audio_branch.layers.2.blocks.0.norm1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.0.norm1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.0.attn.qkv.weight: copying a param with shape torch.Size([1536, 512]) from checkpoint, the shape in current model is torch.Size([1152, 384]).
size mismatch for audio_branch.layers.2.blocks.0.attn.qkv.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([1152]).
size mismatch for audio_branch.layers.2.blocks.0.attn.proj.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([384, 384]).
size mismatch for audio_branch.layers.2.blocks.0.attn.proj.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.0.norm2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.0.norm2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.0.mlp.fc1.weight: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([1536, 384]).
size mismatch for audio_branch.layers.2.blocks.0.mlp.fc1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([1536]).
size mismatch for audio_branch.layers.2.blocks.0.mlp.fc2.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([384, 1536]).
size mismatch for audio_branch.layers.2.blocks.0.mlp.fc2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.1.norm1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.1.norm1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.1.attn.qkv.weight: copying a param with shape torch.Size([1536, 512]) from checkpoint, the shape in current model is torch.Size([1152, 384]).
size mismatch for audio_branch.layers.2.blocks.1.attn.qkv.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([1152]).
size mismatch for audio_branch.layers.2.blocks.1.attn.proj.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([384, 384]).
size mismatch for audio_branch.layers.2.blocks.1.attn.proj.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.1.norm2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.1.norm2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.1.mlp.fc1.weight: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([1536, 384]).
size mismatch for audio_branch.layers.2.blocks.1.mlp.fc1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([1536]).
size mismatch for audio_branch.layers.2.blocks.1.mlp.fc2.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([384, 1536]).
size mismatch for audio_branch.layers.2.blocks.1.mlp.fc2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.2.norm1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.2.norm1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.2.attn.qkv.weight: copying a param with shape torch.Size([1536, 512]) from checkpoint, the shape in current model is torch.Size([1152, 384]).
size mismatch for audio_branch.layers.2.blocks.2.attn.qkv.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([1152]).
size mismatch for audio_branch.layers.2.blocks.2.attn.proj.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([384, 384]).
size mismatch for audio_branch.layers.2.blocks.2.attn.proj.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.2.norm2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.2.norm2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.2.mlp.fc1.weight: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([1536, 384]).
size mismatch for audio_branch.layers.2.blocks.2.mlp.fc1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([1536]).
size mismatch for audio_branch.layers.2.blocks.2.mlp.fc2.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([384, 1536]).
size mismatch for audio_branch.layers.2.blocks.2.mlp.fc2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.3.norm1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.3.norm1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.3.attn.qkv.weight: copying a param with shape torch.Size([1536, 512]) from checkpoint, the shape in current model is torch.Size([1152, 384]).
size mismatch for audio_branch.layers.2.blocks.3.attn.qkv.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([1152]).
size mismatch for audio_branch.layers.2.blocks.3.attn.proj.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([384, 384]).
size mismatch for audio_branch.layers.2.blocks.3.attn.proj.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.3.norm2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.3.norm2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.3.mlp.fc1.weight: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([1536, 384]).
size mismatch for audio_branch.layers.2.blocks.3.mlp.fc1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([1536]).
size mismatch for audio_branch.layers.2.blocks.3.mlp.fc2.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([384, 1536]).
size mismatch for audio_branch.layers.2.blocks.3.mlp.fc2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.4.norm1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.4.norm1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.4.attn.qkv.weight: copying a param with shape torch.Size([1536, 512]) from checkpoint, the shape in current model is torch.Size([1152, 384]).
size mismatch for audio_branch.layers.2.blocks.4.attn.qkv.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([1152]).
size mismatch for audio_branch.layers.2.blocks.4.attn.proj.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([384, 384]).
size mismatch for audio_branch.layers.2.blocks.4.attn.proj.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.4.norm2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.4.norm2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.4.mlp.fc1.weight: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([1536, 384]).
size mismatch for audio_branch.layers.2.blocks.4.mlp.fc1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([1536]).
size mismatch for audio_branch.layers.2.blocks.4.mlp.fc2.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([384, 1536]).
size mismatch for audio_branch.layers.2.blocks.4.mlp.fc2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.5.norm1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.5.norm1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.5.attn.qkv.weight: copying a param with shape torch.Size([1536, 512]) from checkpoint, the shape in current model is torch.Size([1152, 384]).
size mismatch for audio_branch.layers.2.blocks.5.attn.qkv.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([1152]).
size mismatch for audio_branch.layers.2.blocks.5.attn.proj.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([384, 384]).
size mismatch for audio_branch.layers.2.blocks.5.attn.proj.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.5.norm2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.5.norm2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.blocks.5.mlp.fc1.weight: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([1536, 384]).
size mismatch for audio_branch.layers.2.blocks.5.mlp.fc1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([1536]).
size mismatch for audio_branch.layers.2.blocks.5.mlp.fc2.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([384, 1536]).
size mismatch for audio_branch.layers.2.blocks.5.mlp.fc2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([384]).
size mismatch for audio_branch.layers.2.downsample.reduction.weight: copying a param with shape torch.Size([1024, 2048]) from checkpoint, the shape in current model is torch.Size([768, 1536]).
size mismatch for audio_branch.layers.2.downsample.norm.weight: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([1536]).
size mismatch for audio_branch.layers.2.downsample.norm.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([1536]).
size mismatch for audio_branch.layers.3.blocks.0.norm1.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for audio_branch.layers.3.blocks.0.norm1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for audio_branch.layers.3.blocks.0.attn.qkv.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([2304, 768]).
size mismatch for audio_branch.layers.3.blocks.0.attn.qkv.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([2304]).
size mismatch for audio_branch.layers.3.blocks.0.attn.proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for audio_branch.layers.3.blocks.0.attn.proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for audio_branch.layers.3.blocks.0.norm2.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for audio_branch.layers.3.blocks.0.norm2.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for audio_branch.layers.3.blocks.0.mlp.fc1.weight: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for audio_branch.layers.3.blocks.0.mlp.fc1.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for audio_branch.layers.3.blocks.0.mlp.fc2.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
size mismatch for audio_branch.layers.3.blocks.0.mlp.fc2.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for audio_branch.layers.3.blocks.1.norm1.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for audio_branch.layers.3.blocks.1.norm1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for audio_branch.layers.3.blocks.1.attn.qkv.weight: copying a param with shape torch.Size([3072, 1024]) from checkpoint, the shape in current model is torch.Size([2304, 768]).
size mismatch for audio_branch.layers.3.blocks.1.attn.qkv.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([2304]).
size mismatch for audio_branch.layers.3.blocks.1.attn.proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for audio_branch.layers.3.blocks.1.attn.proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for audio_branch.layers.3.blocks.1.norm2.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for audio_branch.layers.3.blocks.1.norm2.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for audio_branch.layers.3.blocks.1.mlp.fc1.weight: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for audio_branch.layers.3.blocks.1.mlp.fc1.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for audio_branch.layers.3.blocks.1.mlp.fc2.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
size mismatch for audio_branch.layers.3.blocks.1.mlp.fc2.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for audio_branch.norm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for audio_branch.norm.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for audio_branch.tscam_conv.weight: copying a param with shape torch.Size([527, 1024, 2, 3]) from checkpoint, the shape in current model is torch.Size([527, 768, 2, 3]).
size mismatch for audio_projection.0.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([512, 768]).
My code is simply:

```python
from src import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=True)
model.load_ckpt('music_audioset_epoch_15_esc_90.14.pt')
```
Could you share how you tested, @waldleitner? Thanks!
@waldleitner Thanks so much for your answer; the problem is solved. @PabloPeso Try defining the audio encoder with amodel='HTSAT-base'.
Thanks all for the report! I just updated the requirements.txt.
Thanks @Neptune-S-777. Just in case others face the same issue: I also had to set enable_fusion=False, so it looks like this:

```python
model = laion_clap.CLAP_Module(enable_fusion=False, amodel='HTSAT-base')
model.load_ckpt('music_audioset_epoch_15_esc_90.14.pt')
```
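For anyone verifying the load, a minimal sanity check along these lines may help (get_text_embedding and the 512-dimensional output follow the laion_clap README; the captions here are made up):

```python
import laion_clap

# Match the checkpoint: non-fusion model with the HTSAT-base audio encoder.
model = laion_clap.CLAP_Module(enable_fusion=False, amodel='HTSAT-base')
model.load_ckpt('music_audioset_epoch_15_esc_90.14.pt')

# Embed a couple of example captions to confirm the text branch loaded.
text_embed = model.get_text_embedding(
    ['a happy piano melody', 'distorted electric guitar riff'],
    use_tensor=False,
)
print(text_embed.shape)  # expected: (2, 512)
```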
I'm using https://huggingface.co/lukewys/laion_clap/resolve/main/music_audioset_epoch_15_esc_90.14.pt while trying to fine-tune the model with the training script, and I'm still getting the error "AssertionError: bert/roberta/bart text encoder does not support pretrained models." for the same model, with both transformers 4.30.0 and 4.30.2. Please suggest a workaround. @waldleitner @lukewys @Neptune-S-777
I have tried to load music_audioset_epoch_15_esc_90.14.pt with the example code. Both my machine and Colab produce the following error: RuntimeError: Error(s) in loading state_dict for CLAP: Unexpected key(s) in state_dict: "text_branch.embeddings.position_ids". Could it be caused by a mismatched audio encoder? Could you tell me which amodel I should choose? Regards
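The unexpected text_branch.embeddings.position_ids key is most likely the transformers >= 4.31.0 change discussed above, not an audio-encoder mismatch; pinning transformers==4.30.2 should resolve it. If pinning is not an option, here is a minimal sketch for dropping the stale key manually (assuming the checkpoint stores weights under a 'state_dict' entry with a 'module.' prefix, as LAION-CLAP training checkpoints typically do, and that model.model is the wrapped CLAP network):

```python
import torch
import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=False, amodel='HTSAT-base')

# Load the raw checkpoint and unwrap the state dict (assumed layout).
ckpt = torch.load('music_audioset_epoch_15_esc_90.14.pt', map_location='cpu')
state_dict = ckpt.get('state_dict', ckpt)

# Strip the DataParallel 'module.' prefix and drop the buffer that
# transformers >= 4.31.0 no longer registers on the RoBERTa text branch.
state_dict = {k[len('module.'):] if k.startswith('module.') else k: v
              for k, v in state_dict.items()}
state_dict.pop('text_branch.embeddings.position_ids', None)

model.model.load_state_dict(state_dict)  # model.model: assumed inner CLAP module
```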