SysCV / sam-hq

Segment Anything in High Quality [NeurIPS 2023]
https://arxiv.org/abs/2306.01567
Apache License 2.0

How to train vit_tiny (Light HQ-SAM for real-time need): ViT-Tiny HQ-SAM model? #130

Open Andy718811 opened 5 months ago

Andy718811 commented 5 months ago

I tried to train vit_tiny with this command:

python -m torch.distributed.launch --nproc_per_node=1 train.py --checkpoint ./pretrained_checkpoint/sam_hq_vit_tiny.pth --model-type vit_b --output work_dirs/hq_sam_tiny_l

It fails with the error below. It seems train.py can't be used to train the vit_tiny model. The full error message:

Traceback (most recent call last):
  File "train.py", line 700, in <module>
    main(net, train_datasets, valid_datasets, args)
  File "train.py", line 366, in main
    train(args, net, optimizer, train_dataloaders, valid_dataloaders, lr_scheduler)
  File "train.py", line 393, in train
    sam = sam_model_registry[args.model_type](checkpoint=args.checkpoint)
  File "/data/4TB/FENG/sam-hq-main/train/segment_anything_training/build_sam.py", line 38, in build_sam_vit_b
    return _build_sam(
  File "/data/4TB/FENG/sam-hq-main/train/segment_anything_training/build_sam.py", line 106, in _build_sam
    sam.load_state_dict(state_dict)
  File "/home/server3/anaconda3/envs/sam-hq/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Sam: Missing key(s) in state_dict: "image_encoder.pos_embed", "image_encoder.patch_embed.proj.weight", "image_encoder.patch_embed.proj.bias", "image_encoder.blocks.0.norm1.weight", "image_encoder.blocks.0.norm1.bias", "image_encoder.blocks.0.attn.rel_pos_h", "image_encoder.blocks.0.attn.rel_pos_w", "image_encoder.blocks.0.attn.qkv.weight", "image_encoder.blocks.0.attn.qkv.bias", "image_encoder.blocks.0.attn.proj.weight", "image_encoder.blocks.0.attn.proj.bias", "image_encoder.blocks.0.norm2.weight", "image_encoder.blocks.0.norm2.bias", "image_encoder.blocks.0.mlp.lin1.weight", "image_encoder.blocks.0.mlp.lin1.bias", "image_encoder.blocks.0.mlp.lin2.weight", "image_encoder.blocks.0.mlp.lin2.bias", "image_encoder.blocks.1.norm1.weight", "image_encoder.blocks.1.norm1.bias", "image_encoder.blocks.1.attn.rel_pos_h", 
"image_encoder.blocks.1.attn.rel_pos_w", "image_encoder.blocks.1.attn.qkv.weight", "image_encoder.blocks.1.attn.qkv.bias", "image_encoder.blocks.1.attn.proj.weight", "image_encoder.blocks.1.attn.proj.bias", "image_encoder.blocks.1.norm2.weight", "image_encoder.blocks.1.norm2.bias", "image_encoder.blocks.1.mlp.lin1.weight", "image_encoder.blocks.1.mlp.lin1.bias", "image_encoder.blocks.1.mlp.lin2.weight", "image_encoder.blocks.1.mlp.lin2.bias", "image_encoder.blocks.2.norm1.weight", "image_encoder.blocks.2.norm1.bias", "image_encoder.blocks.2.attn.rel_pos_h", "image_encoder.blocks.2.attn.rel_pos_w", "image_encoder.blocks.2.attn.qkv.weight", "image_encoder.blocks.2.attn.qkv.bias", "image_encoder.blocks.2.attn.proj.weight", "image_encoder.blocks.2.attn.proj.bias", "image_encoder.blocks.2.norm2.weight", "image_encoder.blocks.2.norm2.bias", "image_encoder.blocks.2.mlp.lin1.weight", "image_encoder.blocks.2.mlp.lin1.bias", "image_encoder.blocks.2.mlp.lin2.weight", "image_encoder.blocks.2.mlp.lin2.bias", "image_encoder.blocks.3.norm1.weight", "image_encoder.blocks.3.norm1.bias", "image_encoder.blocks.3.attn.rel_pos_h", "image_encoder.blocks.3.attn.rel_pos_w", "image_encoder.blocks.3.attn.qkv.weight", "image_encoder.blocks.3.attn.qkv.bias", "image_encoder.blocks.3.attn.proj.weight", "image_encoder.blocks.3.attn.proj.bias", "image_encoder.blocks.3.norm2.weight", "image_encoder.blocks.3.norm2.bias", "image_encoder.blocks.3.mlp.lin1.weight", "image_encoder.blocks.3.mlp.lin1.bias", "image_encoder.blocks.3.mlp.lin2.weight", "image_encoder.blocks.3.mlp.lin2.bias", "image_encoder.blocks.4.norm1.weight", "image_encoder.blocks.4.norm1.bias", "image_encoder.blocks.4.attn.rel_pos_h", "image_encoder.blocks.4.attn.rel_pos_w", "image_encoder.blocks.4.attn.qkv.weight", "image_encoder.blocks.4.attn.qkv.bias", "image_encoder.blocks.4.attn.proj.weight", "image_encoder.blocks.4.attn.proj.bias", "image_encoder.blocks.4.norm2.weight", "image_encoder.blocks.4.norm2.bias", 
"image_encoder.blocks.4.mlp.lin1.weight", "image_encoder.blocks.4.mlp.lin1.bias", "image_encoder.blocks.4.mlp.lin2.weight", "image_encoder.blocks.4.mlp.lin2.bias", "image_encoder.blocks.5.norm1.weight", "image_encoder.blocks.5.norm1.bias", "image_encoder.blocks.5.attn.rel_pos_h", "image_encoder.blocks.5.attn.rel_pos_w", "image_encoder.blocks.5.attn.qkv.weight", "image_encoder.blocks.5.attn.qkv.bias", "image_encoder.blocks.5.attn.proj.weight", "image_encoder.blocks.5.attn.proj.bias", "image_encoder.blocks.5.norm2.weight", "image_encoder.blocks.5.norm2.bias", "image_encoder.blocks.5.mlp.lin1.weight", "image_encoder.blocks.5.mlp.lin1.bias", "image_encoder.blocks.5.mlp.lin2.weight", "image_encoder.blocks.5.mlp.lin2.bias", "image_encoder.blocks.6.norm1.weight", "image_encoder.blocks.6.norm1.bias", "image_encoder.blocks.6.attn.rel_pos_h", "image_encoder.blocks.6.attn.rel_pos_w", "image_encoder.blocks.6.attn.qkv.weight", "image_encoder.blocks.6.attn.qkv.bias", "image_encoder.blocks.6.attn.proj.weight", "image_encoder.blocks.6.attn.proj.bias", "image_encoder.blocks.6.norm2.weight", "image_encoder.blocks.6.norm2.bias", "image_encoder.blocks.6.mlp.lin1.weight", "image_encoder.blocks.6.mlp.lin1.bias", "image_encoder.blocks.6.mlp.lin2.weight", "image_encoder.blocks.6.mlp.lin2.bias", "image_encoder.blocks.7.norm1.weight", "image_encoder.blocks.7.norm1.bias", "image_encoder.blocks.7.attn.rel_pos_h", "image_encoder.blocks.7.attn.rel_pos_w", "image_encoder.blocks.7.attn.qkv.weight", "image_encoder.blocks.7.attn.qkv.bias", "image_encoder.blocks.7.attn.proj.weight", "image_encoder.blocks.7.attn.proj.bias", "image_encoder.blocks.7.norm2.weight", "image_encoder.blocks.7.norm2.bias", "image_encoder.blocks.7.mlp.lin1.weight", "image_encoder.blocks.7.mlp.lin1.bias", "image_encoder.blocks.7.mlp.lin2.weight", "image_encoder.blocks.7.mlp.lin2.bias", "image_encoder.blocks.8.norm1.weight", "image_encoder.blocks.8.norm1.bias", "image_encoder.blocks.8.attn.rel_pos_h", 
"image_encoder.blocks.8.attn.rel_pos_w", "image_encoder.blocks.8.attn.qkv.weight", "image_encoder.blocks.8.attn.qkv.bias", "image_encoder.blocks.8.attn.proj.weight", "image_encoder.blocks.8.attn.proj.bias", "image_encoder.blocks.8.norm2.weight", "image_encoder.blocks.8.norm2.bias", "image_encoder.blocks.8.mlp.lin1.weight", "image_encoder.blocks.8.mlp.lin1.bias", "image_encoder.blocks.8.mlp.lin2.weight", "image_encoder.blocks.8.mlp.lin2.bias", "image_encoder.blocks.9.norm1.weight", "image_encoder.blocks.9.norm1.bias", "image_encoder.blocks.9.attn.rel_pos_h", "image_encoder.blocks.9.attn.rel_pos_w", "image_encoder.blocks.9.attn.qkv.weight", "image_encoder.blocks.9.attn.qkv.bias", "image_encoder.blocks.9.attn.proj.weight", "image_encoder.blocks.9.attn.proj.bias", "image_encoder.blocks.9.norm2.weight", "image_encoder.blocks.9.norm2.bias", "image_encoder.blocks.9.mlp.lin1.weight", "image_encoder.blocks.9.mlp.lin1.bias", "image_encoder.blocks.9.mlp.lin2.weight", "image_encoder.blocks.9.mlp.lin2.bias", "image_encoder.blocks.10.norm1.weight", "image_encoder.blocks.10.norm1.bias", "image_encoder.blocks.10.attn.rel_pos_h", "image_encoder.blocks.10.attn.rel_pos_w", "image_encoder.blocks.10.attn.qkv.weight", "image_encoder.blocks.10.attn.qkv.bias", "image_encoder.blocks.10.attn.proj.weight", "image_encoder.blocks.10.attn.proj.bias", "image_encoder.blocks.10.norm2.weight", "image_encoder.blocks.10.norm2.bias", "image_encoder.blocks.10.mlp.lin1.weight", "image_encoder.blocks.10.mlp.lin1.bias", "image_encoder.blocks.10.mlp.lin2.weight", "image_encoder.blocks.10.mlp.lin2.bias", "image_encoder.blocks.11.norm1.weight", "image_encoder.blocks.11.norm1.bias", "image_encoder.blocks.11.attn.rel_pos_h", "image_encoder.blocks.11.attn.rel_pos_w", "image_encoder.blocks.11.attn.qkv.weight", "image_encoder.blocks.11.attn.qkv.bias", "image_encoder.blocks.11.attn.proj.weight", "image_encoder.blocks.11.attn.proj.bias", "image_encoder.blocks.11.norm2.weight", "image_encoder.blocks.11.norm2.bias", 
"image_encoder.blocks.11.mlp.lin1.weight", "image_encoder.blocks.11.mlp.lin1.bias", "image_encoder.blocks.11.mlp.lin2.weight", "image_encoder.blocks.11.mlp.lin2.bias". Unexpected key(s) in state_dict: "image_encoder.layers.0.blocks.0.conv1.c.weight", "image_encoder.layers.0.blocks.0.conv1.bn.weight", "image_encoder.layers.0.blocks.0.conv1.bn.bias", "image_encoder.layers.0.blocks.0.conv1.bn.running_mean", "image_encoder.layers.0.blocks.0.conv1.bn.running_var", "image_encoder.layers.0.blocks.0.conv1.bn.num_batches_tracked", "image_encoder.layers.0.blocks.0.conv2.c.weight", "image_encoder.layers.0.blocks.0.conv2.bn.weight", "image_encoder.layers.0.blocks.0.conv2.bn.bias", "image_encoder.layers.0.blocks.0.conv2.bn.running_mean", "image_encoder.layers.0.blocks.0.conv2.bn.running_var", "image_encoder.layers.0.blocks.0.conv2.bn.num_batches_tracked", "image_encoder.layers.0.blocks.0.conv3.c.weight", "image_encoder.layers.0.blocks.0.conv3.bn.weight", "image_encoder.layers.0.blocks.0.conv3.bn.bias", "image_encoder.layers.0.blocks.0.conv3.bn.running_mean", "image_encoder.layers.0.blocks.0.conv3.bn.running_var", "image_encoder.layers.0.blocks.0.conv3.bn.num_batches_tracked", "image_encoder.layers.0.blocks.1.conv1.c.weight", "image_encoder.layers.0.blocks.1.conv1.bn.weight", "image_encoder.layers.0.blocks.1.conv1.bn.bias", "image_encoder.layers.0.blocks.1.conv1.bn.running_mean", "image_encoder.layers.0.blocks.1.conv1.bn.running_var", "image_encoder.layers.0.blocks.1.conv1.bn.num_batches_tracked", "image_encoder.layers.0.blocks.1.conv2.c.weight", "image_encoder.layers.0.blocks.1.conv2.bn.weight", "image_encoder.layers.0.blocks.1.conv2.bn.bias", "image_encoder.layers.0.blocks.1.conv2.bn.running_mean", "image_encoder.layers.0.blocks.1.conv2.bn.running_var", "image_encoder.layers.0.blocks.1.conv2.bn.num_batches_tracked", "image_encoder.layers.0.blocks.1.conv3.c.weight", "image_encoder.layers.0.blocks.1.conv3.bn.weight", "image_encoder.layers.0.blocks.1.conv3.bn.bias", 
"image_encoder.layers.0.blocks.1.conv3.bn.running_mean", "image_encoder.layers.0.blocks.1.conv3.bn.running_var", "image_encoder.layers.0.blocks.1.conv3.bn.num_batches_tracked", "image_encoder.layers.0.downsample.conv1.c.weight", "image_encoder.layers.0.downsample.conv1.bn.weight", "image_encoder.layers.0.downsample.conv1.bn.bias", "image_encoder.layers.0.downsample.conv1.bn.running_mean", "image_encoder.layers.0.downsample.conv1.bn.running_var", "image_encoder.layers.0.downsample.conv1.bn.num_batches_tracked", "image_encoder.layers.0.downsample.conv2.c.weight", "image_encoder.layers.0.downsample.conv2.bn.weight", "image_encoder.layers.0.downsample.conv2.bn.bias", "image_encoder.layers.0.downsample.conv2.bn.running_mean", "image_encoder.layers.0.downsample.conv2.bn.running_var", "image_encoder.layers.0.downsample.conv2.bn.num_batches_tracked", "image_encoder.layers.0.downsample.conv3.c.weight", "image_encoder.layers.0.downsample.conv3.bn.weight", "image_encoder.layers.0.downsample.conv3.bn.bias", "image_encoder.layers.0.downsample.conv3.bn.running_mean", "image_encoder.layers.0.downsample.conv3.bn.running_var", "image_encoder.layers.0.downsample.conv3.bn.num_batches_tracked", "image_encoder.layers.1.blocks.0.attn.attention_biases", "image_encoder.layers.1.blocks.0.attn.norm.weight", "image_encoder.layers.1.blocks.0.attn.norm.bias", "image_encoder.layers.1.blocks.0.attn.qkv.weight", "image_encoder.layers.1.blocks.0.attn.qkv.bias", "image_encoder.layers.1.blocks.0.attn.proj.weight", "image_encoder.layers.1.blocks.0.attn.proj.bias", "image_encoder.layers.1.blocks.0.mlp.norm.weight", "image_encoder.layers.1.blocks.0.mlp.norm.bias", "image_encoder.layers.1.blocks.0.mlp.fc1.weight", "image_encoder.layers.1.blocks.0.mlp.fc1.bias", "image_encoder.layers.1.blocks.0.mlp.fc2.weight", "image_encoder.layers.1.blocks.0.mlp.fc2.bias", "image_encoder.layers.1.blocks.0.local_conv.c.weight", "image_encoder.layers.1.blocks.0.local_conv.bn.weight", 
"image_encoder.layers.1.blocks.0.local_conv.bn.bias", "image_encoder.layers.1.blocks.0.local_conv.bn.running_mean", "image_encoder.layers.1.blocks.0.local_conv.bn.running_var", "image_encoder.layers.1.blocks.0.local_conv.bn.num_batches_tracked", "image_encoder.layers.1.blocks.1.attn.attention_biases", "image_encoder.layers.1.blocks.1.attn.norm.weight", "image_encoder.layers.1.blocks.1.attn.norm.bias", "image_encoder.layers.1.blocks.1.attn.qkv.weight", "image_encoder.layers.1.blocks.1.attn.qkv.bias", "image_encoder.layers.1.blocks.1.attn.proj.weight", "image_encoder.layers.1.blocks.1.attn.proj.bias", "image_encoder.layers.1.blocks.1.mlp.norm.weight", "image_encoder.layers.1.blocks.1.mlp.norm.bias", "image_encoder.layers.1.blocks.1.mlp.fc1.weight", "image_encoder.layers.1.blocks.1.mlp.fc1.bias", "image_encoder.layers.1.blocks.1.mlp.fc2.weight", "image_encoder.layers.1.blocks.1.mlp.fc2.bias", "image_encoder.layers.1.blocks.1.local_conv.c.weight", "image_encoder.layers.1.blocks.1.local_conv.bn.weight", "image_encoder.layers.1.blocks.1.local_conv.bn.bias", "image_encoder.layers.1.blocks.1.local_conv.bn.running_mean", "image_encoder.layers.1.blocks.1.local_conv.bn.running_var", "image_encoder.layers.1.blocks.1.local_conv.bn.num_batches_tracked", "image_encoder.layers.1.downsample.conv1.c.weight", "image_encoder.layers.1.downsample.conv1.bn.weight", "image_encoder.layers.1.downsample.conv1.bn.bias", "image_encoder.layers.1.downsample.conv1.bn.running_mean", "image_encoder.layers.1.downsample.conv1.bn.running_var", "image_encoder.layers.1.downsample.conv1.bn.num_batches_tracked", "image_encoder.layers.1.downsample.conv2.c.weight", "image_encoder.layers.1.downsample.conv2.bn.weight", "image_encoder.layers.1.downsample.conv2.bn.bias", "image_encoder.layers.1.downsample.conv2.bn.running_mean", "image_encoder.layers.1.downsample.conv2.bn.running_var", "image_encoder.layers.1.downsample.conv2.bn.num_batches_tracked", "image_encoder.layers.1.downsample.conv3.c.weight", 
"image_encoder.layers.1.downsample.conv3.bn.weight", "image_encoder.layers.1.downsample.conv3.bn.bias", "image_encoder.layers.1.downsample.conv3.bn.running_mean", "image_encoder.layers.1.downsample.conv3.bn.running_var", "image_encoder.layers.1.downsample.conv3.bn.num_batches_tracked", "image_encoder.layers.2.blocks.0.attn.attention_biases", "image_encoder.layers.2.blocks.0.attn.norm.weight", "image_encoder.layers.2.blocks.0.attn.norm.bias", "image_encoder.layers.2.blocks.0.attn.qkv.weight", "image_encoder.layers.2.blocks.0.attn.qkv.bias", "image_encoder.layers.2.blocks.0.attn.proj.weight", "image_encoder.layers.2.blocks.0.attn.proj.bias", "image_encoder.layers.2.blocks.0.mlp.norm.weight", "image_encoder.layers.2.blocks.0.mlp.norm.bias", "image_encoder.layers.2.blocks.0.mlp.fc1.weight", "image_encoder.layers.2.blocks.0.mlp.fc1.bias", "image_encoder.layers.2.blocks.0.mlp.fc2.weight", "image_encoder.layers.2.blocks.0.mlp.fc2.bias", "image_encoder.layers.2.blocks.0.local_conv.c.weight", "image_encoder.layers.2.blocks.0.local_conv.bn.weight", "image_encoder.layers.2.blocks.0.local_conv.bn.bias", "image_encoder.layers.2.blocks.0.local_conv.bn.running_mean", "image_encoder.layers.2.blocks.0.local_conv.bn.running_var", "image_encoder.layers.2.blocks.0.local_conv.bn.num_batches_tracked", "image_encoder.layers.2.blocks.1.attn.attention_biases", "image_encoder.layers.2.blocks.1.attn.norm.weight", "image_encoder.layers.2.blocks.1.attn.norm.bias", "image_encoder.layers.2.blocks.1.attn.qkv.weight", "image_encoder.layers.2.blocks.1.attn.qkv.bias", "image_encoder.layers.2.blocks.1.attn.proj.weight", "image_encoder.layers.2.blocks.1.attn.proj.bias", "image_encoder.layers.2.blocks.1.mlp.norm.weight", "image_encoder.layers.2.blocks.1.mlp.norm.bias", "image_encoder.layers.2.blocks.1.mlp.fc1.weight", "image_encoder.layers.2.blocks.1.mlp.fc1.bias", "image_encoder.layers.2.blocks.1.mlp.fc2.weight", "image_encoder.layers.2.blocks.1.mlp.fc2.bias", 
"image_encoder.layers.2.blocks.1.local_conv.c.weight", "image_encoder.layers.2.blocks.1.local_conv.bn.weight", "image_encoder.layers.2.blocks.1.local_conv.bn.bias", "image_encoder.layers.2.blocks.1.local_conv.bn.running_mean", "image_encoder.layers.2.blocks.1.local_conv.bn.running_var", "image_encoder.layers.2.blocks.1.local_conv.bn.num_batches_tracked", "image_encoder.layers.2.blocks.2.attn.attention_biases", "image_encoder.layers.2.blocks.2.attn.norm.weight", "image_encoder.layers.2.blocks.2.attn.norm.bias", "image_encoder.layers.2.blocks.2.attn.qkv.weight", "image_encoder.layers.2.blocks.2.attn.qkv.bias", "image_encoder.layers.2.blocks.2.attn.proj.weight", "image_encoder.layers.2.blocks.2.attn.proj.bias", "image_encoder.layers.2.blocks.2.mlp.norm.weight", "image_encoder.layers.2.blocks.2.mlp.norm.bias", "image_encoder.layers.2.blocks.2.mlp.fc1.weight", "image_encoder.layers.2.blocks.2.mlp.fc1.bias", "image_encoder.layers.2.blocks.2.mlp.fc2.weight", "image_encoder.layers.2.blocks.2.mlp.fc2.bias", "image_encoder.layers.2.blocks.2.local_conv.c.weight", "image_encoder.layers.2.blocks.2.local_conv.bn.weight", "image_encoder.layers.2.blocks.2.local_conv.bn.bias", "image_encoder.layers.2.blocks.2.local_conv.bn.running_mean", "image_encoder.layers.2.blocks.2.local_conv.bn.running_var", "image_encoder.layers.2.blocks.2.local_conv.bn.num_batches_tracked", "image_encoder.layers.2.blocks.3.attn.attention_biases", "image_encoder.layers.2.blocks.3.attn.norm.weight", "image_encoder.layers.2.blocks.3.attn.norm.bias", "image_encoder.layers.2.blocks.3.attn.qkv.weight", "image_encoder.layers.2.blocks.3.attn.qkv.bias", "image_encoder.layers.2.blocks.3.attn.proj.weight", "image_encoder.layers.2.blocks.3.attn.proj.bias", "image_encoder.layers.2.blocks.3.mlp.norm.weight", "image_encoder.layers.2.blocks.3.mlp.norm.bias", "image_encoder.layers.2.blocks.3.mlp.fc1.weight", "image_encoder.layers.2.blocks.3.mlp.fc1.bias", "image_encoder.layers.2.blocks.3.mlp.fc2.weight", 
"image_encoder.layers.2.blocks.3.mlp.fc2.bias", "image_encoder.layers.2.blocks.3.local_conv.c.weight", "image_encoder.layers.2.blocks.3.local_conv.bn.weight", "image_encoder.layers.2.blocks.3.local_conv.bn.bias", "image_encoder.layers.2.blocks.3.local_conv.bn.running_mean", "image_encoder.layers.2.blocks.3.local_conv.bn.running_var", "image_encoder.layers.2.blocks.3.local_conv.bn.num_batches_tracked", "image_encoder.layers.2.blocks.4.attn.attention_biases", "image_encoder.layers.2.blocks.4.attn.norm.weight", "image_encoder.layers.2.blocks.4.attn.norm.bias", "image_encoder.layers.2.blocks.4.attn.qkv.weight", "image_encoder.layers.2.blocks.4.attn.qkv.bias", "image_encoder.layers.2.blocks.4.attn.proj.weight", "image_encoder.layers.2.blocks.4.attn.proj.bias", "image_encoder.layers.2.blocks.4.mlp.norm.weight", "image_encoder.layers.2.blocks.4.mlp.norm.bias", "image_encoder.layers.2.blocks.4.mlp.fc1.weight", "image_encoder.layers.2.blocks.4.mlp.fc1.bias", "image_encoder.layers.2.blocks.4.mlp.fc2.weight", "image_encoder.layers.2.blocks.4.mlp.fc2.bias", "image_encoder.layers.2.blocks.4.local_conv.c.weight", "image_encoder.layers.2.blocks.4.local_conv.bn.weight", "image_encoder.layers.2.blocks.4.local_conv.bn.bias", "image_encoder.layers.2.blocks.4.local_conv.bn.running_mean", "image_encoder.layers.2.blocks.4.local_conv.bn.running_var", "image_encoder.layers.2.blocks.4.local_conv.bn.num_batches_tracked", "image_encoder.layers.2.blocks.5.attn.attention_biases", "image_encoder.layers.2.blocks.5.attn.norm.weight", "image_encoder.layers.2.blocks.5.attn.norm.bias", "image_encoder.layers.2.blocks.5.attn.qkv.weight", "image_encoder.layers.2.blocks.5.attn.qkv.bias", "image_encoder.layers.2.blocks.5.attn.proj.weight", "image_encoder.layers.2.blocks.5.attn.proj.bias", "image_encoder.layers.2.blocks.5.mlp.norm.weight", "image_encoder.layers.2.blocks.5.mlp.norm.bias", "image_encoder.layers.2.blocks.5.mlp.fc1.weight", "image_encoder.layers.2.blocks.5.mlp.fc1.bias", 
"image_encoder.layers.2.blocks.5.mlp.fc2.weight", "image_encoder.layers.2.blocks.5.mlp.fc2.bias", "image_encoder.layers.2.blocks.5.local_conv.c.weight", "image_encoder.layers.2.blocks.5.local_conv.bn.weight", "image_encoder.layers.2.blocks.5.local_conv.bn.bias", "image_encoder.layers.2.blocks.5.local_conv.bn.running_mean", "image_encoder.layers.2.blocks.5.local_conv.bn.running_var", "image_encoder.layers.2.blocks.5.local_conv.bn.num_batches_tracked", "image_encoder.layers.2.downsample.conv1.c.weight", "image_encoder.layers.2.downsample.conv1.bn.weight", "image_encoder.layers.2.downsample.conv1.bn.bias", "image_encoder.layers.2.downsample.conv1.bn.running_mean", "image_encoder.layers.2.downsample.conv1.bn.running_var", "image_encoder.layers.2.downsample.conv1.bn.num_batches_tracked", "image_encoder.layers.2.downsample.conv2.c.weight", "image_encoder.layers.2.downsample.conv2.bn.weight", "image_encoder.layers.2.downsample.conv2.bn.bias", "image_encoder.layers.2.downsample.conv2.bn.running_mean", "image_encoder.layers.2.downsample.conv2.bn.running_var", "image_encoder.layers.2.downsample.conv2.bn.num_batches_tracked", "image_encoder.layers.2.downsample.conv3.c.weight", "image_encoder.layers.2.downsample.conv3.bn.weight", "image_encoder.layers.2.downsample.conv3.bn.bias", "image_encoder.layers.2.downsample.conv3.bn.running_mean", "image_encoder.layers.2.downsample.conv3.bn.running_var", "image_encoder.layers.2.downsample.conv3.bn.num_batches_tracked", "image_encoder.layers.3.blocks.0.attn.attention_biases", "image_encoder.layers.3.blocks.0.attn.norm.weight", "image_encoder.layers.3.blocks.0.attn.norm.bias", "image_encoder.layers.3.blocks.0.attn.qkv.weight", "image_encoder.layers.3.blocks.0.attn.qkv.bias", "image_encoder.layers.3.blocks.0.attn.proj.weight", "image_encoder.layers.3.blocks.0.attn.proj.bias", "image_encoder.layers.3.blocks.0.mlp.norm.weight", "image_encoder.layers.3.blocks.0.mlp.norm.bias", "image_encoder.layers.3.blocks.0.mlp.fc1.weight", 
"image_encoder.layers.3.blocks.0.mlp.fc1.bias", "image_encoder.layers.3.blocks.0.mlp.fc2.weight", "image_encoder.layers.3.blocks.0.mlp.fc2.bias", "image_encoder.layers.3.blocks.0.local_conv.c.weight", "image_encoder.layers.3.blocks.0.local_conv.bn.weight", "image_encoder.layers.3.blocks.0.local_conv.bn.bias", "image_encoder.layers.3.blocks.0.local_conv.bn.running_mean", "image_encoder.layers.3.blocks.0.local_conv.bn.running_var", "image_encoder.layers.3.blocks.0.local_conv.bn.num_batches_tracked", "image_encoder.layers.3.blocks.1.attn.attention_biases", "image_encoder.layers.3.blocks.1.attn.norm.weight", "image_encoder.layers.3.blocks.1.attn.norm.bias", "image_encoder.layers.3.blocks.1.attn.qkv.weight", "image_encoder.layers.3.blocks.1.attn.qkv.bias", "image_encoder.layers.3.blocks.1.attn.proj.weight", "image_encoder.layers.3.blocks.1.attn.proj.bias", "image_encoder.layers.3.blocks.1.mlp.norm.weight", "image_encoder.layers.3.blocks.1.mlp.norm.bias", "image_encoder.layers.3.blocks.1.mlp.fc1.weight", "image_encoder.layers.3.blocks.1.mlp.fc1.bias", "image_encoder.layers.3.blocks.1.mlp.fc2.weight", "image_encoder.layers.3.blocks.1.mlp.fc2.bias", "image_encoder.layers.3.blocks.1.local_conv.c.weight", "image_encoder.layers.3.blocks.1.local_conv.bn.weight", "image_encoder.layers.3.blocks.1.local_conv.bn.bias", "image_encoder.layers.3.blocks.1.local_conv.bn.running_mean", "image_encoder.layers.3.blocks.1.local_conv.bn.running_var", "image_encoder.layers.3.blocks.1.local_conv.bn.num_batches_tracked", "image_encoder.norm_head.weight", "image_encoder.norm_head.bias", "image_encoder.head.weight", "image_encoder.head.bias", "image_encoder.patch_embed.seq.0.c.weight", "image_encoder.patch_embed.seq.0.bn.weight", "image_encoder.patch_embed.seq.0.bn.bias", "image_encoder.patch_embed.seq.0.bn.running_mean", "image_encoder.patch_embed.seq.0.bn.running_var", "image_encoder.patch_embed.seq.0.bn.num_batches_tracked", "image_encoder.patch_embed.seq.2.c.weight", 
"image_encoder.patch_embed.seq.2.bn.weight", "image_encoder.patch_embed.seq.2.bn.bias", "image_encoder.patch_embed.seq.2.bn.running_mean", "image_encoder.patch_embed.seq.2.bn.running_var", "image_encoder.patch_embed.seq.2.bn.num_batches_tracked", "mask_decoder.hf_token.weight", "mask_decoder.hf_mlp.layers.0.weight", "mask_decoder.hf_mlp.layers.0.bias", "mask_decoder.hf_mlp.layers.1.weight", "mask_decoder.hf_mlp.layers.1.bias", "mask_decoder.hf_mlp.layers.2.weight", "mask_decoder.hf_mlp.layers.2.bias", "mask_decoder.compress_vit_feat.0.weight", "mask_decoder.compress_vit_feat.0.bias", "mask_decoder.compress_vit_feat.1.weight", "mask_decoder.compress_vit_feat.1.bias", "mask_decoder.compress_vit_feat.3.weight", "mask_decoder.compress_vit_feat.3.bias", "mask_decoder.embedding_encoder.0.weight", "mask_decoder.embedding_encoder.0.bias", "mask_decoder.embedding_encoder.1.weight", "mask_decoder.embedding_encoder.1.bias", "mask_decoder.embedding_encoder.3.weight", "mask_decoder.embedding_encoder.3.bias", "mask_decoder.embedding_maskfeature.0.weight", "mask_decoder.embedding_maskfeature.0.bias", "mask_decoder.embedding_maskfeature.1.weight", "mask_decoder.embedding_maskfeature.1.bias", "mask_decoder.embedding_maskfeature.3.weight", "mask_decoder.embedding_maskfeature.3.bias". size mismatch for image_encoder.neck.0.weight: copying a param with shape torch.Size([256, 320, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 768, 1, 1]).

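For anyone reading the error above: the command passes --model-type vit_b together with the sam_hq_vit_tiny.pth checkpoint, and the key names in the message show the two do not line up. The model built by build_sam_vit_b expects "image_encoder.blocks.*" keys, while the checkpoint contains "image_encoder.layers.*" keys plus HQ decoder keys such as "mask_decoder.hf_token.weight". Below is a minimal sketch for checking this before launching training. The helper names describe_checkpoint and diff_keys are hypothetical (they are not part of the sam-hq repo), and the sketch works on plain lists of key names so it runs without PyTorch.

```python
from typing import Dict, Iterable, List, Tuple

# Key prefixes copied from the error message in this issue.
BLOCKS_PREFIX = "image_encoder.blocks."  # reported under "Missing key(s)"
LAYERS_PREFIX = "image_encoder.layers."  # reported under "Unexpected key(s)"
HQ_TOKEN_KEY = "mask_decoder.hf_token.weight"  # HQ decoder weight in the checkpoint


def describe_checkpoint(keys: Iterable[str]) -> Dict[str, object]:
    """Summarise which encoder key layout a checkpoint uses (hypothetical helper)."""
    key_set = set(keys)
    if any(k.startswith(LAYERS_PREFIX) for k in key_set):
        encoder = "layers"
    elif any(k.startswith(BLOCKS_PREFIX) for k in key_set):
        encoder = "blocks"
    else:
        encoder = "unknown"
    return {"encoder": encoder, "has_hq_decoder": HQ_TOKEN_KEY in key_set}


def diff_keys(
    model_keys: Iterable[str], checkpoint_keys: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) key lists, the same two groups that
    load_state_dict reports when the key sets differ."""
    model_set, ckpt_set = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_set - ckpt_set)      # in the model, not in the checkpoint
    unexpected = sorted(ckpt_set - model_set)   # in the checkpoint, not in the model
    return missing, unexpected


if __name__ == "__main__":
    # A few key names taken from the error message, for illustration only.
    checkpoint_keys = [
        "image_encoder.layers.0.blocks.0.conv1.c.weight",
        "image_encoder.neck.0.weight",
        "mask_decoder.hf_token.weight",
    ]
    model_keys = [
        "image_encoder.blocks.0.norm1.weight",
        "image_encoder.neck.0.weight",
    ]
    print(describe_checkpoint(checkpoint_keys))
    print(diff_keys(model_keys, checkpoint_keys))
```

With PyTorch installed, the real key lists could be obtained with torch.load(path, map_location="cpu").keys() for the checkpoint (assuming the file holds a plain state dict, as the sam.load_state_dict(state_dict) call in the traceback suggests) and model.state_dict().keys() for the model. Note that matching key names are not enough on their own: the error also reports a shape mismatch for image_encoder.neck.0.weight, [256, 320, 1, 1] in the checkpoint versus [256, 768, 1, 1] in the model.
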
bobo59 commented 1 week ago

Hello, I have the same problem. Have you solved it yet?