RuntimeError: Error(s) in loading state_dict for STCConnector while finetuning with lora.sh

[2024-07-09 05:05:52,623] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-09 05:06:01,363] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-09 05:06:01,363] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type mistral to instantiate a model of type llama. This is not supported for all configurations of models and can yield errors.
[2024-07-09 05:06:02,338] [INFO] [partition_parameters.py:349:__exit__] finished initializing model - num_params = 404, num_elems = 7.73B
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.09s/it]
Adding LoRA adapters...
[2024-07-09 05:07:47,385] [INFO] [partition_parameters.py:349:__exit__] finished initializing model - num_params = 795, num_elems = 8.03B
Traceback (most recent call last):
  File "/workspace/VideoLLaMA2/videollama2/train_flash_attn.py", line 12, in <module>
    train(attn_implementation="flash_attention_2")
  File "/workspace/VideoLLaMA2/./videollama2/train.py", line 800, in train
    model.get_model().initialize_vision_modules(model_args=model_args, fsdp=training_args.fsdp)
  File "/workspace/VideoLLaMA2/./videollama2/model/videollama2_arch.py", line 99, in initialize_vision_modules
    self.mm_projector.load_state_dict(get_w(mm_projector_weights, 'mm_projector'), strict=False)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for STCConnector:
        size mismatch for s1.b1.conv1.conv.weight: copying a param with shape torch.Size([4096, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b1.conv1.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b1.conv1.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b1.conv2.conv.weight: copying a param with shape torch.Size([4096, 1, 3, 3]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b1.conv2.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b1.conv2.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b1.se.fc1.weight: copying a param with shape torch.Size([256, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b1.se.fc1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b1.se.fc2.weight: copying a param with shape torch.Size([4096, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b1.se.fc2.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b1.conv3.conv.weight: copying a param with shape torch.Size([4096, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b1.conv3.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b1.conv3.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b1.downsample.conv.weight: copying a param with shape torch.Size([4096, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b1.downsample.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b1.downsample.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b2.conv1.conv.weight: copying a param with shape torch.Size([4096, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b2.conv1.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b2.conv1.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b2.conv2.conv.weight: copying a param with shape torch.Size([4096, 1, 3, 3]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b2.conv2.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b2.conv2.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b2.se.fc1.weight: copying a param with shape torch.Size([1024, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b2.se.fc1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b2.se.fc2.weight: copying a param with shape torch.Size([4096, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b2.se.fc2.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b2.conv3.conv.weight: copying a param with shape torch.Size([4096, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b2.conv3.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b2.conv3.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b3.conv1.conv.weight: copying a param with shape torch.Size([4096, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b3.conv1.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b3.conv1.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b3.conv2.conv.weight: copying a param with shape torch.Size([4096, 1, 3, 3]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b3.conv2.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b3.conv2.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b3.se.fc1.weight: copying a param with shape torch.Size([1024, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b3.se.fc1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b3.se.fc2.weight: copying a param with shape torch.Size([4096, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b3.se.fc2.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b3.conv3.conv.weight: copying a param with shape torch.Size([4096, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b3.conv3.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b3.conv3.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b4.conv1.conv.weight: copying a param with shape torch.Size([4096, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b4.conv1.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b4.conv1.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b4.conv2.conv.weight: copying a param with shape torch.Size([4096, 1, 3, 3]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b4.conv2.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b4.conv2.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b4.se.fc1.weight: copying a param with shape torch.Size([1024, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b4.se.fc1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b4.se.fc2.weight: copying a param with shape torch.Size([4096, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b4.se.fc2.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b4.conv3.conv.weight: copying a param with shape torch.Size([4096, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b4.conv3.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s1.b4.conv3.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for sampler.0.weight: copying a param with shape torch.Size([4096, 4096, 2, 2, 2]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for sampler.0.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b1.conv1.conv.weight: copying a param with shape torch.Size([4096, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b1.conv1.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b1.conv1.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b1.conv2.conv.weight: copying a param with shape torch.Size([4096, 1, 3, 3]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b1.conv2.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b1.conv2.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b1.se.fc1.weight: copying a param with shape torch.Size([1024, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b1.se.fc1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b1.se.fc2.weight: copying a param with shape torch.Size([4096, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b1.se.fc2.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b1.conv3.conv.weight: copying a param with shape torch.Size([4096, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b1.conv3.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b1.conv3.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b2.conv1.conv.weight: copying a param with shape torch.Size([4096, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b2.conv1.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b2.conv1.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b2.conv2.conv.weight: copying a param with shape torch.Size([4096, 1, 3, 3]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b2.conv2.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b2.conv2.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b2.se.fc1.weight: copying a param with shape torch.Size([1024, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b2.se.fc1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b2.se.fc2.weight: copying a param with shape torch.Size([4096, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b2.se.fc2.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b2.conv3.conv.weight: copying a param with shape torch.Size([4096, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b2.conv3.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b2.conv3.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b3.conv1.conv.weight: copying a param with shape torch.Size([4096, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b3.conv1.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b3.conv1.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b3.conv2.conv.weight: copying a param with shape torch.Size([4096, 1, 3, 3]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b3.conv2.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b3.conv2.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b3.se.fc1.weight: copying a param with shape torch.Size([1024, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b3.se.fc1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b3.se.fc2.weight: copying a param with shape torch.Size([4096, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b3.se.fc2.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b3.conv3.conv.weight: copying a param with shape torch.Size([4096, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b3.conv3.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b3.conv3.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b4.conv1.conv.weight: copying a param with shape torch.Size([4096, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b4.conv1.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b4.conv1.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b4.conv2.conv.weight: copying a param with shape torch.Size([4096, 1, 3, 3]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b4.conv2.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b4.conv2.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b4.se.fc1.weight: copying a param with shape torch.Size([1024, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b4.se.fc1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b4.se.fc2.weight: copying a param with shape torch.Size([4096, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b4.se.fc2.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b4.conv3.conv.weight: copying a param with shape torch.Size([4096, 4096, 1, 1]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b4.conv3.bn.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for s2.b4.conv3.bn.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for readout.0.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for readout.0.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for readout.2.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for readout.2.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11184) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
videollama2/train_flash_attn.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-09_05:07:53
  host      : 3912a2da3279
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 11184)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I updated the finetuning script as suggested in #40, and now I get the new error shown above.
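One thing that stands out in the traceback: every mismatch reports `torch.Size([0])` for the current model, so the `mm_projector` parameters look like empty placeholders at load time rather than layers with a different shape. The `partition_parameters.py` lines in the log hint that DeepSpeed parameter partitioning (ZeRO stage 3) may be involved, but that is an assumption and not confirmed here. Below is a minimal, stdlib-only sketch for checking this pattern in a saved log. The helper names (`parse_shape`, `summarize_mismatches`) are hypothetical and not part of VideoLLaMA2 or PyTorch.

```python
import re

# Matches one "size mismatch" line from the load_state_dict error message above.
MISMATCH = re.compile(
    r"size mismatch for (?P<name>\S+): copying a param with shape "
    r"torch\.Size\(\[(?P<ckpt>[^\]]*)\]\) from checkpoint, "
    r"the shape in current model is torch\.Size\(\[(?P<model>[^\]]*)\]\)"
)


def parse_shape(text):
    """Turn a string such as '4096, 1024, 1, 1' into (4096, 1024, 1, 1)."""
    return tuple(int(part) for part in text.split(",") if part.strip())


def summarize_mismatches(log_text):
    """Count mismatch lines and how many report an empty (size 0) model param."""
    rows = [
        (m["name"], parse_shape(m["ckpt"]), parse_shape(m["model"]))
        for m in MISMATCH.finditer(log_text)
    ]
    empty = [name for name, _, model_shape in rows if model_shape == (0,)]
    return {
        "total": len(rows),
        "empty_in_model": len(empty),
        "all_empty": bool(rows) and len(empty) == len(rows),
    }


# Two lines copied from the traceback in this issue.
sample = (
    "size mismatch for s1.b1.conv1.conv.weight: copying a param with shape "
    "torch.Size([4096, 1024, 1, 1]) from checkpoint, the shape in current "
    "model is torch.Size([0]).\n"
    "size mismatch for readout.2.bias: copying a param with shape "
    "torch.Size([4096]) from checkpoint, the shape in current model is "
    "torch.Size([0]).\n"
)

summary = summarize_mismatches(sample)
print(summary)
```

If `all_empty` is true for the full log, the checkpoint shapes themselves are probably fine, and the thing to look at is how the projector parameters are materialized before `load_state_dict` is called (for example, which DeepSpeed config `lora.sh` passes).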