InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

Failed to convert LLaVA Llama3 .pth to HuggingFace format #827

Closed Mikael17125 closed 3 months ago

Mikael17125 commented 3 months ago

After fine-tuning, I can convert the .pth checkpoint to the official and xtuner formats; however, I cannot convert it to the HuggingFace format because of the errors below. Please help me:

xtuner convert pth_to_hf llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py iter_752.pth/ iter_752_hf --safe-serialization --save-format huggingface
[2024-07-11 13:21:20,077] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
[2024-07-11 13:21:23,585] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.80it/s]
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
Processing zero checkpoint 'iter_752.pth/'
Load Checkpoints: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:08<00:00,  2.11s/it]
Detected checkpoint of type zero stage 2, world_size: 4
Parsing checkpoint created by deepspeed==0.14.4
Reconstructed state dict with 452 params 1363156992 elements
Load State Dict: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 452/452 [00:00<00:00, 958.63it/s]
07/11 13:21:50 - mmengine - INFO - Load PTH model from iter_752.pth/
07/11 13:22:03 - mmengine - INFO - Convert LLM to float16
Traceback (most recent call last):
  File "/home/oem/xtuner/xtuner/tools/model_converters/pth_to_hf.py", line 139, in <module>
    main()
  File "/home/oem/xtuner/xtuner/tools/model_converters/pth_to_hf.py", line 127, in main
    model.to_hf(
  File "/home/oem/xtuner/xtuner/model/llava.py", line 345, in to_hf
    self.to_huggingface_llava(cfg, save_dir, fp32,
  File "/home/oem/xtuner/xtuner/model/llava.py", line 474, in to_huggingface_llava
    model.load_state_dict(state_dict, strict=True, assign=True)
  File "/home/oem/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlavaForConditionalGeneration:
    Missing key(s) in state_dict: "vision_tower.vision_model.embeddings.class_embedding", "vision_tower.vision_model.embeddings.patch_embedding.weight", "vision_tower.vision_model.embeddings.position_embedding.weight", "vision_tower.vision_model.pre_layrnorm.weight", "vision_tower.vision_model.pre_layrnorm.bias", "vision_tower.vision_model.encoder.layers.0.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.0.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.0.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.0.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.0.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.0.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.0.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.0.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.0.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.0.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.0.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.0.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.0.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.0.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.1.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.1.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.1.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.1.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.1.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.1.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.1.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.1.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.1.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.1.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.1.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.1.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.1.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.1.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.2.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.2.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.2.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.2.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.2.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.2.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.2.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.2.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.2.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.2.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.2.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.2.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.2.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.2.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.2.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.2.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.3.self_attn.k_proj.weight", 
"vision_tower.vision_model.encoder.layers.3.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.3.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.3.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.3.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.3.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.3.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.3.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.3.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.3.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.3.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.3.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.3.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.3.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.3.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.3.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.4.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.4.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.4.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.4.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.4.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.4.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.4.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.4.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.4.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.4.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.4.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.4.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.4.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.4.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.4.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.4.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.5.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.5.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.5.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.5.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.5.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.5.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.5.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.5.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.5.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.5.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.5.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.5.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.5.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.5.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.5.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.5.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.6.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.6.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.6.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.6.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.6.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.6.self_attn.q_proj.bias", 
"vision_tower.vision_model.encoder.layers.6.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.6.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.6.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.6.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.6.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.6.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.6.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.6.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.6.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.6.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.7.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.7.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.7.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.7.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.7.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.7.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.7.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.7.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.7.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.7.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.7.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.7.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.7.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.7.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.7.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.7.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.8.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.8.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.8.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.8.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.8.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.8.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.8.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.8.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.8.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.8.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.8.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.8.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.8.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.8.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.8.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.8.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.9.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.9.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.9.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.9.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.9.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.9.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.9.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.9.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.9.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.9.mlp.fc1.bias", 
"vision_tower.vision_model.encoder.layers.9.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.9.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.9.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.9.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.10.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.10.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.10.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.10.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.10.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.10.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.10.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.10.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.10.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.10.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.10.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.10.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.10.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.10.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.10.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.10.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.11.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.11.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.11.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.11.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.11.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.11.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.11.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.11.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.11.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.11.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.11.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.11.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.11.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.11.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.11.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.11.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.12.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.12.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.12.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.12.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.12.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.12.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.12.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.12.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.12.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.12.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.12.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.12.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.12.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.12.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.12.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.12.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.13.self_attn.k_proj.weight", 
"vision_tower.vision_model.encoder.layers.13.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.13.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.13.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.13.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.13.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.13.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.13.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.13.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.13.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.13.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.13.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.13.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.13.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.13.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.13.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.14.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.14.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.14.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.14.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.14.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.14.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.14.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.14.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.14.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.14.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.14.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.14.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.14.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.14.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.14.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.14.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.15.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.15.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.15.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.15.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.15.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.15.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.15.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.15.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.15.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.15.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.15.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.15.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.15.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.15.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.15.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.15.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.16.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.16.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.16.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.16.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.16.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.16.self_attn.q_proj.bias", 
"vision_tower.vision_model.encoder.layers.16.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.16.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.16.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.16.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.16.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.16.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.16.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.16.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.16.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.16.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.17.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.17.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.17.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.17.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.17.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.17.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.17.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.17.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.17.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.17.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.17.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.17.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.17.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.17.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.17.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.17.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.18.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.18.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.18.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.18.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.18.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.18.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.18.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.18.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.18.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.18.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.18.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.18.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.18.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.18.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.18.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.18.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.19.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.19.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.19.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.19.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.19.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.19.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.19.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.19.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.19.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.19.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.19.mlp.fc1.weight", 
"vision_tower.vision_model.encoder.layers.19.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.19.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.19.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.19.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.19.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.20.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.20.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.20.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.20.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.20.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.20.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.20.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.20.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.20.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.20.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.20.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.20.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.20.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.20.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.20.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.20.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.21.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.21.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.21.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.21.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.21.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.21.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.21.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.21.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.21.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.21.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.21.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.21.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.21.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.21.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.21.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.21.layer_norm2.bias", "vision_tower.vision_model.encoder.layers.22.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.22.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.22.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.22.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.22.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.22.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.22.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.22.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.22.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.22.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.22.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.22.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.22.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.22.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.22.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.22.layer_norm2.bias", 
"vision_tower.vision_model.encoder.layers.23.self_attn.k_proj.weight", "vision_tower.vision_model.encoder.layers.23.self_attn.k_proj.bias", "vision_tower.vision_model.encoder.layers.23.self_attn.v_proj.weight", "vision_tower.vision_model.encoder.layers.23.self_attn.v_proj.bias", "vision_tower.vision_model.encoder.layers.23.self_attn.q_proj.weight", "vision_tower.vision_model.encoder.layers.23.self_attn.q_proj.bias", "vision_tower.vision_model.encoder.layers.23.self_attn.out_proj.weight", "vision_tower.vision_model.encoder.layers.23.self_attn.out_proj.bias", "vision_tower.vision_model.encoder.layers.23.layer_norm1.weight", "vision_tower.vision_model.encoder.layers.23.layer_norm1.bias", "vision_tower.vision_model.encoder.layers.23.mlp.fc1.weight", "vision_tower.vision_model.encoder.layers.23.mlp.fc1.bias", "vision_tower.vision_model.encoder.layers.23.mlp.fc2.weight", "vision_tower.vision_model.encoder.layers.23.mlp.fc2.bias", "vision_tower.vision_model.encoder.layers.23.layer_norm2.weight", "vision_tower.vision_model.encoder.layers.23.layer_norm2.bias", "vision_tower.vision_model.post_layernorm.weight", "vision_tower.vision_model.post_layernorm.bias".

Here is my fine-tuning config:

# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          CLIPImageProcessor, CLIPVisionModel)

from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE

#######################################################################
#                          PART 1  Settings                           #
#######################################################################
# Model
llm_name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
pretrained_pth = '/home/oem/xtuner/pretrained/iter_9742.pth'  # noqa: E501

# Data
data_root = '/home/oem/xtuner/screen_agent/'
data_path = data_root + 'screen_agent_instruct.json'
image_folder = data_root + 'images/'
prompt_template = PROMPT_TEMPLATE.llama3_chat
max_length = int(2048 - (336 / 14)**2)
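# i.e. the 2048-token context minus the (336 / 14)**2 = 576 visual patch tokens each image occupies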

# Scheduler & Optimizer
batch_size = 1  # per_device
accumulative_counts = 2
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip
warmup_ratio = 0.03

# Save
save_steps = 100
save_total_limit = 2  # Maximum checkpoints to keep (-1 means unlimited)

# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = ''
evaluation_images = 'screen_agent/images/image_0045.jpg'
evaluation_inputs = [ 'Please describe this picture']

#######################################################################
#            PART 2  Model & Tokenizer & Image Processor              #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=llm_name_or_path,
    trust_remote_code=True,
    padding_side='right')

image_processor = dict(
    type=CLIPImageProcessor.from_pretrained,
    pretrained_model_name_or_path=visual_encoder_name_or_path,
    trust_remote_code=True)

model = dict(
    type=LLaVAModel,
    freeze_llm=False,
    freeze_visual_encoder=True,
    pretrained_pth=pretrained_pth,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=llm_name_or_path,
        trust_remote_code=True),
    llm_lora=dict(
        type=LoraConfig,
        r=512,
        lora_alpha=16,
        lora_dropout=0.05,
        bias='none',
        task_type='CAUSAL_LM'),
    visual_encoder=dict(
        type=CLIPVisionModel.from_pretrained,
        pretrained_model_name_or_path=visual_encoder_name_or_path)
    )

#######################################################################
#                      PART 3  Dataset & Dataloader                   #
#######################################################################
llava_dataset = dict(
    type=LLaVADataset,
    data_path=data_path,
    image_folder=image_folder,
    tokenizer=tokenizer,
    image_processor=image_processor,
    dataset_map_fn=llava_map_fn,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    max_length=max_length,
    pad_image_to_square=True)

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    pin_memory=True,
    dataset=llava_dataset,
    sampler=dict(
        type=LengthGroupedSampler,
        length_property='modality_length',
        per_device_batch_size=batch_size * accumulative_counts),
    collate_fn=dict(type=default_collate_fn))

#######################################################################
#                    PART 4  Scheduler & Optimizer                    #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md  # noqa: E501
param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        end=max_epochs,
        convert_to_iter_based=True)
]

# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)

#######################################################################
#                           PART 5  Runtime                           #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(
        type=EvaluateChatHook,
        tokenizer=tokenizer,
        image_processor=image_processor,
        every_n_iters=evaluation_freq,
        evaluation_inputs=evaluation_inputs,
        evaluation_images=evaluation_images,
        system=SYSTEM,
        prompt_template=prompt_template)
]

# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 10 iterations.
    logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per `save_steps`.
    checkpoint=dict(
        type=CheckpointHook,
        by_epoch=False,
        interval=save_steps,
        max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)

# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)

# set visualizer
visualizer = None

# set log level
log_level = 'INFO'

# load from which checkpoint
load_from = None

# whether to resume training from the loaded checkpoint
resume = False

# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)

# set log processor
log_processor = dict(by_epoch=False)

hhaAndroid commented 3 months ago

@Mikael17125 I've looked at the code and there does seem to be an issue: it doesn't handle the case where the ViT (Vision Transformer) doesn't need training. You can simply force the need_visual_encoder flag to True to address this. https://github.com/InternLM/xtuner/blob/main/xtuner/model/llava.py#L574
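
For reference, a minimal sketch of that workaround. Only the flag name comes from the comment above; the surrounding code in llava.py differs between xtuner versions, and the reading of the cause is an inference from the traceback, not verified against the source.

# Illustrative sketch, not the actual xtuner source: in xtuner/model/llava.py
# (near the line linked above) a flag decides whether the visual encoder's
# weights are written into the HuggingFace export. With the ViT frozen during
# fine-tuning, the flag ends up False, the exported state dict lacks every
# "vision_tower.*" key, and the strict load_state_dict call in the traceback
# fails. Forcing it on should make the converter also export the (unchanged)
# CLIP weights that were loaded when the model was built from the config.
need_visual_encoder = True  # workaround: always export the visual encoder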

Mikael17125 commented 3 months ago

That works. I tried fine-tuning the ViT, and it works well.
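
For anyone hitting the same error, a minimal sketch of the config change this implies: it is the same model block as in the config posted above, with the visual encoder unfrozen so its weights end up in the saved .pth and the HuggingFace conversion finds all "vision_tower.*" keys. Whether you actually want to train the full ViT is a separate decision; attaching a visual-encoder LoRA, as the stock xtuner LoRA configs do, is another option.

model = dict(
    type=LLaVAModel,
    freeze_llm=False,
    freeze_visual_encoder=False,  # changed from True: the ViT is now trained and saved with the checkpoint
    pretrained_pth=pretrained_pth,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=llm_name_or_path,
        trust_remote_code=True),
    llm_lora=dict(
        type=LoraConfig,
        r=512,
        lora_alpha=16,
        lora_dropout=0.05,
        bias='none',
        task_type='CAUSAL_LM'),
    visual_encoder=dict(
        type=CLIPVisionModel.from_pretrained,
        pretrained_model_name_or_path=visual_encoder_name_or_path))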