InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

Support finetuning LLaVA 1.6 #432

Open choyakawa opened 6 months ago

choyakawa commented 6 months ago

Support finetuning LLaVA 1.6

LZHgrla commented 6 months ago

@choyakawa, hi!

Thank you for your interest. The training script for LLaVA 1.6 (NeXT) has not been released upstream yet. We will try to follow up once it is released.

LZHgrla commented 6 months ago

Hi @choyakawa

@hhaAndroid is working on it.

Please subscribe to https://github.com/InternLM/xtuner/pull/460!

choyakawa commented 6 months ago

Failed on llava_internlm2_chat_7b_clip_vit_large_p14_anyshape_e1_gpu8_pretrain with DeepSpeed ZeRO-3. Is there anything wrong with the following launch?

NCCL_IB_TIMEOUT=120 XTUNER_DATASET_TIMEOUT=120 NCCL_DEBUG=INFO NPROC_PER_NODE=8 NNODES=4 PORT=12345 ADDR=server0 NODE_RANK=0 xtuner train llava_internlm2_chat_7b_clip_vit_large_p14_anyshape_e1_gpu8_pretrain --deepspeed deepspeed_zero3

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3926, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/user/.local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/user/.local/lib/python3.11/site-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([92544, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.
[2024-03-17 12:29:20,707] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 79735) of binary: /usr/local/bin/python3
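The torch.Size([0]) in this traceback is consistent with how DeepSpeed ZeRO-3 partitions parameters: each rank keeps only its own shard, and the parameter's local data becomes an empty placeholder until it is gathered. The snippet below is a minimal, illustrative sketch (not taken from this thread) showing the placeholder shape and how gathering restores it; it assumes a launcher such as torchrun and at least one GPU.

```python
# Illustrative only: how ZeRO-3 parameter partitioning produces torch.Size([0]) placeholders.
# Run under a launcher so the distributed env vars are set, e.g.:
#   torchrun --nproc_per_node=1 zero3_shape_demo.py
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
}

# Parameters created inside zero.Init are partitioned across ranks immediately;
# the local .data tensor is an empty placeholder.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    layer = torch.nn.Linear(4096, 92544, bias=False)

print(layer.weight.shape)     # torch.Size([0])          -- partitioned placeholder
print(layer.weight.ds_shape)  # torch.Size([92544, 4096]) -- the logical shape

# Reading the real values requires gathering the parameter first.
with deepspeed.zero.GatheredParameters(layer.weight):
    print(layer.weight.shape)  # full shape while inside the context
```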
choyakawa commented 6 months ago

ZeRO-2 is OK, but replicating LLaVA 1.6 with a 34B model is challenging without ZeRO-3.

choyakawa commented 6 months ago

@LZHgrla Do you have any idea about the ZeRO-3 failure? I have no idea why the image features from CLIP have shape torch.Size([0]) here. It seems that batch size > 1 on ZeRO-2 won't work either.

LZHgrla commented 6 months ago

@choyakawa Quantization is not compatible with ZeRO-3, so you should remove the quantization_config from model.llm when using ZeRO-3.

The LLaVA 1.6 features are still a work in progress. If you make any advanced attempts (such as using a 34B LLM), you are welcome to share detailed configs and executable commands; we will run some tests after development to improve robustness.
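For reference, here is a hedged sketch of what that change could look like in an xtuner-style LLaVA config; the model paths and surrounding fields are illustrative rather than copied from the upstream config. The point is simply that model.llm loads the LLM with from_pretrained and carries no quantization_config entry when launching with --deepspeed deepspeed_zero3.

```python
import torch
from transformers import AutoModelForCausalLM, CLIPVisionModel
from xtuner.model import LLaVAModel

# Illustrative checkpoints; substitute your own.
llm_name_or_path = 'internlm/internlm2-chat-7b'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'

model = dict(
    type=LLaVAModel,
    freeze_llm=True,
    freeze_visual_encoder=True,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=llm_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
        # For ZeRO-3, do not add a quantization_config (BitsAndBytes) entry here:
        # 4/8-bit loading and ZeRO-3 parameter partitioning do not mix.
    ),
    visual_encoder=dict(
        type=CLIPVisionModel.from_pretrained,
        pretrained_model_name_or_path=visual_encoder_name_or_path),
)
```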

choyakawa commented 6 months ago

I am not using quantization; the failure above was with bf16. I have also tried open_clip instead of the OpenAI ViT-L, and it does not work either.

awzhgw commented 4 months ago

@LZHgrla

How should I resolve this error?

RuntimeError: The expanded size of the tensor (4096) must match the existing size (0) at non-singleton dimension 0.  Target sizes: [4096, 32, 1].  Tensor sizes: [0, 1, 1]
    model = self.train_loop.run()  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 270, in run
    self.runner.call_hook('before_train')
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/_flexible_runner.py", line 1271, in call_hook
    getattr(hook, fn_name)(self, **kwargs)
  File "/export/App/training_platform/PinoModel/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 221, in before_train
    self._generate_samples(runner, max_new_tokens=50)
  File "/export/App/training_platform/PinoModel/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 207, in _generate_samples
    self._eval_images(runner, model, device, max_new_tokens,
  File "/export/App/training_platform/PinoModel/xtuner/xtuner/engine/hooks/anyshape_evaluate_chat_hook.py", line 53, in _eval_images
    image_features = model.preprocess_for_pixel_values({
  File "/export/App/training_platform/PinoModel/xtuner/xtuner/model/anyshape_llava.py", line 109, in preprocess_for_pixel_values
    self.image_newline[:, None, None].expand(
RuntimeError: The expanded size of the tensor (4096) must match the existing size (0) at non-singleton dimension 0.  Target sizes: [4096, 32, 1].  Tensor sizes: [0, 1, 1]
[2024-04-24 13:52:01,685] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2444983) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
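The empty tensor in this second report (Tensor sizes: [0, 1, 1]) looks like the same ZeRO-3 symptom as above: image_newline appears to be a partitioned parameter whose local data is empty when the evaluate-chat hook reads it directly. The helper below is only an illustrative sketch of one possible workaround, not a confirmed fix in xtuner: gather the parameter before expanding it.

```python
import deepspeed
import torch


def expand_image_newline(image_newline: torch.nn.Parameter,
                         feature_dim: int, num_patches: int) -> torch.Tensor:
    """Gather a ZeRO-3 partitioned parameter before reading its values.

    Illustrative helper, not part of xtuner. Inside GatheredParameters the
    parameter temporarily has its full shape again; .clone() keeps the result
    valid after the context exits and the parameter is re-partitioned.
    """
    with deepspeed.zero.GatheredParameters(image_newline):
        return image_newline[:, None, None].expand(
            feature_dim, num_patches, 1).clone()
```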