Open choyakawa opened 6 months ago
@choyakawa, hi!
Thank you for your attention. The training script for LLaVA1.6 (Next) has not been released yet. We will try to follow up once it is released.
Hi @choyakawa
@hhaAndroid is working on it.
Please subscribe to https://github.com/InternLM/xtuner/pull/460 for updates!
Failed on llava_internlm2_chat_7b_clip_vit_large_p14_anyshape_e1_gpu8_pretrain with DeepSpeed zero3. Is there anything wrong?
NCCL_IB_TIMEOUT=120 XTUNER_DATASET_TIMEOUT=120 NCCL_DEBUG=INFO NPROC_PER_NODE=8 NNODES=4 PORT=12345 ADDR=server0 NODE_RANK=0 xtuner train llava_internlm2_chat_7b_clip_vit_large_p14_anyshape_e1_gpu8_pretrain --deepspeed deepspeed_zero3
File "/home/user/.local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3926, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/home/user/.local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/home/user/.local/lib/python3.11/site-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([92544, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.
[2024-03-17 12:29:20,707] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 79735) of binary: /usr/local/bin/python3
Traceback (most recent call last):
zero2 works, but replicating LLaVA 1.6 with a 34B model is challenging without zero3.
@LZHgrla Do you have any idea why zero3 fails? I have no idea why the image features from CLIP have shape torch.Size([0]) here. It also seems that a batch size greater than 1 does not work on zero2 either.
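One possible reading of the torch.Size([0]) shape, not confirmed anywhere in this thread: under zero3, DeepSpeed partitions each parameter across ranks and leaves an empty placeholder tensor locally, recording the full shape in a `ds_shape` attribute. The sketch below is a hypothetical helper (`find_empty_params` is not part of xtuner or DeepSpeed) that lists parameters whose local tensor is empty, which may help tell a partitioned weight from a genuinely missing one. It only duck-types on `.shape` and `ds_shape`, so it runs without torch installed, and the parameter names in the sample data are made up.

```python
from types import SimpleNamespace


def find_empty_params(named_params):
    """Return (name, local_shape, full_shape) for every locally empty parameter.

    full_shape comes from the `ds_shape` attribute if present (an assumption
    about how the zero3 placeholder is annotated); otherwise it is None.
    """
    report = []
    for name, param in named_params:
        local_shape = tuple(param.shape)
        if 0 not in local_shape:
            continue  # parameter holds real data on this rank
        full_shape = getattr(param, "ds_shape", None)
        report.append((name, local_shape,
                       tuple(full_shape) if full_shape is not None else None))
    return report


# Stand-ins for real parameters (hypothetical names), so the example runs alone.
fake_params = [
    ("embed.weight", SimpleNamespace(shape=(0,), ds_shape=(92544, 4096))),
    ("proj.weight", SimpleNamespace(shape=(4096, 1024))),
    ("image_newline", SimpleNamespace(shape=(0,))),
]
for row in find_empty_params(fake_params):
    print(row)
```

With a real model the call would be `find_empty_params(model.named_parameters())`. An entry that is empty and carries no recorded full shape would point away from partitioning as the cause.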
@choyakawa
Quantization is not compatible with zero3, so you should remove the quantization_config of model.llm when using zero3.
LLaVA 1.6 support is still WIP. If you are trying more advanced setups (such as using a 34B LLM), you are welcome to share detailed configs and executable commands. We will run some tests once development is finished to improve robustness.
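A minimal sketch of the suggestion above, assuming the config follows the usual xtuner layout where `model` is a dict and `model["llm"]` may carry a `quantization_config` entry. The helper name `without_quantization` and the sample values are illustrative and are not taken from the actual config.

```python
import copy


def without_quantization(model_cfg):
    """Return a copy of model_cfg with llm.quantization_config removed."""
    cfg = copy.deepcopy(model_cfg)  # leave the caller's config untouched
    cfg.get("llm", {}).pop("quantization_config", None)
    return cfg


# Illustrative config fragment; every value here is a placeholder.
model = {
    "type": "LLaVAModel",
    "llm": {
        "pretrained_model_name_or_path": "internlm/internlm2-chat-7b",
        "quantization_config": {"load_in_4bit": True},
    },
}

zero3_model = without_quantization(model)
print("quantization_config" in zero3_model["llm"])  # False
print("quantization_config" in model["llm"])        # True
```

In a real config file it is usually simpler to delete the `quantization_config=...` entry by hand; the helper only makes the intended end state explicit.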
I am not using quantization; the failure above was with bf16. I have also tried open_clip instead of the OpenAI ViT-L, and that does not work either.
@LZHgrla
How should I resolve this error?
RuntimeError: The expanded size of the tensor (4096) must match the existing size (0) at non-singleton dimension 0. Target sizes: [4096, 32, 1]. Tensor sizes: [0, 1, 1]
model = self.train_loop.run() # type: ignore
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 270, in run
self.runner.call_hook('before_train')
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/_flexible_runner.py", line 1271, in call_hook
getattr(hook, fn_name)(self, **kwargs)
File "/export/App/training_platform/PinoModel/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 221, in before_train
self._generate_samples(runner, max_new_tokens=50)
File "/export/App/training_platform/PinoModel/xtuner/xtuner/engine/hooks/evaluate_chat_hook.py", line 207, in _generate_samples
self._eval_images(runner, model, device, max_new_tokens,
File "/export/App/training_platform/PinoModel/xtuner/xtuner/engine/hooks/anyshape_evaluate_chat_hook.py", line 53, in _eval_images
image_features = model.preprocess_for_pixel_values({
File "/export/App/training_platform/PinoModel/xtuner/xtuner/model/anyshape_llava.py", line 109, in preprocess_for_pixel_values
self.image_newline[:, None, None].expand(
RuntimeError: The expanded size of the tensor (4096) must match the existing size (0) at non-singleton dimension 0. Target sizes: [4096, 32, 1]. Tensor sizes: [0, 1, 1]
[2024-04-24 13:52:01,685] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2444983) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
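The RuntimeError above follows from how `Tensor.expand` works: it can only stretch dimensions of size 1, so a dimension of size 0 cannot become 4096. The pure-Python sketch below mirrors that rule for illustration (the helper `can_expand` is hypothetical and is not a torch API). It suggests that `image_newline` was empty when `preprocess_for_pixel_values` ran, for a reason this thread does not establish.

```python
def can_expand(src_shape, target_shape):
    """Check whether a tensor of src_shape could be expanded to target_shape.

    Mirrors the rule used by torch.Tensor.expand: shapes are aligned from
    the right, and each source dimension must equal the target dimension
    or be 1. A target of -1 keeps the source size for existing dimensions.
    """
    if len(target_shape) < len(src_shape):
        return False
    offset = len(target_shape) - len(src_shape)
    # New leading dimensions must have an explicit size.
    if any(size < 0 for size in target_shape[:offset]):
        return False
    for src, tgt in zip(src_shape, target_shape[offset:]):
        if tgt == -1 or src == tgt or src == 1:
            continue
        return False
    return True


# Shapes from the error message: an empty image_newline after [:, None, None].
print(can_expand((0, 1, 1), (4096, 32, 1)))     # False
# Shape a fully materialized 4096-element image_newline would have.
print(can_expand((4096, 1, 1), (4096, 32, 1)))  # True
```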
Support finetuning LLaVA 1.6