InternLM / InternLM-XComposer

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Apache License 2.0

ZERO3 + Offload CPU Error when fine-tuning InternLM-XComposer2 #374

Open Coobiw opened 2 months ago

Coobiw commented 2 months ago

Hi, thanks for your great work! I am fine-tuning InternLM-XComposer2 with the projector and the whole LLM unfrozen and the ViT frozen. To avoid OOM, I switched to ZeRO-3 and offloaded the optimizer to CPU (by changing the device in https://github.com/InternLM/InternLM-XComposer/blob/main/InternLM-XComposer-2.0/finetune/ds_config_zero2.json#L17 to `cpu`). This raises the error below; the original ds_config_zero2.json does not. How can I solve it? Thanks for your advice and reply!
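For reference, the change described above amounts to roughly this fragment of the DeepSpeed config (field names follow the standard DeepSpeed config schema; the rest of the file is unchanged):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```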

Error Message:

```
Traceback (most recent call last):
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/qbw/ChartLLM/InternLM-XComposer/finetune/finetune_smoe.py", line 396, in <module>
    train()
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/qbw/ChartLLM/InternLM-XComposer/finetune/finetune_smoe.py", line 297, in train
    model = transformers.AutoModelForCausalLM.from_pretrained(
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/miniconda3/envs/intern_clean/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/miniconda3/envs/intern_clean/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2966, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/miniconda3/envs/intern_clean/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/qbw/cache/huggingface/modules/transformers_modules/internlm-xcomposer2-vl-7b/modeling_internlm_xcomposer2.py", line 67, in __init__
    self.vit = build_vision_tower()
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/qbw/cache/huggingface/modules/transformers_modules/internlm-xcomposer2-vl-7b/build_mlp.py", line 11, in build_vision_tower
    return CLIPVisionTower(vision_tower)
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/miniconda3/envs/intern_clean/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/qbw/cache/huggingface/modules/transformers_modules/internlm-xcomposer2-vl-7b/build_mlp.py", line 59, in __init__
    self.resize_pos()
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/qbw/cache/huggingface/modules/transformers_modules/internlm-xcomposer2-vl-7b/build_mlp.py", line 85, in resize_pos
    pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size,
RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 24, 24, 0] because the unspecified dimension size -1 can be any value and is ambiguous
```
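My reading of the traceback (not confirmed by the maintainers): with ZeRO-3 enabled, `from_pretrained` builds the model under `deepspeed.zero.Init`, which partitions each parameter as it is created. By the time `resize_pos` runs, the CLIP position-embedding weight has already been partitioned away on this rank, so `pos_tokens` has 0 elements and its last dimension is 0, and reshaping an empty tensor with a `-1` dimension is ambiguous. A minimal reproduction of that reshape failure (NumPy stands in for PyTorch here):

```python
import numpy as np

# Under ZeRO-3, the position-embedding weight seen inside resize_pos()
# is an empty placeholder: the real data lives sharded across ranks.
pos_tokens = np.zeros((0,))  # 0-element stand-in for the partitioned weight

orig_size = 24
try:
    # Mirrors the failing line in build_mlp.py: with 0 elements and a 0-sized
    # trailing dimension, the -1 dimension cannot be determined.
    pos_tokens.reshape(-1, orig_size, orig_size, 0)
except ValueError as exc:
    print(f"reshape failed: {exc}")
```

If that reading is right, one possible workaround is to gather the weight before touching its shape, e.g. wrapping the body of `resize_pos` in `deepspeed.zero.GatheredParameters(...)` (an existing DeepSpeed API, though I have not tested it against this model).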
YerongLi commented 6 days ago

I got a similar error. Did you manage to fix it?