OpenGVLab / InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model with performance approaching GPT-4o.
https://internvl.readthedocs.io/en/latest/
MIT License
6.06k stars, 471 forks

[Bug] Fine-tuning according to the documentation does not pass #455

Open a914356887 opened 3 months ago

a914356887 commented 3 months ago

Checklist

Describe the bug

Following the official documentation, I ran GPUS=4 PER_DEVICE_BATCH_SIZE=4 sh shell/internvl2.0/2nd_finetune/internvl2_2b_internlm2_1_8b_dynamic_res_2nd_finetune_lora.sh, and it failed with the ImportError shown in the traceback below. Documentation: https://internvl.readthedocs.io/en/latest/tutorials/coco_caption_finetune.html

Reproduction

GPUS=4 PER_DEVICE_BATCH_SIZE=4 sh shell/internvl2.0/2nd_finetune/internvl2_2b_internlm2_1_8b_dynamic_res_2nd_finetune_lora.sh

Environment

1. Python: 3.9.19
2. CUDA: 12.4 (cuda_12.4.r12.4/compiler.33961263_0)
3. Driver Version: 550.54.14
4. transformers: 4.37.2
5. torch: 2.4.0

Error traceback

+ [ ! -d work_dirs/internvl_chat_v2_0/internvl2_2b_internlm2_1_8b_dynamic_res_2nd_finetune_lora ]
+ torchrun --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --nproc_per_node=4 --master_port=34229 internvl/train/internvl_chat_finetune.py --model_name_or_path ./pretrained/InternVL2-2B --conv_style internlm2-chat --output_dir work_dirs/internvl_chat_v2_0/internvl2_2b_internlm2_1_8b_dynamic_res_2nd_finetune_lora --meta_path ./shell/data/internvl_1_2_finetune_custom.json --overwrite_output_dir True --force_image_size 448 --max_dynamic_patch 6 --down_sample_ratio 0.5 --drop_path_rate 0.0 --freeze_llm True --freeze_mlp True --freeze_backbone True --use_llm_lora 16 --vision_select_layer -1 --dataloader_num_workers 4 --bf16 True --num_train_epochs 1 --per_device_train_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 200 --save_total_limit 1 --learning_rate 4e-5 --weight_decay 0.01 --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --max_seq_length 4096 --do_train True --grad_checkpoint True --group_by_length True --dynamic_image_size True --use_thumbnail True --ps_version v2 --deepspeed zero_stage1_config.json --report_to tensorboard
+ tee -a work_dirs/internvl_chat_v2_0/internvl2_2b_internlm2_1_8b_dynamic_res_2nd_finetune_lora/training_log.txt
W0805 09:29:53.681633 123911748675072 torch/distributed/run.py:779] 
W0805 09:29:53.681633 123911748675072 torch/distributed/run.py:779] *****************************************
W0805 09:29:53.681633 123911748675072 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0805 09:29:53.681633 123911748675072 torch/distributed/run.py:779] *****************************************
[2024-08-05 09:29:57,255] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-05 09:29:57,257] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-05 09:29:57,257] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-05 09:29:57,257] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
Traceback (most recent call last):
  File "/root/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 18, in <module>
    from internvl.dist_utils import init_dist
  File "/root/InternVL/internvl_chat/internvl/dist_utils.py", line 6, in <module>
    import deepspeed
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/__init__.py", line 26, in <module>
    from . import module_inject
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/module_inject/__init__.py", line 6, in <module>
    from .replace_module import replace_transformer_layer, revert_transformer_layer, ReplaceWithTensorSlicing, GroupQuantizer, generic_injection
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 607, in <module>
    from ..pipe import PipelineModule
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/pipe/__init__.py", line 6, in <module>
    from ..runtime.pipe import PipelineModule, LayerSpec, TiedLayerSpec
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/__init__.py", line 6, in <module>
    from .module import PipelineModule, LayerSpec, TiedLayerSpec
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/module.py", line 19, in <module>
    from ..activation_checkpointing import checkpointing
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 26, in <module>
    from deepspeed.runtime.config import DeepSpeedConfig
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 42, in <module>
    from ..elasticity import (
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/elasticity/__init__.py", line 10, in <module>
    from .elastic_agent import DSElasticAgent
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/elasticity/elastic_agent.py", line 9, in <module>
    from torch.distributed.elastic.agent.server.api import log, _get_socket_with_port
ImportError: cannot import name 'log' from 'torch.distributed.elastic.agent.server.api' (/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py)
Traceback (most recent call last):
  File "/root/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 18, in <module>
    from internvl.dist_utils import init_dist
  File "/root/InternVL/internvl_chat/internvl/dist_utils.py", line 6, in <module>
    import deepspeed
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/__init__.py", line 26, in <module>
    from . import module_inject
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/module_inject/__init__.py", line 6, in <module>
    from .replace_module import replace_transformer_layer, revert_transformer_layer, ReplaceWithTensorSlicing, GroupQuantizer, generic_injection
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 607, in <module>
    from ..pipe import PipelineModule
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/pipe/__init__.py", line 6, in <module>
    from ..runtime.pipe import PipelineModule, LayerSpec, TiedLayerSpec
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/__init__.py", line 6, in <module>
    from .module import PipelineModule, LayerSpec, TiedLayerSpec
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/module.py", line 19, in <module>
    from ..activation_checkpointing import checkpointing
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 26, in <module>
    from deepspeed.runtime.config import DeepSpeedConfig
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 42, in <module>
    from ..elasticity import (
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/elasticity/__init__.py", line 10, in <module>
    from .elastic_agent import DSElasticAgent
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/elasticity/elastic_agent.py", line 9, in <module>
    from torch.distributed.elastic.agent.server.api import log, _get_socket_with_port
ImportError: cannot import name 'log' from 'torch.distributed.elastic.agent.server.api' (/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py)
Traceback (most recent call last):
  File "/root/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 18, in <module>
    from internvl.dist_utils import init_dist
  File "/root/InternVL/internvl_chat/internvl/dist_utils.py", line 6, in <module>
    import deepspeed
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/__init__.py", line 26, in <module>
    from . import module_inject
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/module_inject/__init__.py", line 6, in <module>
    from .replace_module import replace_transformer_layer, revert_transformer_layer, ReplaceWithTensorSlicing, GroupQuantizer, generic_injection
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 607, in <module>
    from ..pipe import PipelineModule
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/pipe/__init__.py", line 6, in <module>
    from ..runtime.pipe import PipelineModule, LayerSpec, TiedLayerSpec
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/__init__.py", line 6, in <module>
    from .module import PipelineModule, LayerSpec, TiedLayerSpec
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/module.py", line 19, in <module>
    from ..activation_checkpointing import checkpointing
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 26, in <module>
    from deepspeed.runtime.config import DeepSpeedConfig
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 42, in <module>
    from ..elasticity import (
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/elasticity/__init__.py", line 10, in <module>
    from .elastic_agent import DSElasticAgent
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/elasticity/elastic_agent.py", line 9, in <module>
    from torch.distributed.elastic.agent.server.api import log, _get_socket_with_port
ImportError: cannot import name 'log' from 'torch.distributed.elastic.agent.server.api' (/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py)
Traceback (most recent call last):
  File "/root/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 18, in <module>
    from internvl.dist_utils import init_dist
  File "/root/InternVL/internvl_chat/internvl/dist_utils.py", line 6, in <module>
    import deepspeed
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/__init__.py", line 26, in <module>
    from . import module_inject
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/module_inject/__init__.py", line 6, in <module>
    from .replace_module import replace_transformer_layer, revert_transformer_layer, ReplaceWithTensorSlicing, GroupQuantizer, generic_injection
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 607, in <module>
    from ..pipe import PipelineModule
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/pipe/__init__.py", line 6, in <module>
    from ..runtime.pipe import PipelineModule, LayerSpec, TiedLayerSpec
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/__init__.py", line 6, in <module>
    from .module import PipelineModule, LayerSpec, TiedLayerSpec
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/module.py", line 19, in <module>
    from ..activation_checkpointing import checkpointing
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 26, in <module>
    from deepspeed.runtime.config import DeepSpeedConfig
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 42, in <module>
    from ..elasticity import (
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/elasticity/__init__.py", line 10, in <module>
    from .elastic_agent import DSElasticAgent
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/elasticity/elastic_agent.py", line 9, in <module>
    from torch.distributed.elastic.agent.server.api import log, _get_socket_with_port
ImportError: cannot import name 'log' from 'torch.distributed.elastic.agent.server.api' (/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py)
W0805 09:29:58.163211 123911748675072 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3279 closing signal SIGTERM
W0805 09:29:58.163818 123911748675072 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3280 closing signal SIGTERM
W0805 09:29:58.164009 123911748675072 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3281 closing signal SIGTERM
E0805 09:29:58.243624 123911748675072 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3278) of binary: /root/miniconda3/envs/internvl/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/internvl/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-05_09:29:58
  host      : RTX3090-18700172
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3278)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

a914356887 commented 3 months ago

python_version.txt: all dependent package versions

LSKhappychild commented 3 months ago

Any progress on this? I'm facing the same issue

a914356887 commented 3 months ago

> Any progress on this? I'm facing the same issue

Can't solve it, I switched to Swift

Hoantrbl commented 2 months ago

I'm also facing this problem. Are there any alternative plans? Can you help me?

> Any progress on this? I'm facing the same issue

> Can't solve it, I switched to Swift

Hoantrbl commented 2 months ago

> I'm also facing this problem. Are there any alternative plans? Can you help me?

> Any progress on this? I'm facing the same issue

> Can't solve it, I switched to Swift

Problem solved! You need to install an earlier version of torch, such as 2.1.0, rather than the latest version. After installing that torch version, you also need to reinstall flash-attn.
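For anyone checking whether their environment is affected: the traceback above fails on `from torch.distributed.elastic.agent.server.api import log, _get_socket_with_port`. Below is a minimal sketch that probes for those two names without importing deepspeed. The helper name `module_exports` is hypothetical (it is not part of torch or DeepSpeed), and the script only reports what the installed torch exposes; it changes nothing.

```python
import importlib


def module_exports(module_name: str, attr: str) -> bool:
    """Return True if `module_name` can be imported and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module (or its parent package, e.g. torch) is not installed.
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    # Module path and names are copied from the failing import in the traceback.
    target = "torch.distributed.elastic.agent.server.api"
    for name in ("log", "_get_socket_with_port"):
        status = "found" if module_exports(target, name) else "missing"
        print(f"{target}.{name}: {status}")
```

If `log` is reported as missing, the installed torch probably has the same mismatch with DeepSpeed that this issue describes.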

vardaan123 commented 1 month ago

Facing the same issue. @Hoantrbl, could you tell me your torch, CUDA, and flash-attn versions? Thanks.

Hoantrbl commented 1 month ago

> Facing the same issue. @Hoantrbl, could you tell me your torch, CUDA, and flash-attn versions? Thanks.

torch 2.1.0+cu121
torchaudio 2.1.0+cu121
torchvision 0.16.0+cu121
flash_attn 2.6.3
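To compare a local environment against the versions listed above, something like the following sketch may help. It uses only the standard library (`importlib.metadata`). The version table is copied from this comment and is one reported working combination, not an official requirement. The function names are hypothetical, and the comparison ignores the local build tag (for example `+cu121`).

```python
from importlib import metadata

# Versions reported as working in this thread (not an official requirement).
REPORTED_WORKING = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "torchaudio": "2.1.0",
    "flash_attn": "2.6.3",
}


def base_version(version: str) -> str:
    """Drop a local build tag such as '+cu121' from a version string."""
    return version.split("+", 1)[0]


def compare_versions(installed: dict, expected: dict) -> dict:
    """Map each expected package to (installed_version_or_None, matches)."""
    report = {}
    for name, wanted in expected.items():
        found = installed.get(name)
        matches = found is not None and base_version(found) == wanted
        report[name] = (found, matches)
    return report


def installed_versions(names) -> dict:
    """Look up installed distribution versions; missing packages are omitted."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return versions


if __name__ == "__main__":
    found = installed_versions(REPORTED_WORKING)
    for name, (version, ok) in compare_versions(found, REPORTED_WORKING).items():
        shown = version or "not installed"
        print(f"{name}: installed={shown} reported={REPORTED_WORKING[name]} match={ok}")
```

A mismatch here does not by itself mean training will fail; it only shows where the environment differs from the combination reported in this thread.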