OpenBMB / MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone

Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu #246

Closed: proto1994 closed this issue 2 weeks ago

proto1994 commented 3 weeks ago

I am running this on a machine with a 32 GB V100 and get the error below. Any idea what is going wrong?

sh finetune_lora.sh
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  NVIDIA Inference is only supported on Ampere and newer architectures
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[2024-06-10 22:37:04,519] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-10 22:37:04,519] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-06-10 22:37:49,418] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 741, num_elems = 8.54B
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████| 7/7 [00:09<00:00,  1.37s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Currently using LoRA for fine-tuning the MiniCPM-V model.
{'Total': 8564355312, 'Trainable': 1059430640}
llm_type=llama3
Loading data...
Detected kernel version 4.19.118, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
Using /home/normalop/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/normalop/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.4184679985046387 seconds
Parameter Offload: Total persistent parameters: 706800 in 346 params
  0%|                                                                                                             | 0/1000 [00:00<?, ?it/s]/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
Traceback (most recent call last):
  File "/home/normalop/work/MiniCPM-V/finetune/finetune.py", line 328, in <module>
    train()
  File "/home/normalop/work/MiniCPM-V/finetune/finetune.py", line 318, in train
    trainer.train()
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/normalop/work/MiniCPM-V/finetune/trainer.py", line 220, in training_step
    self.accelerator.backward(loss)
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2117, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
    self.engine.step()
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2169, in step
    self._take_model_step(lr_kwargs)
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
    self.optimizer.step()
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2047, in step
    self.unscale_and_clip_grads(sub_group_id, scaled_global_grad_norm)
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2117, in unscale_and_clip_grads
    self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
  0%|                                                                                                             | 0/1000 [00:12<?, ?it/s]
[2024-06-10 22:39:04,779] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 22594) of binary: /home/normalop/work/MiniCPM-V/venv/bin/python
Traceback (most recent call last):
  File "/home/normalop/work/MiniCPM-V/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
 "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
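The traceback ends inside DeepSpeed's ZeRO stage-3 unscale_and_clip_grads, where the fp32 partitioned gradients are multiplied by 1 / combined_scale. With "offload_optimizer": {"device": "cpu"} as in the config above, those fp32 gradients live on the CPU, so a loss-scale tensor sitting on cuda:0 would produce exactly this kind of device mismatch. A quick way to record the package versions involved is sketched below; this check is a suggestion by the editor, not part of the original report:

# print the versions of the packages that appear in the failing optimizer step
python -c "import deepspeed, torch, transformers, accelerate; \
print('deepspeed   ', deepspeed.__version__); \
print('torch       ', torch.__version__); \
print('transformers', transformers.__version__); \
print('accelerate  ', accelerate.__version__)"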
And here is my finetune_lora.sh:

GPUS_PER_NODE=1
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

MODEL="openbmb/MiniCPM-Llama3-V-2_5" # or openbmb/MiniCPM-V-2
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="../data/trainging_data.json"
EVAL_DATA="../data/test_data.json"
LLM_TYPE="llama3" # if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS finetune.py  \
    --model_name_or_path $MODEL \
    --llm_type $LLM_TYPE \
    --data_path $DATA \
    --eval_data_path $EVAL_DATA \
    --remove_unused_columns false \
    --label_names "labels" \
    --prediction_loss_only false \
    --bf16 false \
    --bf16_full_eval false \
    --fp16 true \
    --fp16_full_eval true \
    --do_train \
    --do_eval \
    --tune_vision true \
    --tune_llm false \
    --use_lora true \
    --lora_target_modules "llm\..*layers\.\d+\.self_attn\.(q_proj|k_proj)" \
    --model_max_length 1024 \
    --max_slice_nums 9 \
    --max_steps 1000 \
    --eval_steps 1000 \
    --output_dir output/output_minicpmv2_lora \
    --logging_dir output/output_minicpmv2_lora \
    --logging_strategy "steps" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-6 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --gradient_checkpointing true \
    --deepspeed ds_config_zero3.json \
1SingleFeng commented 3 weeks ago

You can check your deepspeed version; this error is likely caused by the deepspeed version. Downgrading it to 0.14.0 resolves it.
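For reference, a downgrade along these lines should work inside the virtualenv shown in the traceback; the exact commands are a sketch, not taken from this thread:

# pin deepspeed to the version suggested above
pip install deepspeed==0.14.0
# confirm the version that finetune.py will pick up
python -c "import deepspeed; print(deepspeed.__version__)"

After reinstalling, rerun sh finetune_lora.sh with the same ds_config_zero3.json.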