OpenBMB / MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone

Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu #246

Closed: proto1994 closed this issue 2 weeks ago

proto1994 commented 3 weeks ago

I am running this on a machine with a 32 GB V100 and get the error below. Any idea what is going wrong?

sh finetune_lora.sh
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  NVIDIA Inference is only supported on Ampere and newer architectures
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[2024-06-10 22:37:04,519] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-10 22:37:04,519] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-06-10 22:37:49,418] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 741, num_elems = 8.54B
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████| 7/7 [00:09<00:00,  1.37s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Currently using LoRA for fine-tuning the MiniCPM-V model.
{'Total': 8564355312, 'Trainable': 1059430640}
llm_type=llama3
Loading data...
Detected kernel version 4.19.118, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
Using /home/normalop/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/normalop/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.4184679985046387 seconds
Parameter Offload: Total persistent parameters: 706800 in 346 params
  0%|                                                                                                             | 0/1000 [00:00<?, ?it/s]/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
Traceback (most recent call last):
  File "/home/normalop/work/MiniCPM-V/finetune/finetune.py", line 328, in <module>
    train()
  File "/home/normalop/work/MiniCPM-V/finetune/finetune.py", line 318, in train
    trainer.train()
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/normalop/work/MiniCPM-V/finetune/trainer.py", line 220, in training_step
    self.accelerator.backward(loss)
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2117, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
    self.engine.step()
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2169, in step
    self._take_model_step(lr_kwargs)
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
    self.optimizer.step()
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2047, in step
    self.unscale_and_clip_grads(sub_group_id, scaled_global_grad_norm)
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2117, in unscale_and_clip_grads
    self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
  0%|                                                                                                             | 0/1000 [00:12<?, ?it/s]
[2024-06-10 22:39:04,779] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 22594) of binary: /home/normalop/work/MiniCPM-V/venv/bin/python
Traceback (most recent call last):
  File "/home/normalop/work/MiniCPM-V/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/normalop/work/MiniCPM-V/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
 "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
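The traceback ends inside DeepSpeed's ZeRO stage-3 unscale_and_clip_grads, where the fp32 partitioned gradients are multiplied by 1 / combined_scale. With "offload_optimizer": {"device": "cpu"} as in the config above, those fp32 gradients live on the CPU, so a loss-scale tensor sitting on cuda:0 would produce exactly this kind of device mismatch. A quick way to record the package versions involved is sketched below; this check is a suggestion by the editor, not part of the original report:

# print the versions of the packages that appear in the failing optimizer step
python -c "import deepspeed, torch, transformers, accelerate; \
print('deepspeed   ', deepspeed.__version__); \
print('torch       ', torch.__version__); \
print('transformers', transformers.__version__); \
print('accelerate  ', accelerate.__version__)"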
And here is my finetune_lora.sh:

GPUS_PER_NODE=1
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

MODEL="openbmb/MiniCPM-Llama3-V-2_5" # or openbmb/MiniCPM-V-2
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="../data/trainging_data.json"
EVAL_DATA="../data/test_data.json"
LLM_TYPE="llama3" # if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS finetune.py  \
    --model_name_or_path $MODEL \
    --llm_type $LLM_TYPE \
    --data_path $DATA \
    --eval_data_path $EVAL_DATA \
    --remove_unused_columns false \
    --label_names "labels" \
    --prediction_loss_only false \
    --bf16 false \
    --bf16_full_eval false \
    --fp16 true \
    --fp16_full_eval true \
    --do_train \
    --do_eval \
    --tune_vision true \
    --tune_llm false \
    --use_lora true \
    --lora_target_modules "llm\..*layers\.\d+\.self_attn\.(q_proj|k_proj)" \
    --model_max_length 1024 \
    --max_slice_nums 9 \
    --max_steps 1000 \
    --eval_steps 1000 \
    --output_dir output/output_minicpmv2_lora \
    --logging_dir output/output_minicpmv2_lora \
    --logging_strategy "steps" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-6 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --gradient_checkpointing true \
    --deepspeed ds_config_zero3.json \
1SingleFeng commented 3 weeks ago

You can check your deepspeed version; this error is likely caused by the deepspeed version. Downgrading it to 0.14.0 resolves it.
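For reference, a downgrade along these lines should work inside the virtualenv shown in the traceback; the exact commands are a sketch, not taken from this thread:

# pin deepspeed to the version suggested above
pip install deepspeed==0.14.0
# confirm the version that finetune.py will pick up
python -c "import deepspeed; print(deepspeed.__version__)"

After reinstalling, rerun sh finetune_lora.sh with the same ds_config_zero3.json.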