QwenLM / Qwen-VL

The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.

[BUG] &lt;Full-parameter fine-tuning&gt; Training hangs after running finetune_ds.sh #288

Open Waxyoung opened 8 months ago

Waxyoung commented 8 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

After running finetune_ds.sh, training hangs at `mixed_x_layer = self.c_attn(hidden_states)` in the forward function of the QWenAttention class.

```
/usr/local/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
Using /root/.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
/root/.cache/torch_extensions/py38_cu121/fused_adam
Parameter Offload: Total persistent parameters: 1815808 in 491 params
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.6533713340759277 seconds
/usr/local/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
  0%|
```

The log shows that the training progress bar has already appeared. Further debugging shows the process is stuck at `mixed_x_layer = self.c_attn(hidden_states)` in the forward function of the QWenAttention class. This is a plain linear layer: execution enters nn.Linear and never proceeds to the next step.
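One detail worth keeping in mind when debugging this: the run uses DeepSpeed ZeRO-3 (`finetune/ds_config_zero3.json`), which shards every weight across ranks and all-gathers it on demand, so a process that looks stuck "inside nn.Linear" is often blocked in that collective communication rather than in the matmul itself. Below is a minimal sketch for locating the hang using Python's built-in `faulthandler`; the SIGUSR1 choice and the placement in finetune.py are illustrative, not part of the original report. Running `py-spy dump --pid <PID>` against each stuck rank gives a similar per-rank stack without any code change.

```python
# A hang "inside nn.Linear" under ZeRO-3 is usually a stalled collective:
# the forward pass first all-gathers the layer's sharded weights, and if
# one rank is dead or blocked, every other rank waits there indefinitely.
import faulthandler
import signal

# Register once near the top of finetune.py (placement is illustrative).
# Sending SIGUSR1 to a stuck rank (kill -USR1 <pid>) prints every thread's
# stack trace to stderr without terminating the run.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Fallback: if the process is still running after 30 minutes, dump all
# stacks automatically, which catches hangs nobody is watching for.
faulthandler.dump_traceback_later(timeout=1800, repeat=False, exit=False)
```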

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS: Red Hat 7
- Python: 3.8.8
- Transformers: 4.31.0
- PyTorch: 2.1.2+cu121
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1

Anything else?

The finetune_ds.sh script:

```bash
#!/bin/bash
# -*- coding: utf-8 -*-
export NCCL_DEBUG=INFO
# export NCCL_P2P_DISABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
DIR=`pwd`

GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6011

MODEL="models/Qwen-VL/qwen/Qwen-VL" #"Qwen/Qwen-VL-Chat"/"Qwen/Qwen-VL" # Set the path if you do not want to load from huggingface directly
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="Qwen-VL/assets/train_json/temp.json"

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --bf16 True \
    --fix_vit True \
    --output_dir output_qwen \
    --num_train_epochs 5 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed finetune/ds_config_zero3.json
```
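Since the launch goes through torchrun on 8 GPUs with ZeRO-3, one way to separate a communication problem from a Qwen-VL problem is to run a bare NCCL collective under the same launcher. The sketch below is illustrative (the file name smoke_test.py is hypothetical); if it also hangs, the culprit is the NCCL/driver/P2P setup rather than the training code, and uncommenting the `export NCCL_P2P_DISABLE=1` line already present in the script above is worth trying.

```python
# smoke_test.py -- minimal NCCL sanity check, launched the same way as
# finetune.py:  torchrun --nproc_per_node 8 smoke_test.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK; init_process_group reads them.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # One all_reduce exercises the same NCCL path that ZeRO-3 uses when it
    # gathers sharded weights for a forward pass such as c_attn's nn.Linear.
    t = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(t)  # default op: SUM over all ranks

    print(f"rank {dist.get_rank()}: all_reduce ok, sum of ranks = {t.item():.0f}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```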


decreasbetter commented 7 months ago

Hello, I've run into the same problem. Have you managed to solve it?

micsama commented 6 months ago

How much GPU memory does full-parameter fine-tuning require?

yihp commented 6 months ago

@micsama @Waxyoung @decreasbetter @hzhwcmhf Could anyone say what resources full-parameter fine-tuning requires?
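For a rough sense of scale, here is a back-of-envelope estimate only, assuming Qwen-VL's roughly 9.6B parameters and standard mixed-precision Adam (bf16 weights and gradients plus fp32 master weights and two fp32 Adam moments, about 16 bytes per parameter):

```python
# Back-of-envelope only: real usage adds activations, gradient-checkpointing
# buffers, and DeepSpeed communication buckets on top of these states.
params = 9.6e9                        # approximate Qwen-VL parameter count
bytes_per_param = 2 + 2 + 4 + 4 + 4   # bf16 weights + bf16 grads + fp32 master + Adam m, v
total_gib = params * bytes_per_param / 1024**3
print(f"model + optimizer states: ~{total_gib:.0f} GiB in total")
print(f"sharded by ZeRO-3 over 8 GPUs: ~{total_gib / 8:.0f} GiB per GPU")
```

That is roughly 143 GiB of states in total, or about 18 GiB per GPU on the 8-GPU setup in the script, before activations are counted.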

HarrytheOrange commented 5 months ago

Ran into the same problem. Could it be related to the Linux kernel version? I saw this warning:

```
warnings.warn( Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
/mnt/cache/huangzhiyuan/env/seeclick/lib/python3.11/site-packages/accelerate/accelerator.py:436: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches']). Please pass an accelerate.DataLoaderConfiguration instead: dataloader_config = DataLoaderConfiguration(dispatch_batches=None)
```

YeTianJHU commented 5 months ago

Same issue here.

RomanticQq commented 4 months ago

In my experience training Qwen-VL, a hang right at the start of training is usually caused by the dataset. Check your JSON file, or try again with just a few entries taken from it.
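Following up on that suggestion, below is a minimal sketch for sanity-checking the training JSON and cutting a small subset to retrain on. The expected schema (a top-level list of records, each with a `conversations` list of `from`/`value` turns) is the one described in the fine-tuning section of the README; the output path is hypothetical.

```python
# Sanity-check the training data and write out a handful of records,
# following the suggestion above. Input path is the one from finetune_ds.sh.
import json

SRC = "Qwen-VL/assets/train_json/temp.json"
DST = "Qwen-VL/assets/train_json/temp_small.json"  # hypothetical output path

with open(SRC, encoding="utf-8") as f:
    data = json.load(f)

assert isinstance(data, list), "top level must be a list of conversation records"
for i, rec in enumerate(data):
    assert isinstance(rec.get("conversations"), list), f"record {i} has no conversations list"
    for turn in rec["conversations"]:
        assert "from" in turn and "value" in turn, f"record {i} has a malformed turn"

# A few records are enough to rule the data in or out as the cause of the hang.
with open(DST, "w", encoding="utf-8") as f:
    json.dump(data[:4], f, ensure_ascii=False, indent=2)

print(f"{len(data)} records checked; wrote {min(len(data), 4)} to {DST}")
```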