Open Waxyoung opened 8 months ago
Hello, I ran into the same problem. Has it been solved?
How much GPU memory does full-parameter fine-tuning need?
@micsama @Waxyoung @decreasbetter @hzhwcmhf May I ask what resources full-parameter fine-tuning requires?
I hit the same problem. Could it be related to the Linux kernel version?
I see this warning:

```
warnings.warn(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
/mnt/cache/huangzhiyuan/env/seeclick/lib/python3.11/site-packages/accelerate/accelerator.py:436: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None)
```
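The warning only flags a risk, so it helps to first confirm which kernel you are actually running. A minimal sketch for checking the version against the 5.5.0 minimum mentioned in the warning:

```shell
# Compare the running kernel version against the recommended minimum (5.5.0).
kernel="$(uname -r)"            # e.g. "4.18.0-305.el8.x86_64"
major="${kernel%%.*}"           # text before the first dot
rest="${kernel#*.}"
minor="${rest%%.*}"             # text between the first and second dot

if [ "$major" -lt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -lt 5 ]; }; then
    echo "Kernel $kernel is below 5.5.0 -- the hang warning applies"
else
    echo "Kernel $kernel meets the recommended minimum"
fi
```

Note this check only tells you whether the warning applies; upgrading the kernel may not be possible on a managed cluster, and the hang reported in this thread was ultimately attributed to the dataset, not the kernel.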
Same issue here.
In my experience training qwen-vl, hanging right when training starts is caused by a problem in the dataset. Check your JSON file, or try extracting just a few samples from the JSON file and training on those.
Is there an existing issue / discussion for this?
Is there an existing answer for this in the FAQ?
Current Behavior
After running finetune_ds.sh, the process hangs at `mixed_x_layer = self.c_attn(hidden_states)` in the `forward` function of the `QWenAttention` class. Log output:

```
/usr/local/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
Using /root/.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
/root/.cache/torch_extensions/py38_cu121/fused_adam
Parameter Offload: Total persistent parameters: 1815808 in 491 params
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.6533713340759277 seconds
/usr/local/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
  0%|
```
The log shows the training progress bar has already appeared. Further debugging shows execution is stuck at `mixed_x_layer = self.c_attn(hidden_states)` in `QWenAttention.forward`; that call is just a linear layer, and the process never gets past the `nn.Linear` call.
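To confirm where a hung process is stuck without attaching a debugger, Python's standard `faulthandler` module can dump every thread's stack on demand. This is a generic sketch, not part of the Qwen-VL code; add it near the top of the training entry point (e.g. finetune.py):

```python
import faulthandler
import signal
import sys

# After this call, sending SIGUSR1 to the (possibly hung) training
# process, e.g. `kill -USR1 <pid>`, prints the Python stack of every
# thread to stderr without interrupting the process.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)
```

If every dump shows the same `c_attn` frame, the hang is inside the underlying kernel call rather than Python-level code; under DeepSpeed ZeRO-3 this often points to a stalled collective (e.g. the parameter gather before the matmul), which is consistent with the dataset-mismatch explanation above, since one rank waiting on bad data can deadlock the others.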
Expected Behavior
No response
Steps To Reproduce
No response
Environment
Anything else?
finetune_ds.sh script:
No response