Closed thunder95 closed 2 months ago
I haven't run into this issue myself. Could you post a screenshot of your pipe-layer distribution?
Thanks for your help. The code is unchanged. It raised a warning:

/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
The pipe layers haven't been modified either.
[2024-09-10 09:18:35,522] [INFO] [module.py:396:_partition_layers] Partitioning pipeline stages with method uniform
stage=0 layers=25
     0: TokenizerPipeLayer
     1: IndentityPipeLayer
     2: IndentityPipeLayer
     3: IndentityPipeLayer
     4: IndentityPipeLayer
     5: QwenBlockPipeLayer
     6: QwenBlockPipeLayer
     7: QwenBlockPipeLayer
     8: QwenBlockPipeLayer
     9: QwenBlockPipeLayer
    10: QwenBlockPipeLayer
    11: QwenBlockPipeLayer
    12: QwenBlockPipeLayer
    13: QwenBlockPipeLayer
    14: QwenBlockPipeLayer
    15: QwenBlockPipeLayer
    16: QwenBlockPipeLayer
    17: QwenBlockPipeLayer
    18: QwenBlockPipeLayer
    19: QwenBlockPipeLayer
    20: QwenBlockPipeLayer
    21: QwenBlockPipeLayer
    22: QwenBlockPipeLayer
    23: QwenBlockPipeLayer
    24: QwenBlockPipeLayer
stage=1 layers=24
    25: QwenBlockPipeLayer
    26: QwenBlockPipeLayer
    27: QwenBlockPipeLayer
    28: QwenBlockPipeLayer
    29: QwenBlockPipeLayer
    30: QwenBlockPipeLayer
    31: QwenBlockPipeLayer
    32: QwenBlockPipeLayer
    33: QwenBlockPipeLayer
    34: QwenBlockPipeLayer
    35: QwenBlockPipeLayer
    36: QwenBlockPipeLayer
    37: QwenBlockPipeLayer
    38: QwenBlockPipeLayer
    39: QwenBlockPipeLayer
    40: QwenBlockPipeLayer
    41: QwenBlockPipeLayer
    42: QwenBlockPipeLayer
    43: QwenBlockPipeLayer
    44: QwenBlockPipeLayer
    45: FLNPipeLayer
    46: LMPipeLayer
    47: LossPipeLayer
    48: IndentityPipeLayerLast
GPU1 Trainable Params: 1000000
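For context, that UserWarning comes from reentrant activation checkpointing when none of the tensors passed into `torch.utils.checkpoint.checkpoint` require grad: the checkpointed segment then produces no gradients for its inputs. A minimal sketch (names here are illustrative, not from the repo; assumes a recent torch with the `use_reentrant` keyword):

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    # stand-in for a checkpointed transformer block
    return x * 2

w = torch.ones(3, requires_grad=True)  # some parameter outside the checkpoint
x = torch.ones(3)                      # requires_grad is False by default

# Emits: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
y = checkpoint(block, x, use_reentrant=True) * w
y.sum().backward()

print(x.grad)  # None -- the checkpointed input never receives a gradient
print(w.grad)  # tensor([2., 2., 2.]) -- grads still flow outside the checkpoint
```

This is exactly the situation that later trips DeepSpeed's pipeline engine: an activation buffer with `grad is None` cannot be sent backward to the previous stage.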
That looks fine to me. Maybe check the versions of the key libraries (torch, transformers, accelerate, deepspeed, etc.)?
The conda environment was recreated from the requirements. It's probably a deepspeed version difference, but I tried both the latest release and deepspeed==0.13.5 from requirements.txt and neither works. The actual failure is that attention_mask has no gradient; deepspeed handles this case specially (assuming the attention mask is the last input):
# Drop the attention mask from the input buffer here. It does not have
# a grad that needs to be communicated. We free the buffer immediately
# after, so no need to restore it. The receiver also has a hack that skips
# the recv. This is because NCCL does not let us send torch.BoolTensor :-(.
if self.has_attention_mask or self.has_bool_tensors:
    inputs = list(inputs)
    inputs.pop()
    inputs = tuple(inputs)
Which deepspeed version were you using when you ran this?
I've also run into the attention_mask gradient problem before. The code above already sets attention_mask's requires_grad to True, and I didn't hit related problems after that. You could debug to see whether the issue is somewhere else.
As for the deepspeed version: deepspeed==0.13.5
DeepSpeed's handling of this part has some problems. I commented out these two lines, and then filtered by requires_grad when DeepSpeed passes gradients between stages, and it worked. @Coobiw
rotary_pos_emb_list.requires_grad_(True)
attention_mask.requires_grad_(True)
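The filtering idea described above can be sketched roughly as follows. This is not the actual DeepSpeed patch, just a minimal illustration with hypothetical names: when collecting gradients to send to the previous pipeline stage, skip any buffer that never participated in autograd (e.g. a bool attention mask or position-embedding inputs) instead of asserting that every buffer has a `.grad`.

```python
import torch

def collect_sendable_grads(inputs):
    """Return grads only for input buffers that participate in autograd.

    `inputs` mimics the tuple of activation buffers a pipeline stage
    received; the function name is illustrative, not a DeepSpeed internal.
    """
    grads = []
    for buf in inputs:
        if torch.is_tensor(buf) and buf.requires_grad:
            # for these, backward() should have populated .grad
            assert buf.grad is not None, "run backward() before sending grads"
            grads.append(buf.grad)
        # non-grad buffers (bool masks, rotary embeddings, etc.) are skipped
    return grads

# toy demonstration
x = torch.randn(2, 3, requires_grad=True)
mask = torch.ones(2, 3, dtype=torch.bool)  # bool tensors cannot require grad
(x * 2).sum().backward()
print(len(collect_sendable_grads((x, mask))))  # 1 -- only x's grad is sent
```

The upstream code instead drops the mask from the buffer before sending (the snippet quoted earlier), which only works if the mask is the last element of the tuple.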
Ran the two-GPU training command:
python -m torch.distributed.run --nproc_per_node=2 train_pipeline.py --cfg-path lavis/projects/pp_qwen14b/train_pp.yaml --num-stages 2
File "/data/workspace/MPP-LLaVA/train_pipeline.py", line 228, in main
    loss = engine.train_batch(data_iter=train_iter)
File "/data/hulei/miniconda3/envs/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 388, in train_batch
    self._exec_schedule(sched)
File "/data/hulei/miniconda3/envs/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 1422, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
File "/data/hulei/miniconda3/envs/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 1102, in _exec_send_grads
    assert buffer.grad is not None
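For anyone debugging this: the assert fires because a tensor that never required grad still has `.grad is None` after backward. A tiny illustration (variable names are just for demonstration):

```python
import torch

x = torch.randn(4, requires_grad=True)
mask = torch.ones(4)            # float tensor, but requires_grad=False
loss = (x * mask).sum()
loss.backward()

print(x.grad is None)     # False -- x received a gradient
print(mask.grad is None)  # True  -- this is what trips
                          # `assert buffer.grad is not None` in
                          # _exec_send_grads when the mask isn't dropped
```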