Closed thunder95 closed 2 months ago
I haven't run into this issue myself. Could you post a screenshot of your pipe-layer distribution?
Thanks for your help. The code is unchanged. It raised a warning:

/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
The pipe layers haven't been modified either.
[2024-09-10 09:18:35,522] [INFO] [module.py:396:_partition_layers] Partitioning pipeline stages with method uniform
stage=0 layers=25
     0: TokenizerPipeLayer
     1: IndentityPipeLayer
     2: IndentityPipeLayer
     3: IndentityPipeLayer
     4: IndentityPipeLayer
     5: QwenBlockPipeLayer
     6: QwenBlockPipeLayer
     7: QwenBlockPipeLayer
     8: QwenBlockPipeLayer
     9: QwenBlockPipeLayer
    10: QwenBlockPipeLayer
    11: QwenBlockPipeLayer
    12: QwenBlockPipeLayer
    13: QwenBlockPipeLayer
    14: QwenBlockPipeLayer
    15: QwenBlockPipeLayer
    16: QwenBlockPipeLayer
    17: QwenBlockPipeLayer
    18: QwenBlockPipeLayer
    19: QwenBlockPipeLayer
    20: QwenBlockPipeLayer
    21: QwenBlockPipeLayer
    22: QwenBlockPipeLayer
    23: QwenBlockPipeLayer
    24: QwenBlockPipeLayer
stage=1 layers=24
    25: QwenBlockPipeLayer
    26: QwenBlockPipeLayer
    27: QwenBlockPipeLayer
    28: QwenBlockPipeLayer
    29: QwenBlockPipeLayer
    30: QwenBlockPipeLayer
    31: QwenBlockPipeLayer
    32: QwenBlockPipeLayer
    33: QwenBlockPipeLayer
    34: QwenBlockPipeLayer
    35: QwenBlockPipeLayer
    36: QwenBlockPipeLayer
    37: QwenBlockPipeLayer
    38: QwenBlockPipeLayer
    39: QwenBlockPipeLayer
    40: QwenBlockPipeLayer
    41: QwenBlockPipeLayer
    42: QwenBlockPipeLayer
    43: QwenBlockPipeLayer
    44: QwenBlockPipeLayer
    45: FLNPipeLayer
    46: LMPipeLayer
    47: LossPipeLayer
    48: IndentityPipeLayerLast
GPU1 Trainable Params: 1000000
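For context, that UserWarning comes from reentrant activation checkpointing when none of the tensors passed into `torch.utils.checkpoint.checkpoint` require grad: the checkpointed segment then produces no gradients for its inputs. A minimal sketch (names here are illustrative, not from the repo; assumes a recent torch with the `use_reentrant` keyword):

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    # stand-in for a checkpointed transformer block
    return x * 2

w = torch.ones(3, requires_grad=True)  # some parameter outside the checkpoint
x = torch.ones(3)                      # requires_grad is False by default

# Emits: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
y = checkpoint(block, x, use_reentrant=True) * w
y.sum().backward()

print(x.grad)  # None -- the checkpointed input never receives a gradient
print(w.grad)  # tensor([2., 2., 2.]) -- grads still flow outside the checkpoint
```

This is exactly the situation that later trips DeepSpeed's pipeline engine: an activation buffer with `grad is None` cannot be sent backward to the previous stage.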
That looks fine to me. Maybe check the versions of the key libraries (torch, transformers, accelerate, deepspeed, etc.)?
The conda environment was recreated from the requirements. It's probably a deepspeed version difference, but I tried both the latest release and deepspeed==0.13.5 from requirements.txt and neither works. The actual failure is that attention_mask has no gradient; deepspeed handles this case specially (assuming the attention mask is the last input):
# Drop the attention mask from the input buffer here. It does not have
# a grad that needs to be communicated. We free the buffer immediately
# after, so no need to restore it. The receiver also has a hack that skips
# the recv. This is because NCCL does not let us send torch.BoolTensor :-(.
if self.has_attention_mask or self.has_bool_tensors:
    inputs = list(inputs)
    inputs.pop()
    inputs = tuple(inputs)
Which deepspeed version were you using when you ran this?
I've also run into the attention_mask gradient problem before. The code above already sets attention_mask's requires_grad to True, and I didn't hit related problems after that. You could debug to see whether the issue is somewhere else.
As for the deepspeed version: deepspeed==0.13.5
DeepSpeed's handling of this part has some problems. I commented out these two lines, and then filtered by requires_grad when DeepSpeed passes gradients between stages, and it worked. @Coobiw
rotary_pos_emb_list.requires_grad_(True)
attention_mask.requires_grad_(True)
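The filtering idea described above can be sketched roughly as follows. This is not the actual DeepSpeed patch, just a minimal illustration with hypothetical names: when collecting gradients to send to the previous pipeline stage, skip any buffer that never participated in autograd (e.g. a bool attention mask or position-embedding inputs) instead of asserting that every buffer has a `.grad`.

```python
import torch

def collect_sendable_grads(inputs):
    """Return grads only for input buffers that participate in autograd.

    `inputs` mimics the tuple of activation buffers a pipeline stage
    received; the function name is illustrative, not a DeepSpeed internal.
    """
    grads = []
    for buf in inputs:
        if torch.is_tensor(buf) and buf.requires_grad:
            # for these, backward() should have populated .grad
            assert buf.grad is not None, "run backward() before sending grads"
            grads.append(buf.grad)
        # non-grad buffers (bool masks, rotary embeddings, etc.) are skipped
    return grads

# toy demonstration
x = torch.randn(2, 3, requires_grad=True)
mask = torch.ones(2, 3, dtype=torch.bool)  # bool tensors cannot require grad
(x * 2).sum().backward()
print(len(collect_sendable_grads((x, mask))))  # 1 -- only x's grad is sent
```

The upstream code instead drops the mask from the buffer before sending (the snippet quoted earlier), which only works if the mask is the last element of the tuple.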
Ran the two-GPU training command:
python -m torch.distributed.run --nproc_per_node=2 train_pipeline.py --cfg-path lavis/projects/pp_qwen14b/train_pp.yaml --num-stages 2
File "/data/workspace/MPP-LLaVA/train_pipeline.py", line 228, in main
    loss = engine.train_batch(data_iter=train_iter)
File "/data/hulei/miniconda3/envs/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 388, in train_batch
    self._exec_schedule(sched)
File "/data/hulei/miniconda3/envs/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 1422, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
File "/data/hulei/miniconda3/envs/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 1102, in _exec_send_grads
    assert buffer.grad is not None
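For anyone debugging this: the assert fires because a tensor that never required grad still has `.grad is None` after backward. A tiny illustration (variable names are just for demonstration):

```python
import torch

x = torch.randn(4, requires_grad=True)
mask = torch.ones(4)            # float tensor, but requires_grad=False
loss = (x * mask).sum()
loss.backward()

print(x.grad is None)     # False -- x received a gradient
print(mask.grad is None)  # True  -- this is what trips
                          # `assert buffer.grad is not None` in
                          # _exec_send_grads when the mask isn't dropped
```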