aws-neuron / neuronx-distributed


Error: "Backward sending grads, but get None" #19

Closed wfckl789 closed 1 month ago

wfckl789 commented 2 months ago

Hi, I'm encountering the error `Backward sending grads, but get None`, raised by `_bwd_postprocess_task()` during model training. It seems that the tensor loses its `requires_grad` property after passing through this line in `src/neuronx_distributed/pipeline/comm.py`: `tensor_recv_next = xm.all_reduce(xm.REDUCE_SUM, tensor_recv_next, groups=groups)`.
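A minimal sketch of what I think is happening (my own simplified illustration, not the actual pipeline code; the shapes and `groups=None` are placeholders):

```python
# Simplified repro sketch (placeholder shapes; groups=None for brevity).
# The input requires gradients, but on this torch-xla version the
# all-reduce output appears to come back with requires_grad unset.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
tensor_recv_next = torch.ones(2, 2, device=device, requires_grad=True)

reduced = xm.all_reduce(xm.REDUCE_SUM, tensor_recv_next, groups=None)
print(tensor_recv_next.requires_grad)  # True
print(reduced.requires_grad)           # False here, which later triggers the error
```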

The same error occurs when I run the Training Llama-2-13B/70B with Tensor Parallelism and Pipeline Parallelism (neuronx-distributed) demo from the Neuron documentation.

Here is the log with compiler info: simple.log

```
2024-04-01 06:59:57.748428: W torch_xla/csrc/lowering_context.cpp:71] No custom opname metadata! op_type=xla___op_TransferWithStaticRingTransfer
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 982, in _exec_schedule
    self._exec_instr()
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 920, in _bwd_postprocess_task
    raise RuntimeError(rmsg("Backward sending grads, but get None"))
RuntimeError: [rank_8_pp1_tp0_dp0] Backward sending grads, but get None
Traceback (most recent call last):
  File "run_simple_model_nxd.py", line 289, in <module>
    _mp_fn(0, args)
  File "run_simple_model_nxd.py", line 225, in _mp_fn
    train_simple_model(args)
  File "run_simple_model_nxd.py", line 188, in train_simple_model
    loss = model.run_train(
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/trainer/model.py", line 25, in run_train
    return self.module.run_train(*args, **kwargs)
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 542, in run_train
    loss = self._run_train(**kwargs)
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 561, in _run_train
    self._exec_schedule(self.train_scheduler)
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 982, in _exec_schedule
    self._exec_instr()
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 920, in _bwd_postprocess_task
    raise RuntimeError(rmsg("Backward sending grads, but get None"))
RuntimeError: [rank_24_pp3_tp0_dp0] Backward sending grads, but get None
```

Package versions: see the attached screenshots.

Other system details: instance: Trn1, OS: Ubuntu 20.04

If you need any other information, please let me know. Thanks.

jluntamazon commented 2 months ago

Hi @wfckl789, thank you for raising the issue.

The issue is caused by a bug in torch-xla where the `requires_grad` property is not retained after an all-reduce operation. The fix has already landed in the torch-xla codebase, but it is not included in the torch-xla==2.1 release. As a temporary workaround, we have added a patch in Neuronx-Distributed for this particular case, and it should be available with the Neuron 2.18 release.
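For reference, the workaround boils down to re-marking the received tensor as requiring gradients after the all-reduce. A rough sketch of that pattern (illustrative only; the helper name is hypothetical and the actual patch may differ):

```python
# Illustrative workaround sketch (hypothetical helper, not the actual patch):
# restore requires_grad on the all-reduce output, since the affected
# torch-xla release returns it detached.
import torch_xla.core.xla_model as xm

def _all_reduce_preserving_grad(tensor, groups=None):
    needed_grad = tensor.requires_grad
    reduced = xm.all_reduce(xm.REDUCE_SUM, tensor, groups=groups)
    if needed_grad and not reduced.requires_grad:
        # The output came back detached (a leaf tensor), so it is safe to
        # re-mark it as requiring grad for the backward pass.
        reduced.requires_grad_(True)
    return reduced
```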

Upgrading to Neuronx-Distributed==0.7 should resolve this issue.
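For example (assuming the standard Neuron pip repository; check the Neuron 2.18 release notes for the exact version pins):

```sh
pip install --upgrade neuronx-distributed \
    --extra-index-url https://pip.repos.neuron.amazonaws.com
```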