Hi @wfckl789, thank you for raising the issue.

The issue is caused by a bug in torch-xla where the gradient (`requires_grad`) property is not retained after an all-reduce operation. The fix has already landed in the torch-xla code, but it is not part of the torch-xla==2.1 release. As a temporary workaround, we have added a patch in Neuronx-Distributed for this particular case; it should be available with the 2.18 release. If you upgrade to Neuronx-Distributed==0.7, this issue should be resolved.
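For illustration, the workaround amounts to restoring the flag after the collective. Below is a minimal sketch, not the actual Neuronx-Distributed patch; the wrapper name `all_reduce_keep_grad` is hypothetical:

```python
import torch_xla.core.xla_model as xm

def all_reduce_keep_grad(reduce_type, tensor, groups=None):
    """Hypothetical wrapper: run xm.all_reduce and restore the
    requires_grad flag that torch-xla==2.1 drops on the output."""
    requires_grad = tensor.requires_grad
    reduced = xm.all_reduce(reduce_type, tensor, groups=groups)
    if requires_grad and not reduced.requires_grad:
        # The buggy output is detached from the graph, so it is a leaf
        # tensor and the flag can be set back in place.
        reduced.requires_grad_(True)
    return reduced
```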
Hi, I'm encountering an error

`Backward sending grads, but get None`

raised by the `bwd_postprocess_task()` during model training. It seems that the tensor loses its `requires_grad` property after passing through this line in `src/neuronx_distributed/pipeline/comm.py`:

`tensor_recv_next = xm.all_reduce(xm.REDUCE_SUM, tensor_recv_next, groups=groups)`

This error also occurs when I run the demo "Training Llama-2-13B/70B with Tensor Parallelism and Pipeline Parallelism (neuronx-distributed)" from the Neuron documentation.
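For reference, a minimal standalone sketch of the pattern where the property is lost (assumed shapes; in the real code `tensor_recv_next` and `groups` come from the surrounding pipeline logic):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# Tensor received from the next pipeline stage; gradients are
# expected to flow back through it.
tensor_recv_next = torch.zeros(4, 4, device=device, requires_grad=True)

# In torch-xla==2.1 the all-reduce output does not carry over the
# requires_grad flag from the input, so downstream code that expects
# gradients sees None.
tensor_recv_next = xm.all_reduce(xm.REDUCE_SUM, tensor_recv_next)
print(tensor_recv_next.requires_grad)  # False on the affected release
```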
This is the log and compiler info: simple.log
Package version:
![image](https://github.com/aws-neuron/neuronx-distributed/assets/42508752/750bebeb-1d7e-44cc-83d9-d48f506d9365)
Other system details:
- Instance: Trn1
- OS: Ubuntu 20.04
If you need any other information, please let me know. Thanks.