wenboqian opened this issue 7 months ago
Thank you for reporting the issue. The MPMD detected error means that one worker is trying to load a new graph while another worker is waiting to execute, so the graphs that contain collectives are unable to communicate with each other. The assumption here is that every worker performs the same set of operations, so all workers stay in SPMD mode; when that assumption is broken you see the above error.
We are looking into the scripts now to identify which worker is producing a new graph (and why), and whether we can modify the scripts so that graphs with collectives do not change from one iteration to the next. Will report back once we have an update.
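To make the SPMD assumption concrete, here is a minimal, hypothetical torch_xla sketch (not taken from the attached scripts) of the kind of rank-dependent control flow that breaks it: worker 0 traces a graph that the other workers never build, so the workers end up trying to run different graphs around the same collective.

```python
# Hypothetical illustration only -- not the reporter's scripts.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
t = torch.ones(1, 4, device=device)

# SPMD-safe: every worker records the identical all-reduce before mark_step().
xm.all_reduce(xm.REDUCE_SUM, [t])
xm.mark_step()

# SPMD-breaking: rank-dependent control flow means worker 0 traces an extra
# multiply, so its graph no longer matches the graph its peers are executing,
# and the collective inside it can no longer pair up across workers.
if xm.get_ordinal() == 0:
    t = t * 2  # only worker 0 adds this node to its graph
xm.all_reduce(xm.REDUCE_SUM, [t])
xm.mark_step()
```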
Thanks for your detailed explanation! I'd also like to know: will this error interrupt the process of generating the HLO graphs? My assumption is that I can't get all of the HLO graphs because the graph-generation process will be shut down by this error. Is that assumption right?
Only one graph is executed at a time, so since the run errored out at this graph, you can only generate the graphs up to this point. If you want to generate all the graphs without worrying about execution, you can run with neuron_parallel_compile. The utility should help extract all the HLOs, compile them, and save them in the cache.
We have managed to reproduce the issue. It happens only with eager mode. There seems to be a bug which causes two collectives with different replica groups to end up in the same graph, such that it now has to communicate with two graphs at the same time. If you look at one of the graphs, it contains the following collectives:
%all-reduce = (bf16[1,4]{1,0}, bf16[]) all-reduce(bf16[1,4]{1,0} %p1, bf16[] %convert), replica_groups={{0,8},{16,24},{1,9},{17,25},{2,10},{18,26},{3,11},{19,27},{4,12},{20,28},{5,13},{21,29},{6,14},{22,30},{7,15},{23,31}}, constrain_layout=true, to_apply=%AddComputation.7, metadata={op_type="xla__cross_replica_sum" op_name="NxDModel[model]/NxDPPModel[module]/xla__cross_replica_sum" source_file="/home/ubuntu/aws_neuron_venv/lib/python3.8/site-packages/torch_xla/core/xla_model.py" source_line=590}
...
%all-reduce.1 = (bf16[1,4]{1,0}, bf16[]) all-reduce(bf16[1,4]{1,0} %p2, bf16[] %convert.2), replica_groups={{8,16},{24,0},{9,17},{25,1},{10,18},{26,2},{11,19},{27,3},{12,20},{28,4},{13,21},{29,5},{14,22},{30,6},{15,23},{31,7}}, constrain_layout=true, to_apply=%AddComputation.20, metadata={op_type="xla__cross_replica_sum" op_name="NxDModel[model]/NxDPPModel[module]/xla__cross_replica_sum" source_file="/home/ubuntu/aws_neuron_venv/lib/python3.8/site-packages/torch_xla/core/xla_model.py" source_line=590}
As you can see, it is trying to send and receive a tensor at the same time; ideally these should be in two separate graphs (see the sketch below for the general idea). We will look into this issue and update this ticket when we have a fix. Note: eager debug mode is meant mainly for single-worker workloads, and distributed workloads with eager mode are not supported yet. If the intention is mainly to debug the script, you can make use of this guide: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programm[…]training/pytorch-neuron-debug.html?highlight=print
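In case it helps to picture the intended behavior, here is a minimal, hypothetical torch_xla sketch (not the NxDPPModel internals; the replica groups are truncated placeholders) of how an explicit graph cut keeps two collectives with different replica groups in separate graphs:

```python
# Hypothetical illustration only -- not the NxD pipeline implementation.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
t = torch.ones(1, 4, device=device)

# Placeholder replica groups, truncated for brevity; a real 32-worker run
# would list every ordinal, as in the HLO dump above.
send_groups = [[0, 8], [16, 24]]
recv_groups = [[8, 16], [24, 0]]

xm.all_reduce(xm.REDUCE_SUM, [t], groups=send_groups)
xm.mark_step()  # cut the trace: the first collective gets its own graph

xm.all_reduce(xm.REDUCE_SUM, [t], groups=recv_groups)
xm.mark_step()  # the second collective lands in a separate graph
```

Without the intermediate mark_step(), both all-reduces are traced into a single graph, which is the situation shown in the HLO above.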
@wfckl789 just want to check in to see if you were able to make forward progress?
In this case, as I observed, the cc compiler raised this compilation fault before the forward pass was executed, so I don't think any forward progress was made: I never saw a loss value from the first epoch.
Any particular reason for trying eager mode in the multi-worker case? Note: eager debug mode is only for debugging and is not the most performant mode of execution.
Hi, I found the error MPMD detected but reload is not supported yet will occur if I enable Eager Debug Mode for a model trained in a Neuron distributed environment with dp=1, tp=8, pp=4. Could you help check this issue? Thanks so much! I attach the related scripts here; you can simply run ./run_simple_model_tp_pp.sh after downloading them.
scripts.zip
Environment information:
EC2 Instance: trn1.32.xlarge
OS: Ubuntu 20.04
Neuron PyTorch: latest (2.18)