aws-neuron / neuronx-distributed


Error: MPMD detected but reload is not supported yet for neuron distributed environment with EAGER DEBUG MODE #21

Open wfckl789 opened 2 months ago

wfckl789 commented 2 months ago

Hi, I found that the error "MPMD detected but reload is not supported yet" occurs if I enable Eager Debug Mode for a model trained in a neuron distributed environment with dp=1, tp=8, pp=4. Could you help check this issue? Thanks so much!

[Screenshot of the error output]

I've attached the related scripts here; you can simply run ./run_simple_model_tp_pp.sh after downloading them.

scripts.zip

Environment information:

EC2 Instance: trn1.32.xlarge

OS: Ubuntu 20.04

Neuron PyTorch: latest (2.18)

aws-rhsoln commented 1 month ago

Thank you for reporting the issue. The "MPMD detected" error means that one worker is trying to load a new graph while another worker is waiting to perform inference, so graphs that contain collectives are unable to communicate with each other. The assumption here is that each worker performs the same set of operations, so that all workers run in SPMD mode. That assumption is broken, which is why you see the error above. We are looking into the scripts now to identify which worker is producing a new graph (and why), and whether we can modify the scripts so that graphs with collectives do not change from one iteration to the next. We will report back once we have an update.
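To make the SPMD assumption concrete, here is a minimal, hypothetical sketch (not taken from the attached scripts) of how a single rank-dependent branch can break it under torch_xla's lazy tracing: the rank that traces the extra op lowers a different graph, and the collective inside that graph has no matching partner on the other ranks.

```python
import torch
import torch_xla.core.xla_model as xm

def train_step(x, rank):
    y = x * 2
    # Hypothetical divergence: only rank 0 traces this extra op, so rank 0's lazy
    # graph (and the all-reduce captured inside it) differs from every other rank's.
    if rank == 0:
        y = y + 1
    y = xm.all_reduce(xm.REDUCE_SUM, y)  # collectives must appear identically on all ranks
    xm.mark_step()                       # cut and execute the traced graph
    return y
```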

wfckl789 commented 1 month ago

> Thank you for reporting the issue. The "MPMD detected" error means that one worker is trying to load a new graph while another worker is waiting to perform inference […]

Thanks for your detailed explanation! I'd also like to know: will this error interrupt the process of generating HLO graphs? My assumption is that I can't get all of the HLO graphs because the graph generation process will be shut down by this error. Is that right?

aws-rhsoln commented 1 month ago

Only one graph can be executed at a time. In this case, since the run errored out at this graph, you can only generate graphs up to this point. If you want to generate all of the graphs without worrying about execution, you can run with neuron_parallel_compile (prefixing your existing training command with it). The utility extracts all of the HLOs, compiles them, and saves them in the cache.

aws-rhsoln commented 1 month ago

We have managed to reproduce the issue. It happens only with eager mode. There seems to be a bug that causes two collectives with different replica groups to end up in the same graph, so that graph now has to communicate with two graphs at the same time. If you look at one of the graphs, it contains the following collectives:

%all-reduce = (bf16[1,4]{1,0}, bf16[]) all-reduce(bf16[1,4]{1,0} %p1, bf16[] %convert), replica_groups={{0,8},{16,24},{1,9},{17,25},{2,10},{18,26},{3,11},{19,27},{4,12},{20,28},{5,13},{21,29},{6,14},{22,30},{7,15},{23,31}}, constrain_layout=true, to_apply=%AddComputation.7, metadata={op_type="xla__cross_replica_sum" op_name="NxDModel[model]/NxDPPModel[module]/xla__cross_replica_sum" source_file="/home/ubuntu/aws_neuron_venv/lib/python3.8/site-packages/torch_xla/core/xla_model.py" source_line=590}
...
%all-reduce.1 = (bf16[1,4]{1,0}, bf16[]) all-reduce(bf16[1,4]{1,0} %p2, bf16[] %convert.2), replica_groups={{8,16},{24,0},{9,17},{25,1},{10,18},{26,2},{11,19},{27,3},{12,20},{28,4},{13,21},{29,5},{14,22},{30,6},{15,23},{31,7}}, constrain_layout=true, to_apply=%AddComputation.20, metadata={op_type="xla__cross_replica_sum" op_name="NxDModel[model]/NxDPPModel[module]/xla__cross_replica_sum" source_file="/home/ubuntu/aws_neuron_venv/lib/python3.8/site-packages/torch_xla/core/xla_model.py" source_line=590}

As you can see, it is trying to send and receive a tensor at the same time; ideally those should be in two separate graphs. We will look into this issue and update this ticket when we have a fix. Note: eager debug mode is meant mainly for single-worker workloads, and distributed workloads with eager mode are not supported yet. If the intention is mainly to debug the script, you can make use of this guide: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programm[…]training/pytorch-neuron-debug.html?highlight=print
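For illustration only, here is a minimal torch_xla sketch (not the NxD pipeline code itself, and assuming a 32-worker run like the one above) of how two all-reduces over different replica groups can be traced into one lazy graph; an explicit xm.mark_step() between them is one way to force them into separate graphs, though whether that is the right fix inside NxD is a separate question.

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
t1 = torch.ones(1, 4, dtype=torch.bfloat16, device=device)
t2 = torch.ones(1, 4, dtype=torch.bfloat16, device=device)

# Replica groups copied from the two all-reduces in the HLO above.
groups_a = [[0, 8], [16, 24], [1, 9], [17, 25], [2, 10], [18, 26], [3, 11], [19, 27],
            [4, 12], [20, 28], [5, 13], [21, 29], [6, 14], [22, 30], [7, 15], [23, 31]]
groups_b = [[8, 16], [24, 0], [9, 17], [25, 1], [10, 18], [26, 2], [11, 19], [27, 3],
            [12, 20], [28, 4], [13, 21], [29, 5], [14, 22], [30, 6], [15, 23], [31, 7]]

r1 = xm.all_reduce(xm.REDUCE_SUM, t1, groups=groups_a)
# Without a graph cut here, both all-reduces land in the same lazy graph,
# which is what the HLO above shows.
# xm.mark_step()  # an explicit cut here would lower them as two separate graphs
r2 = xm.all_reduce(xm.REDUCE_SUM, t2, groups=groups_b)
xm.mark_step()
```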

jeffhataws commented 1 month ago

@wfckl789 Just want to check in to see if you were able to make forward progress?

wfckl789 commented 1 month ago

> @wfckl789 Just want to check in to see if you were able to make forward progress?

In this case, as I observed, the cc compiler raised this compilation fault before the forward pass was executed, so I don't think the run made forward progress: I never saw a loss value from the first epoch.