aws-neuron / neuronx-distributed


MPMD detected error when using `optimum-neuron` with TP #24

Open michaelbenayoun opened 6 days ago

michaelbenayoun commented 6 days ago

So basically I am trying to train Llama / Mistral.

I run the following command:

NEURON_RT_LOG_LEVEL=info XLA_USE_BF16=1 ./train_mistral.sh

Here is the link to train_mistral.sh

The issue is that I get an `MPMD detected` error, which means that at some point at least 2 workers try to execute different graphs. So I tried to check the diff between the two HLO graphs. I ran the script multiple times; I cannot say I always end up with the same diff, but multiple times I ended up with this:

[Screenshot: HLO diff between the two workers' graphs]
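For reference, here is a minimal sketch of one way to dump the HLO text each worker is about to execute, so the per-rank graphs can be diffed offline. This is not part of the training script, and `_get_xla_tensors_hlo` is a private torch_xla helper whose exact location may differ across versions:

```python
# Hedged sketch: dump the pending HLO graph per rank for offline diffing.
# `_get_xla_tensors_hlo` is a private torch_xla API; treat it as an assumption.
import torch_xla
import torch_xla.core.xla_model as xm

def dump_pending_hlo(tensors, step):
    # HLO text of the graph that would be compiled for these live tensors.
    hlo = torch_xla._XLAC._get_xla_tensors_hlo(tensors)
    rank = xm.get_ordinal()
    with open(f"hlo_rank{rank}_step{step}.txt", "w") as f:
        f.write(hlo)

# Example usage inside the training loop, right before xm.mark_step():
# dump_pending_hlo([loss], step)
```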

Basically, after analyzing it a bit, I think this computation comes from the ParallelEmbedding layer. For some reason, what is considered a constant equal to 0 in one case is considered a parameter in the other case.
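For context, a Megatron-style vocab-parallel embedding forward looks roughly like the sketch below. This is a simplified illustration, not the actual neuronx-distributed ParallelEmbedding code, and the `vocab_start_index` / `vocab_end_index` names are assumptions. It shows where a per-rank scalar enters the graph: on TP rank 0 the start index is 0 while on other ranks it is non-zero, which could explain why the same quantity is folded as a constant in one graph and passed as a parameter in the other.

```python
# Simplified sketch of a Megatron-style vocab-parallel embedding forward.
# Not the neuronx-distributed implementation; names are illustrative only.
import torch

def vocab_parallel_embedding_forward(input_ids, weight_shard,
                                     vocab_start_index, vocab_end_index):
    # Mask out token ids owned by other TP ranks' vocabulary shards.
    input_mask = (input_ids < vocab_start_index) | (input_ids >= vocab_end_index)
    # Shift ids into the local shard's range. On TP rank 0 the offset is 0,
    # so XLA may fold the subtraction away, while other ranks carry a
    # non-zero offset, one plausible source of per-rank graph differences.
    masked_input = input_ids.clone() - vocab_start_index
    masked_input[input_mask] = 0
    output = torch.nn.functional.embedding(masked_input, weight_shard)
    # Zero out rows for tokens owned by other ranks; the all-reduce that
    # combines the shards is omitted from this sketch.
    output[input_mask, :] = 0.0
    return output
```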

I thought it could be linked to scalar specialization by XLA, so I also ran the job with XLA_NO_SPECIAL_SCALARS=1, but ended up with an MPMD detected error as well.
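To illustrate what I mean by scalar specialization, here is a standalone snippet, not from the repro, that assumes torch_xla's default special handling of the scalar values 0 and 1. It shows how a Python scalar equal to 0 can produce a different HLO than a non-zero scalar:

```python
# Standalone illustration of XLA scalar specialization (assumption: by
# default, torch_xla can embed the special scalar values 0 and 1 as graph
# constants, while other scalars are passed in as device parameters).
import torch
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.ones(4, device=device)

for offset in (0, 5):
    y = x + offset  # Python scalar captured by the lazy trace
    print(f"offset={offset}")
    # Print the HLO text for this pending computation; the offset=0 and
    # offset=5 graphs can differ in whether the scalar is a constant or a
    # parameter, which is what XLA_NO_SPECIAL_SCALARS=1 is meant to disable.
    print(torch_xla._XLAC._get_xla_tensors_hlo([y]))
```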

So I tried not to use ParallelEmbedding. When sequence parallelism is enabled, I end up with:

[Screenshot: HLO diff without ParallelEmbedding, sequence parallelism enabled]

Finally, I tried disabling sequence parallelism and ended up with:

[Screenshot: HLO diff without ParallelEmbedding, sequence parallelism disabled]

Note: when I disable tensor parallelism it seems to be working properly.