aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Error: MPMD detected but reload is not supported yet #882

Closed wfckl789 closed 1 month ago

wfckl789 commented 1 month ago

Hi, I get the error "MPMD detected but reload is not supported yet" when I enable Eager Debug Mode for a model trained in a Neuron distributed environment with dp=1, tp=8, pp=4. Could you help look into this issue? Thanks so much!


I have attached the related scripts below. After downloading them, you can simply run ./run_simple_model_tp_pp.sh to reproduce the error.

scripts.zip

Environment information:

EC2 Instance: trn1.32xlarge

OS: Ubuntu 20.04

Neuron PyTorch: Latest 2.18
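
As a sanity check on the configuration above, here is a minimal sketch (not part of the attached scripts) showing that dp=1, tp=8, pp=4 requires 32 workers. The helper names `world_size` and `fits_on_instance` are hypothetical, and the core count assumes a trn1.32xlarge exposes 32 NeuronCores (16 Trainium chips with 2 cores each).

```python
# Assumption: a trn1.32xlarge exposes 32 NeuronCores (16 chips x 2 cores).
NEURON_CORES = 32


def world_size(dp: int, tp: int, pp: int) -> int:
    """Total worker processes needed for a dp x tp x pp parallel layout."""
    return dp * tp * pp


def fits_on_instance(dp: int, tp: int, pp: int, cores: int = NEURON_CORES) -> bool:
    """True if the layout uses exactly the available cores (one worker per core)."""
    return world_size(dp, tp, pp) == cores


if __name__ == "__main__":
    # Configuration reported in this issue.
    print(world_size(dp=1, tp=8, pp=4))
    print(fits_on_instance(dp=1, tp=8, pp=4))
```

With these assumptions the layout fills the instance exactly, so the error is unlikely to come from a mismatch between the parallel degrees and the number of available cores.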

aws-rhsoln commented 1 month ago

It's a duplicate of this issue: https://github.com/aws-neuron/neuronx-distributed/issues/21. Closing this one.