Hi, I hit the error `MPMD detected but reload is not supported yet` when I enable Eager Debug Mode for a model trained in a Neuron distributed environment with dp=1, tp=8, pp=4. Could you help check this issue? Thanks so much!

I have attached the related scripts (scripts.zip). After downloading them, you can simply run `./run_simple_model_tp_pp.sh`.

Environment information:
EC2 Instance: trn1.32xlarge
OS: Ubuntu 20.04
Neuron PyTorch: latest (2.18)
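
For context on the parallel configuration, here is a minimal sketch of how the degrees relate to the device count. It assumes the number of worker processes is the product dp × tp × pp and that the instance exposes 32 NeuronCores (16 Trainium chips with 2 cores each). The helper name `world_size` is hypothetical and is not taken from the attached scripts or from any Neuron API.

```python
# Hypothetical helper for illustration only; not part of the attached
# scripts and not a Neuron library function.
def world_size(dp: int, tp: int, pp: int) -> int:
    """Return the number of worker processes implied by the parallel degrees."""
    return dp * tp * pp


# Assumption: one trn1.32xlarge exposes 32 NeuronCores (16 chips x 2 cores).
NEURON_CORES = 32

# Degrees reported in this issue: data, tensor, and pipeline parallelism.
ranks = world_size(dp=1, tp=8, pp=4)

# With these degrees every NeuronCore hosts exactly one rank.
fits_on_one_instance = ranks <= NEURON_CORES
print(ranks, fits_on_one_instance)
```

So the dp=1, tp=8, pp=4 setup would occupy the whole instance under these assumptions, with the model split into 4 pipeline stages of 8 tensor-parallel ranks each.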