aws / sagemaker-training-toolkit

Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0
488 stars, 117 forks

Add SM dataparallel exception class in mpi distribution #185

Closed. stu1130 closed this 1 year ago

stu1130 commented 1 year ago

Running FSDP, DS, and Megatron DS with the SMDDP backend will be supported on the MPI distribution, so we will start tracking error telemetry for SMDDP exceptions.

Description of changes: Includes the SMDDP exception in the MPI distribution.

Testing done: I have verified that a training job works properly with the new change.

Here is the sample job output:

UnexpectedStatusException: Error for Training job vit-fsdp-2023-06-14-22-27-25-988: Failed. Reason: AlgorithmError: SMDDPError:
ExitCode 134
ErrorMessage "raise exceptions.SMDDPError('Error')
 smdistributed.dataparallel.exceptions.SMDDPError: Error
 Traceback (most recent call last)
 File "/opt/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
 return _run_code(code, main_globals, None,
 File "/opt/conda/lib/python3.9/runpy.py", line 87, in _run_code
 exec(code, run_globals)
 File "/opt/conda/lib/python3.9/site-packages/mpi4py/__main__.py", line 7, in <module>
 main()
 File "/opt/conda/lib/python3.9/site-packages/mpi4py/run.py", line 198, in main
 run_command_line(args)
 File "/opt/conda/lib/python3.9/site-packages/mpi4py/run.py", line 47, in run_command_line
 run_path(sys.argv[0], run_name='__main__')
 File "/opt/conda/lib/python3.9/runpy.py", line 288, in run_path
 return _run_module_code(code, init_globals, run_name,
 File "/opt/conda/lib/python3.9/runpy.py", line 97, in _run_module_code
 _run_code(code, mod_globals, init_globals,
 File "train_vit_fsdp_mpi.py", line 31, in <module>
 import 
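
For illustration, the fully qualified exception name in the output above could be picked out of captured stderr roughly as follows. This is a minimal sketch, not the toolkit's implementation: `find_exception_name` and `WATCHED_PREFIXES` are hypothetical names, and only the `smdistributed.dataparallel.exceptions` prefix is taken from the sample output.

```python
import re
from typing import Iterable, Optional

# Hypothetical tuple of module prefixes to watch for. Only this one prefix
# appears in the sample job output; any others would be assumptions.
WATCHED_PREFIXES = ("smdistributed.dataparallel.exceptions",)


def find_exception_name(
    stderr_text: str, prefixes: Iterable[str] = WATCHED_PREFIXES
) -> Optional[str]:
    """Return the first fully qualified exception name whose module
    starts with one of the watched prefixes, or None if none is found."""
    prefixes = tuple(prefixes)
    for line in stderr_text.splitlines():
        stripped = line.strip()
        for prefix in prefixes:
            # Match lines shaped like "<prefix>.<ClassName>: <message>".
            match = re.match(rf"({re.escape(prefix)}\.\w+):", stripped)
            if match:
                return match.group(1)
    return None


# Excerpt copied from the sample job output above.
sample = """raise exceptions.SMDDPError('Error')
 smdistributed.dataparallel.exceptions.SMDDPError: Error
 Traceback (most recent call last)"""

print(find_exception_name(sample))
```

Matching on the module prefix rather than on the bare class name keeps unrelated exceptions that happen to share a class name from being counted.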


Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

yl-to commented 1 year ago

Did you run a local test showing that the SMDDP error is catchable in the MPI runner?