Running FSDP, DS, and Megatron DS with the SMDDP backend will be supported on the MPI distribution. This change starts tracking error telemetry for SMDDP exceptions.
Description of changes:
Includes the SMDDP exception (`SMDDPError`) in the MPI distribution so that it is reported in error telemetry.
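As a rough illustration of what tracking an exception for telemetry can involve, the sketch below scans captured stderr for known exception class names and returns the first match. It is not the code changed in this PR: `TRACKED_EXCEPTIONS` and `find_tracked_exception` are hypothetical names, and only the `SMDDPError` class name and the sample lines are taken from the job output in this description.

```python
from typing import Iterable, Optional

# Hypothetical registry of exception class names to report.
# Only "SMDDPError" comes from this PR's sample job output.
TRACKED_EXCEPTIONS = ("SMDDPError",)


def find_tracked_exception(
    stderr_text: str, tracked: Iterable[str] = TRACKED_EXCEPTIONS
) -> Optional[str]:
    """Return the first tracked exception name found in stderr_text, else None."""
    names = tuple(tracked)
    for line in stderr_text.splitlines():
        # Take the text before the first colon, e.g.
        # "smdistributed.dataparallel.exceptions.SMDDPError".
        head = line.strip().split(":", 1)[0]
        for name in names:
            # Accept a bare class name or a fully qualified one.
            if head == name or head.endswith("." + name):
                return name
    return None


sample = (
    "raise exceptions.SMDDPError('Error')\n"
    "smdistributed.dataparallel.exceptions.SMDDPError: Error\n"
)
print(find_tracked_exception(sample))
```

Matching on the text before the first colon keeps the check from firing on source lines such as `raise exceptions.SMDDPError('Error')`, which mention the class but are not the exception report itself.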
Testing done:
I have verified that a training job works properly with the new change.
Here is the sample job output:
UnexpectedStatusException: Error for Training job vit-fsdp-2023-06-14-22-27-25-988: Failed. Reason: AlgorithmError: SMDDPError:
ExitCode 134
ErrorMessage "raise exceptions.SMDDPError('Error')
smdistributed.dataparallel.exceptions.SMDDPError: Error
Traceback (most recent call last)
File "/opt/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.9/site-packages/mpi4py/__main__.py", line 7, in <module>
main()
File "/opt/conda/lib/python3.9/site-packages/mpi4py/run.py", line 198, in main
run_command_line(args)
File "/opt/conda/lib/python3.9/site-packages/mpi4py/run.py", line 47, in run_command_line
run_path(sys.argv[0], run_name='__main__')
File "/opt/conda/lib/python3.9/runpy.py", line 288, in run_path
return _run_module_code(code, init_globals, run_name,
File "/opt/conda/lib/python3.9/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "train_vit_fsdp_mpi.py", line 31, in <module>
import
Merge Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.
General
Tests
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.