aws / sagemaker-training-toolkit

Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0
488 stars 117 forks

fix: SMDDP does not support P5 instances with SMP #192

Closed apoorvtintin closed 11 months ago

apoorvtintin commented 1 year ago

Issue #, if available:

Description of changes: SMDDP collectives are not supported on P5 instance types. This change disallows the use of SMDDP collectives with the SMP training framework on P5 instances.
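
The kind of guard described above could be sketched roughly as follows. This is only an illustration, not the code in this PR: the function names `is_p5_instance` and `validate_smddp_with_smp`, the `smddp_collectives_enabled` flag, and the exact-match rule on the `p5` family are all hypothetical assumptions.

```python
def is_p5_instance(instance_type: str) -> bool:
    """Return True for P5 instance types such as 'ml.p5.48xlarge'.

    Assumption: instance types follow the 'ml.<family>.<size>' pattern and
    only the exact family name 'p5' is treated as P5.
    """
    parts = instance_type.lower().split(".")
    return len(parts) == 3 and parts[0] == "ml" and parts[1] == "p5"


def validate_smddp_with_smp(instance_type: str, smddp_collectives_enabled: bool) -> None:
    """Raise ValueError if SMDDP collectives are requested with SMP on P5.

    Hypothetical helper: the real toolkit may surface this differently,
    for example by logging a warning and falling back to other collectives.
    """
    if smddp_collectives_enabled and is_p5_instance(instance_type):
        raise ValueError(
            f"SMDDP collectives are not supported with SMP on {instance_type}."
        )


if __name__ == "__main__":
    # Allowed: non-P5 instance with SMDDP collectives enabled.
    validate_smddp_with_smp("ml.p4d.24xlarge", smddp_collectives_enabled=True)
    # Rejected: P5 instance with SMDDP collectives enabled.
    try:
        validate_smddp_with_smp("ml.p5.48xlarge", smddp_collectives_enabled=True)
    except ValueError as err:
        print(err)
```

Raising early, before the training processes launch, keeps the failure close to the configuration mistake; whether the actual change raises or silently disables SMDDP is not stated in the description.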

Testing done: Unit tests on smddpmprun pass on DLC PT2.0.1

before change
================ 3 failed, 282 passed, 2 skipped, 1 xfailed, 2 xpassed, 10 warnings, 2 error in 532.36 seconds ================

after change
================ 3 failed, 282 passed, 2 skipped, 1 xfailed, 2 xpassed, 10 warnings, 2 error in 532.36 seconds ================

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

apoorvtintin commented 1 year ago

@emeraldbay Due to significant changes in hardware, P5 support will be added at a later date.

emeraldbay commented 1 year ago

Please fix the formatting by running tox -e flake8,black-check,pylint