aws / sagemaker-training-toolkit

Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0

fix: SMDDP does not support P5 instances with SMP #194

Closed apoorvtintin closed 1 year ago

apoorvtintin commented 1 year ago

Created duplicate PR to pass PR pipeline

Issue #, if available:

Description of changes: SMDDP collectives are not supported on P5 instance types. This change disallows the use of SMDDP collectives with the SMP training framework on P5 instances.
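The kind of gating described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `SMDDP_UNSUPPORTED_FAMILIES`, `instance_family`, and `smddp_collectives_allowed` are hypothetical, and the sketch assumes P5 instance types can be recognized by an `ml.p5` family prefix. The lint output later in this thread suggests the actual change uses a list of supported instances instead, whose contents are not shown here.

```python
# Illustrative sketch only: these names are hypothetical and are not taken
# from sagemaker_training.

# Assumption: P5 instance types share the "ml.p5" family prefix.
SMDDP_UNSUPPORTED_FAMILIES = frozenset({"ml.p5"})


def instance_family(instance_type: str) -> str:
    """Return the family part of a type string, e.g. 'ml.p5.48xlarge' -> 'ml.p5'."""
    return instance_type.rsplit(".", 1)[0]


def smddp_collectives_allowed(instance_type: str) -> bool:
    """Return False when the instance family cannot use SMDDP collectives."""
    return instance_family(instance_type) not in SMDDP_UNSUPPORTED_FAMILIES


if __name__ == "__main__":
    for candidate in ("ml.p4d.24xlarge", "ml.p5.48xlarge"):
        print(candidate, smddp_collectives_allowed(candidate))
```

A deny-list keeps the sketch tied to the one fact stated in the description (P5 is unsupported). An allow-list would be the stricter choice if new instance families should be opted in explicitly.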

Testing done: Unit tests on smddpmprun pass on DLC PT2.0.1.

Before change:
================ 3 failed, 282 passed, 2 skipped, 1 xfailed, 2 xpassed, 10 warnings, 2 error in 532.36 seconds ================

After change:
================ 3 failed, 282 passed, 2 skipped, 1 xfailed, 2 xpassed, 10 warnings, 2 error in 532.36 seconds ================

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

emeraldbay commented 1 year ago

Please fix the flake8 formatting error below:

ERROR: invocation failed (exit code 1), logfile: /codebuild/output/src1953528505/src/github.com/aws/sagemaker-training-toolkit/.tox/flake8/log/flake8-0.log
================================== log start ===================================
flake8 create: /codebuild/output/src1953528505/src/github.com/aws/sagemaker-training-toolkit/.tox/flake8
flake8 installdeps: flake8, pep8-naming, flake8-import-order
flake8 inst: /codebuild/output/src1953528505/src/github.com/aws/sagemaker-training-toolkit/.tox/.tmp/package/2/sagemaker_training-4.7.1.dev0.zip
flake8 installed: bcrypt==4.0.1,boto3==1.28.52,botocore==1.31.52,cffi==1.15.1,cryptography==41.0.4,flake8==5.0.4,flake8-import-order==0.18.2,gevent==22.10.2,greenlet==2.0.2,importlib-metadata==4.2.0,inotify-simple==1.2.1,jmespath==1.0.1,MarkupSafe==2.1.3,mccabe==0.7.0,numpy==1.21.6,paramiko==3.3.1,pep8-naming==0.13.3,protobuf==3.20.3,psutil==5.9.5,pycodestyle==2.9.1,pycparser==2.21,pyflakes==2.5.0,PyNaCl==1.5.0,python-dateutil==2.8.2,retrying==1.3.4,s3transfer==0.6.2,sagemaker-training @ file:///codebuild/output/src1953528505/src/github.com/aws/sagemaker-training-toolkit/.tox/.tmp/package/2/sagemaker_training-4.7.1.dev0.zip,scipy==1.7.3,six==1.16.0,typing_extensions==4.7.1,urllib3==1.26.16,Werkzeug==2.2.3,zipp==3.15.0,zope.event==5.0,zope.interface==6.0
flake8 run-test-pre: PYTHONHASHSEED='1248064069'
flake8 run-test: commands[0] | flake8 --config=.flake8
./src/sagemaker_training/mpi.py:207:10: N806 variable 'SMDDP_SMP_SUPPORTED_INSTANCES' in function should be lowercase
ERROR: InvocationError for command /codebuild/output/src1953528505/src/github.com/aws/sagemaker-training-toolkit/.tox/flake8/bin/flake8 --config=.flake8 (exited with code 1)
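
N806 is a pep8-naming check that flags an uppercase name bound inside a function body. Two common ways to satisfy it are sketched below: hoist the constant to module level, where an uppercase name is conventional, or keep it local and give it a lowercase name. The function names and the tuple contents are placeholders for illustration, since the real list in mpi.py is not shown in this thread.

```python
# Illustrative sketch only. The tuple contents are placeholders; the real
# list of supported instances in mpi.py is not shown in this thread.

# Option 1: hoist the constant to module level. N806 only applies to names
# bound inside a function, so the uppercase name is fine here.
SMDDP_SMP_SUPPORTED_INSTANCES = ("ml.placeholder.xlarge",)


def supported_via_module_constant(instance_type: str) -> bool:
    """Check membership against the module-level constant."""
    return instance_type in SMDDP_SMP_SUPPORTED_INSTANCES


# Option 2: keep the value local to the function and use a lowercase name.
def supported_via_local_name(instance_type: str) -> bool:
    """Check membership against a lowercase local variable."""
    smddp_smp_supported_instances = ("ml.placeholder.xlarge",)
    return instance_type in smddp_smp_supported_instances
```

Either form should clear the finding at mpi.py:207. The module-level constant is also easier to reference from unit tests.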