Closed vishwakaria closed 1 year ago
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
@vishwakaria Please follow the commit message style described in https://github.com/aws/sagemaker-training-toolkit/blob/master/CONTRIBUTING.md#committing-your-change. If you are done with your changes, please squash them into a single commit.
Addressed your comments and squashed all commits. Can you take another look @satishpasumarthi? Thank you.
Description of changes: Add support for SMDDP collectives in PT DDP distribution via a new parameter in the distribution dictionary.
The distribution will set a configuration parameter called `sagemaker_communication_backend`. If the value is `auto`, we will preload libsmddp, which has the SageMaker-optimized implementation of AllReduce. If the value is `nccl`, we will just use the NCCL AllReduce implementation.
Testing done:
Added unit tests
Created a custom docker image and ran successful training
Related PySDK PR: https://github.com/aws/sagemaker-python-sdk/pull/3490
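The backend selection described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `build_backend_env`, the library file name `libsmddp.so`, the use of `LD_PRELOAD` as the preload mechanism, and the `nccl` default are all assumptions made for the example. Only the parameter name `sagemaker_communication_backend` and its `auto` and `nccl` values come from the description.

```python
import os

# Parameter name taken from the PR description.
BACKEND_PARAM = "sagemaker_communication_backend"

# Hypothetical library file name; the real path depends on the container image.
SMDDP_LIBRARY = "libsmddp.so"


def build_backend_env(params, base_env=None):
    """Return a copy of the environment adjusted for the chosen backend.

    Hypothetical helper: "auto" preloads the SMDDP library (assumed here to
    happen through LD_PRELOAD), "nccl" leaves the environment untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Falling back to "nccl" when the parameter is missing is an assumption.
    backend = params.get(BACKEND_PARAM, "nccl")

    if backend == "auto":
        existing = env.get("LD_PRELOAD")
        # Put the SMDDP library first, keeping anything already preloaded.
        env["LD_PRELOAD"] = (
            f"{SMDDP_LIBRARY}:{existing}" if existing else SMDDP_LIBRARY
        )
    elif backend != "nccl":
        raise ValueError(f"Unsupported communication backend: {backend!r}")

    return env
```

A caller could pass the returned mapping as the `env` argument of `subprocess.Popen` when launching the training processes, so that the preload only affects the child processes and not the launcher itself.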
Merge Checklist
Put an `x` in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.
General
Tests
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.