aws / sagemaker-training-toolkit

Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0

feature: add support for SMDDP collectives to smdataparallel runner #162

Closed vishwakaria closed 1 year ago

vishwakaria commented 1 year ago

Description of changes: Add support for SMDDP collectives in the PyTorch DDP (PT DDP) distribution via a new parameter in the distribution dictionary:

communication_options: {
    "backend": "nccl",  # default value is "auto"
}

The distribution will set a configuration parameter called sagemaker_communication_backend. If the value is auto, we will preload libsmddp, which has the SageMaker-optimized implementation of AllReduce. If the value is nccl, we will just use the NCCL AllReduce implementation.
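The backend selection described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name `build_runner_env`, the library path `/opt/example/libsmddp.so`, and the use of `LD_PRELOAD` as the preload mechanism are all assumptions made for the example.

```python
import os

# Values taken from the PR description: "auto" is the default, "nccl" opts out of SMDDP.
SUPPORTED_BACKENDS = ("auto", "nccl")


def build_runner_env(
    communication_backend="auto",
    smddp_lib_path="/opt/example/libsmddp.so",  # hypothetical path, not from the PR
    base_env=None,
):
    """Return a copy of the environment for the training process.

    Hypothetical helper: with "auto" the SMDDP library is preloaded
    (assumed here to happen via LD_PRELOAD); with "nccl" the environment
    is left alone so the plain NCCL AllReduce is used.
    """
    env = dict(os.environ if base_env is None else base_env)
    backend = communication_backend.lower()

    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unsupported backend {communication_backend!r}; "
            f"expected one of {SUPPORTED_BACKENDS}"
        )

    if backend == "auto":
        # Prepend so the SMDDP collectives take precedence over anything
        # that is already being preloaded.
        existing = env.get("LD_PRELOAD")
        env["LD_PRELOAD"] = (
            f"{smddp_lib_path}:{existing}" if existing else smddp_lib_path
        )

    return env
```

With this sketch, `build_runner_env("nccl", base_env={})` returns an environment without `LD_PRELOAD`, while the default `"auto"` adds the library path. Rejecting unknown values early is a design choice of the example, so that a typo in the backend name does not silently fall back to one of the two behaviors.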

Testing done:


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

sagemaker-bot commented 1 year ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

satishpasumarthi commented 1 year ago

Please follow the commit message style described in https://github.com/aws/sagemaker-training-toolkit/blob/master/CONTRIBUTING.md#committing-your-change. If you are done with your changes, see if you can squash them into one commit, @vishwakaria.

vishwakaria commented 1 year ago

> Please follow the commit message style as per https://github.com/aws/sagemaker-training-toolkit/blob/master/CONTRIBUTING.md#committing-your-change If you are done with your changes, see if you can squash them into one @vishwakaria

Addressed your comments and squashed all commits. Can you take another look @satishpasumarthi? Thank you.
