aws / sagemaker-training-toolkit

Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0

fix: Add NCCL_PROTO=simple environment variable to handle the out-of-order… #196

Closed ruhanprasad closed 1 year ago

ruhanprasad commented 1 year ago

… data delivery from EFA

Issue #, if available:

Description of changes: Sets NCCL_PROTO=simple explicitly to avoid data corruption errors caused by out-of-order data delivery from EFA. The latest aws-ofi-nccl sets this variable automatically (ref: https://github.com/aws/aws-ofi-nccl/blob/3d5e21db5f2b0f307c5d0c206c4b1b66d72d0d04/src/platform-aws.c#L222C28-L222C28), but earlier versions released in previous DLCs do not set it.

Also sets the RDMAV_FORK_SAFE variable so that the environment is fully consistent with the environment variables set in the smdistributed launcher.
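
For illustration only, here is a minimal sketch of how these two variables might be injected into a training process environment and forwarded through Open MPI's `-x NAME=value` flags. It is not the code changed in this PR: the helper names `build_efa_env` and `mpi_env_flags` are hypothetical, and the value `1` for RDMAV_FORK_SAFE is an assumption, since the description above does not state it.

```python
import os
from typing import Dict, List, Mapping, Optional

# NCCL_PROTO=simple comes from the PR description.
# RDMAV_FORK_SAFE=1 is an ASSUMED value; the description does not give one.
EFA_ENV_DEFAULTS: Dict[str, str] = {
    "NCCL_PROTO": "simple",
    "RDMAV_FORK_SAFE": "1",
}


def build_efa_env(base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of base_env with the EFA-related variables filled in.

    Hypothetical helper. Values already present in base_env are kept,
    so an explicit user setting takes precedence over the defaults.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name, value in EFA_ENV_DEFAULTS.items():
        env.setdefault(name, value)
    return env


def mpi_env_flags(env: Mapping[str, str]) -> List[str]:
    """Build Open MPI `-x NAME=value` flags for the EFA-related variables.

    Hypothetical helper; only the variables listed in EFA_ENV_DEFAULTS
    are forwarded, in the order they are declared there.
    """
    flags: List[str] = []
    for name in EFA_ENV_DEFAULTS:
        flags.extend(["-x", f"{name}={env[name]}"])
    return flags


if __name__ == "__main__":
    env = build_efa_env({"PATH": "/usr/bin"})
    print(mpi_env_flags(env))
```

Using `setdefault` is a design choice of this sketch: it lets a user who deliberately exports a different NCCL_PROTO keep that value. An implementation that must always enforce the setting would assign the value unconditionally instead.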

Testing done:

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

Tests

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.