Issue #, if available:
Description of changes: Set NCCL_PROTO=simple explicitly to avoid data corruption caused by out-of-order data delivery from EFA. The latest aws-ofi-nccl sets this variable automatically (ref: https://github.com/aws/aws-ofi-nccl/blob/3d5e21db5f2b0f307c5d0c206c4b1b66d72d0d04/src/platform-aws.c#L222C28-L222C28), but the earlier versions shipped in previous DLCs do not.
Also set the RDMAV_FORK_SAFE variable to be fully consistent with the environment variables set in the smdistributed launcher.
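For reviewers, here is a minimal sketch of the effect of this change, not the actual diff. The helper name `with_efa_defaults` and the constant `EFA_ENV_OVERRIDES` are hypothetical, and the value `"1"` for RDMAV_FORK_SAFE is an assumption, since the description above names the variable but not its value.

```python
import os

# Variables named in this PR description. The value "1" for
# RDMAV_FORK_SAFE is an assumption; the description does not state it.
EFA_ENV_OVERRIDES = {
    "NCCL_PROTO": "simple",
    "RDMAV_FORK_SAFE": "1",
}


def with_efa_defaults(env=None):
    """Return a copy of `env` with the EFA-related variables set explicitly.

    Hypothetical helper for illustration only. When `env` is None, the
    current process environment is copied. The input mapping is not mutated.
    """
    merged = dict(os.environ if env is None else env)
    # Set the values explicitly so that older aws-ofi-nccl versions,
    # which do not set NCCL_PROTO themselves, still get the safe protocol.
    merged.update(EFA_ENV_OVERRIDES)
    return merged
```

The returned mapping could then be passed as the `env` argument when the launcher spawns the training process.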
Testing done:
Merge Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.
General
Tests
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.