Closed ydaiming closed 2 years ago
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
LGTM !
Is there a related patch which confirms that the latest EFA installer is being used since this flag has dependency on that.
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
@rashika, We have a check on the instance type for this flag. There's no specific check for EFA. The toolkit is mostly run in controlled environments (DLCs), and checking specifically on libfabric, OFI, or EFA isn't easy. It's not in the scope of this PR.
Issue #, if available:
Without the
FI_EFA_USE_DEVICE_RDMA=1
, the NCCL performance is many times slower than with the flag. The flag was removed in https://github.com/aws/sagemaker-training-toolkit/pull/106 because of incompatibility, but should be re-enabled with latestEFA
with updatedlibfabric
and newerOFI
packages. This PR also updates and fixesblack
lint issues.Description of changes:
Add back
FI_EFA_USE_DEVICE_RDMA=1
flag for faster NCCL performance insmdataparallel
launcher and unit tests.feature: add back FI_EFA_USE_DEVICE_RDMA=1 flag, revert 2936f22 fix: fixed the
black
lint, upgraded black to version 21.3.0 fix: remove u prefix of strings, as python3 defaults to unicode stringsnote: EFA is only available on p3dn or p4dn instances note: EFA version 1.15.1 and OFI 1.1.5-aws have the issue fixed note: black format reference on remove u prefix https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#strings
Testing done:
Merge Checklist
Put an
x
in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.General
Tests
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.