aws / sagemaker-training-toolkit

Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0
496 stars 118 forks source link

add back FI_EFA_USE_DEVICE_RDMA=1 flag, revert 2936f22 #121

Closed ydaiming closed 2 years ago

ydaiming commented 2 years ago

Issue #, if available:

Without the FI_EFA_USE_DEVICE_RDMA=1, the NCCL performance is many times slower than with the flag. The flag was removed in https://github.com/aws/sagemaker-training-toolkit/pull/106 because of incompatibility, but should be re-enabled with latest EFA with updated libfabric and newer OFI packages. This PR also updates and fixes black lint issues.

Description of changes:

Add back FI_EFA_USE_DEVICE_RDMA=1 flag for faster NCCL performance in smdataparallel launcher and unit tests.

feature: add back FI_EFA_USE_DEVICE_RDMA=1 flag, revert 2936f22 fix: fixed the black lint, upgraded black to version 21.3.0 fix: remove u prefix of strings, as python3 defaults to unicode strings

note: EFA is only available on p3dn or p4dn instances note: EFA version 1.15.1 and OFI 1.1.5-aws have the issue fixed note: black format reference on remove u prefix https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#strings

Testing done:

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

Tests

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

sagemaker-bot commented 2 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 2 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 2 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 2 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 2 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 2 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 2 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

satishpasumarthi commented 2 years ago

LGTM !

rashika commented 2 years ago

Is there a related patch which confirms that the latest EFA installer is being used since this flag has dependency on that.

sagemaker-bot commented 2 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 2 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 2 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 2 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 2 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

ydaiming commented 2 years ago

@rashika, We have a check on the instance type for this flag. There's no specific check for EFA. The toolkit is mostly run in controlled environments (DLCs), and checking specifically on libfabric, OFI, or EFA isn't easy. It's not in the scope of this PR.