aws / sagemaker-training-toolkit

Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0
496 stars, 118 forks

Add 'ml.p5.48xlarge' as a supported instance for SM_EFA_NCCL_INSTANCES. #219

Open andjsmi opened 2 months ago

Describe the feature you'd like
Add "ml.p5.48xlarge" to the list of SM_EFA_NCCL_INSTANCES, so that the SageMaker Training Toolkit automatically sets FI_PROVIDER=efa for training jobs on that instance type and EFA is used.
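To make the request concrete, here is a minimal Python sketch of the behaviour being asked for. It is not the toolkit's actual code: `EFA_NCCL_INSTANCES` and `efa_env_for` are hypothetical names, and the list contents other than "ml.p5.48xlarge" are an illustrative assumption, since the real SM_EFA_NCCL_INSTANCES list is not shown in this issue.

```python
import os

# Hypothetical stand-in for the toolkit's SM_EFA_NCCL_INSTANCES list.
# "ml.p4d.24xlarge" is only an illustrative existing entry (assumption);
# "ml.p5.48xlarge" is the entry this issue asks to add.
EFA_NCCL_INSTANCES = {"ml.p4d.24xlarge", "ml.p5.48xlarge"}


def efa_env_for(instance_type, env=None):
    """Return a copy of env with FI_PROVIDER=efa when the instance type is listed."""
    # Copy so the caller's mapping (or os.environ) is never mutated.
    result = dict(os.environ if env is None else env)
    if instance_type in EFA_NCCL_INSTANCES:
        result["FI_PROVIDER"] = "efa"
    return result


if __name__ == "__main__":
    print(efa_env_for("ml.p5.48xlarge", {}).get("FI_PROVIDER"))
    print(efa_env_for("ml.m5.xlarge", {}).get("FI_PROVIDER"))
```

With the instance type in the list, the variable is set without any action from the user; for any other instance type the environment is left unchanged.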

How would this feature be used? Please describe.
EFA would be enabled automatically for SageMaker training jobs on ml.p5.48xlarge that use the SageMaker Training Toolkit.

Describe alternatives you've considered
Unsure. Currently you have to set the environment variable manually.
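For reference, the manual workaround might look like the sketch below. `with_efa` is a hypothetical helper, and passing the resulting dict through the `environment` argument of a SageMaker Python SDK estimator is an assumption about how the job is launched; NCCL_DEBUG appears only as an example of a setting the user already has.

```python
def with_efa(environment=None):
    """Return a copy of the job environment with FI_PROVIDER=efa added."""
    merged = dict(environment or {})
    # setdefault keeps any FI_PROVIDER value the user has already chosen.
    merged.setdefault("FI_PROVIDER", "efa")
    return merged


if __name__ == "__main__":
    # Example: merge with an existing setting before launching the job.
    job_environment = with_efa({"NCCL_DEBUG": "INFO"})
    print(job_environment)
```

The helper does not override an existing FI_PROVIDER value, so a job that deliberately selects another provider is left alone.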

Additional context
NA