aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0
2.09k stars 1.14k forks

@remote support for multi-instance training job #4125

Open sateeshmannar opened 1 year ago

sateeshmannar commented 1 year ago

Describe the feature you'd like

Need the ability to use @remote to run a multi-instance training job for distributed training.

How would this feature be used? Please describe.

Distributed training packages like h2o can be used with @remote. Currently @remote restricts the instance count for a training job to one instance.

Describe alternatives you've considered

Use sagemaker.estimator.Estimator to configure the distributed training job. This requires duplicating code when switching between local mode and instance-based training.

Additional context

We are in the process of switching from SageMaker notebook instances to SageMaker Studio. SageMaker Studio does not support local mode at this time, so in order to test with local mode we are using @remote. However, to train on large datasets we use distributed training. In the SageMaker notebook instance environment, sagemaker.estimator.Estimator easily allowed us to switch between local and multi-instance training. Not having an SDK function for a comparable local/distributed training option in Studio is causing a lot of rework of templates. Enhancing @remote to train on multiple instances would mitigate the concern.
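
For illustration, the Estimator-based switch between local and multi-instance training mentioned in this issue can be sketched as a small helper that only builds keyword arguments. This is a minimal sketch, not code from the issue: the helper name `estimator_kwargs` and its default values are hypothetical, while `image_uri`, `role`, `instance_type` (with `"local"` selecting local mode) and `instance_count` are parameters of `sagemaker.estimator.Estimator`.

```python
def estimator_kwargs(image_uri, role, distributed=False,
                     instance_type="ml.m5.xlarge", instance_count=2):
    """Build keyword arguments for sagemaker.estimator.Estimator.

    Hypothetical helper: the name and the defaults are assumptions
    made for this example, not part of the SageMaker SDK.
    """
    common = {"image_uri": image_uri, "role": role}
    if distributed:
        if instance_count < 2:
            raise ValueError("distributed training needs at least 2 instances")
        # Multi-instance training job on managed instances.
        return {**common,
                "instance_type": instance_type,
                "instance_count": instance_count}
    # Local mode: the training container runs on the current machine.
    return {**common, "instance_type": "local", "instance_count": 1}
```

The result would be passed as `Estimator(**estimator_kwargs(...))`, so the same template serves both cases and only the `distributed` flag changes.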

sateeshmannar commented 1 year ago

Is there a reason why multi-instance is enabled for Spark but not for training jobs? See the wrapper function in sagemaker/remote_function/client.py.
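
The behavior this comment asks about can be paraphrased as a guard of roughly the following shape. This is a sketch of the described behavior under stated assumptions, not the SDK source: the function name `check_instance_count` and the error message are made up for illustration, and only the `instance_count` and `spark_config` parameter names come from the `@remote` API.

```python
def check_instance_count(instance_count, spark_config=None):
    """Reject multi-instance jobs unless a Spark configuration is given.

    Illustrative paraphrase only; the name and the message are
    assumptions, not copied from sagemaker/remote_function/client.py.
    """
    if instance_count > 1 and spark_config is None:
        raise ValueError(
            "multi-instance jobs require spark_config; "
            "use instance_count=1 for training jobs"
        )
    return instance_count
```

Under this reading, the feature request amounts to relaxing the condition so that a non-Spark job may also pass `instance_count > 1`.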