aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

Enable processing and/or memory optimized instances when using sagemaker.remote_function's @remote decorator #4040

Open diegodebrito opened 12 months ago

diegodebrito commented 12 months ago

Describe the feature you'd like
Currently, only training instance types are allowed when using @remote (from the sagemaker.remote_function module). This module can be used for processing tasks as well, so it would be useful to have more instance types available.

How would this feature be used? Please describe.
Running processing tasks on instances with more than 256 GB of memory that don't need GPU acceleration. As far as I know, these are only available as Processing instances (where they are referred to as Memory optimized instances).

Describe alternatives you've considered
We can use SageMaker Processing jobs, and we currently do that. The downside is that local mode is not enabled when using SageMaker Studio, so it can be a little clunky to develop scripts locally before submitting them to a processing task. This is much easier when using @remote, since we can execute code directly without needing to map inputs/outputs, etc. Code for local testing and remote execution could be very similar, if not identical, in this case.
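A rough sketch of what "very similar if not identical" could look like in practice. The `maybe_remote` helper and the `RUN_REMOTE` environment variable are hypothetical names made up for illustration; only `sagemaker.remote_function.remote` and its `instance_type` argument come from the SDK. The instance type shown is just an example of a training instance that works today.

```python
import os
from typing import Callable


def maybe_remote(**remote_kwargs) -> Callable:
    """Hypothetical helper: apply @remote only when RUN_REMOTE=1 is set."""

    def decorator(func: Callable) -> Callable:
        if os.environ.get("RUN_REMOTE") != "1":
            # Local testing: return the plain function, no SageMaker involved.
            return func
        # Imported lazily so local runs need neither the SDK nor AWS credentials.
        from sagemaker.remote_function import remote

        return remote(**remote_kwargs)(func)

    return decorator


@maybe_remote(instance_type="ml.m5.xlarge")
def column_mean(values):
    """Toy processing step; the body is identical locally and remotely."""
    return sum(values) / len(values)


if __name__ == "__main__":
    print(column_mean([1.0, 2.0, 3.0]))
```

With this pattern the only thing that changes between a local run and a remote run is the environment variable, so a memory optimized instance type would be a one-argument change if @remote accepted it.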

Additional context
In case this is not clear, I'm referring to this functionality: https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-decorator.html

Link to the instance types available here: https://aws.amazon.com/sagemaker/pricing/

@jmahlik

jmahlik commented 11 months ago

This seems possible to implement if there were a processing job code path.

Maybe by splitting the _Job class into two classes, _TrainingJob and _ProcessingJob, in sagemaker.remote_function.job.py? That seems like the main point of difference. Some of the references to training in sagemaker.remote_function.client would also need to be updated. Those could be encapsulated in the _Job class.

We could start a skeleton of a PR for this but would like thoughts before going that route.