aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0
2.07k stars, 1.12k forks

Support Neuron SDK model cache for Trainium training jobs #3481

Closed juliensimon closed 6 months ago

juliensimon commented 1 year ago

The Neuron SDK compiles models for the Trainium accelerator. The compiled model is then stored in a disk cache, which saves time for further training jobs.

With SageMaker, the model is recompiled every time, as the model cache is stored inside the training container and lost across jobs. It'd be great to be able to automatically save the model cache to an S3 bucket, in order to pass it to the Estimator for the next training job.

I guess this is doable today by adding bespoke code in the training script, but a built-in feature would be nicer.
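As a rough sketch of what such bespoke code could look like: the training script restores a previously saved cache before training and copies it back to a persisted directory afterwards. Everything here is an assumption for illustration, not an SDK feature. `sync_dir` and `run_with_cache` are hypothetical helpers, `/var/tmp/neuron-compile-cache` is assumed to be the Neuron cache location, and `NEURON_COMPILE_CACHE_URL` is assumed to be the variable the installed Neuron SDK reads to locate its cache (check the documentation for your version).

```python
import os
import shutil
from pathlib import Path

# Assumed default location of the Neuron compile cache inside the container.
DEFAULT_CACHE_DIR = "/var/tmp/neuron-compile-cache"


def sync_dir(src, dst):
    """Copy every file under src into dst, merging with existing content.

    Returns the number of files found under src (0 if src does not exist).
    """
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        return 0
    # dirs_exist_ok requires Python 3.8+; dst is created if missing.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sum(1 for p in src.rglob("*") if p.is_file())


def run_with_cache(train_fn, persisted_dir, cache_dir=DEFAULT_CACHE_DIR):
    """Hypothetical wrapper: restore the cache, train, then save the cache.

    persisted_dir is any directory that survives across jobs, for example
    a folder that SageMaker syncs with S3.
    Returns (files_restored, files_saved).
    """
    restored = sync_dir(persisted_dir, cache_dir)
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    # Assumption: the Neuron SDK reads this variable to locate its cache.
    os.environ["NEURON_COMPILE_CACHE_URL"] = str(cache_dir)
    try:
        train_fn()
    finally:
        # Save even if training fails, so finished compilations are kept.
        saved = sync_dir(cache_dir, persisted_dir)
    return restored, saved
```

One way to use this would be to call `run_with_cache(train, "/opt/ml/checkpoints/neuron-compile-cache")` from the entry point and set `checkpoint_s3_uri` on the Estimator, so that SageMaker syncs `/opt/ml/checkpoints` with S3 between jobs. The subfolder name is arbitrary, and `train` stands for whatever function runs the training loop.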

akrishna1995 commented 6 months ago

We have added your feature request to our backlog and may consider it for a future SDK version. I will go ahead and close the issue now; please let me know if you have any more feedback.