Support Neuron SDK model cache for Trainium training jobs

aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker

Apache License 2.0

The Neuron SDK compiles models for the Trainium accelerator. The compiled model is then stored in an on-disk cache, so subsequent training jobs can reuse it and skip compilation.

With SageMaker, the model is recompiled for every training job, because the model cache lives inside the training container and is lost when the job ends. It'd be great to have the model cache saved automatically to an S3 bucket, so that it can be passed to the Estimator for the next training job.

I guess this is doable today by adding bespoke code in the training script, but a built-in feature would be nicer.
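
For reference, here's a rough sketch of what that bespoke code could look like. It assumes the Neuron cache lives in `/var/tmp/neuron-compile-cache` (I believe that's the default, but please check for your SDK version) and piggybacks on SageMaker's checkpoint directory, which is synced with S3 when `checkpoint_s3_uri` is set on the Estimator. The helper names `restore_cache` and `save_cache` are made up for illustration, and I haven't verified this end to end on a Trainium instance.

```python
import shutil
from pathlib import Path

# Assumptions (not confirmed in this issue): the default Neuron cache
# location, and SageMaker's default local checkpoint directory.
NEURON_CACHE_DIR = Path("/var/tmp/neuron-compile-cache")
CHECKPOINT_CACHE_DIR = Path("/opt/ml/checkpoints/neuron-compile-cache")


def copy_cache(src: Path, dst: Path) -> bool:
    """Copy the directory tree at src into dst; return False if src is missing."""
    if not src.is_dir():
        return False
    # dirs_exist_ok (Python 3.8+) merges into an existing destination.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return True


def restore_cache(local: Path = NEURON_CACHE_DIR,
                  saved: Path = CHECKPOINT_CACHE_DIR) -> bool:
    """Call before building the model: reuse a cache saved by a previous job."""
    return copy_cache(saved, local)


def save_cache(local: Path = NEURON_CACHE_DIR,
               saved: Path = CHECKPOINT_CACHE_DIR) -> bool:
    """Call after training: put the cache where SageMaker syncs it to S3."""
    return copy_cache(local, saved)
```

The training script would call `restore_cache()` at startup and `save_cache()` once training finishes, and the next job would be launched with the same `checkpoint_s3_uri`. It works in principle, but every training script has to carry this, hence the request.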

Support Neuron SDK model cache for Trainium training jobs #3481