aws / sagemaker-training-toolkit

Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0

Passing SIGTERM to entrypoint to be able to handle SPOT failures gracefully in user-code #173

Open croth1 opened 1 year ago

croth1 commented 1 year ago

Describe the feature you'd like

I would love to be able to use the SIGTERM handling offered by modern ML frameworks such as pytorch_lightning. If I understood correctly, when a spot interruption is announced, the container receives a SIGTERM and has 120 seconds before it is forcefully terminated. I would like that signal to be passed down to the entry point so that the SIGTERM handling callbacks provided by those frameworks can run.

How would this feature be used? Please describe.

The 120 seconds can be used to write out a checkpoint and, when using an experiment tracker, to terminate the experiment gracefully.
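To illustrate what the entry point could do with those 120 seconds, here is a minimal sketch in plain Python. It assumes SIGTERM actually reaches the training process, which is what this issue asks for. The names `StopRequested` and `train` and the JSON "checkpoint" are hypothetical stand-ins, not part of the toolkit or of pytorch_lightning.

```python
import json
import os
import signal
import tempfile


class StopRequested:
    """Hypothetical helper: remembers that SIGTERM arrived."""

    def __init__(self):
        self.flag = False

    def __call__(self, signum, frame):
        # Only set a flag here; the main loop does the actual cleanup.
        self.flag = True


def train(num_steps, checkpoint_path, stop, on_step=None):
    """Toy loop: runs until all steps are done or a stop was requested."""
    step = 0
    while step < num_steps and not stop.flag:
        step += 1  # stand-in for one real training step
        if on_step is not None:
            on_step(step)
    # Write a final checkpoint whether we finished or were interrupted.
    with open(checkpoint_path, "w") as f:
        json.dump({"step": step, "interrupted": stop.flag}, f)
    return step


if __name__ == "__main__":
    stop = StopRequested()
    signal.signal(signal.SIGTERM, stop)  # register before training starts
    path = os.path.join(tempfile.gettempdir(), "checkpoint.json")
    train(1000, path, stop)
```

The handler only flips a flag, so the checkpoint is written from the normal control flow rather than from inside the signal handler. An experiment tracker could be closed at the same point.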

Describe alternatives you've considered

One can simply not use the last 120 seconds, restart from the last checkpoint written out by the model, and accept that spot instance failures are marked as "failed" experiments in the MLflow experiment tracker.

Additional context

While getting to the bottom of this problem, I created a small proof of concept showing which changes would be necessary to make it work in my specific case, i.e. being able to handle SIGTERM in a shell script entry point (which could then pass it down to a pytorch_lightning training script). See here for an example: https://github.com/croth1/sagemaker-toolkit-sigterm-handling

Only a few changes are necessary to make this work in my specific case, see: https://github.com/aws/sagemaker-training-toolkit/compare/master...croth1:sigterm_forwarding?expand=1. However, this is just a proof of concept: there are many paths in the code base that eventually lead to entry point execution, and this fixes only the one I used.
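For readers who do not want to open the diff, the general idea of signal forwarding can be sketched as follows. This is not the code from the linked branch or from the toolkit. It is a generic illustration using only the standard library, and `run_with_signal_forwarding` is a hypothetical name.

```python
import signal
import subprocess
import sys


def run_with_signal_forwarding(cmd, signals=(signal.SIGTERM,)):
    """Run cmd as a child process and relay the given signals to it."""
    child = subprocess.Popen(cmd)

    def forward(signum, frame):
        # Relay the signal only while the child is still running.
        if child.poll() is None:
            child.send_signal(signum)

    # Install the forwarding handler and remember the previous handlers.
    previous = {s: signal.signal(s, forward) for s in signals}
    try:
        # wait() resumes after the handler has run and returns the
        # child's exit code once the child has shut down.
        return child.wait()
    finally:
        for s, handler in previous.items():
            signal.signal(s, handler)


if __name__ == "__main__":
    # Example: python forward.py ./train.sh  (script name is illustrative)
    if len(sys.argv) > 1:
        sys.exit(run_with_signal_forwarding(sys.argv[1:]))
```

With a wrapper like this, the parent does not exit on SIGTERM itself. It hands the signal to the entry point and returns the entry point's exit code, so the user code decides how to shut down.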

I hope this is of interest :)