Saving checkpoints at interrupt

aws-samples / amazon-sagemaker-managed-spot-training

Amazon SageMaker Managed Spot Training Examples

MIT No Attribution

50 stars 17 forks source link

Saving checkpoints at interrupt #3

Open Ridhamz-nd opened 7 months ago

Ridhamz-nd commented 7 months ago

Thank you for providing example implementations!

I was wondering what signal is sent to the docker container when spot training job are interrupted. Is it SIGKILL or SIGTERM with some grace period (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StopTrainingJob.html)?

I was looking to implement a signal handler which, on SIGTERM, saves the latest checkpoint to S3. That way, resume happens from the exact point in time. Is this possible? Do we need to account for the time it takes for the uploader service to upload the content of /opt/ml/checkpoints to the checkpoint_s3_uri?

Any guidelines on how to resume from the latest stop point is much appreciated

Ridhamz-nd commented 7 months ago

@eitansela please lmk if this is not the correct location to ask this question and I can close this issue :)

rst0git commented 7 months ago

I was looking to implement a signal handler which, on SIGTERM, saves the latest checkpoint to S3. That way, resume happens from the exact point in time. Is this possible?

@Ridhamz-nd What would happen if SIGKILL is used instead? You would also need to make sure that a checkpoint is created only when it is necessary, not every time when SIGTERM is used as this may introduce significant performance overhead.

Ridhamz-nd commented 7 months ago

@rst0git I don't think a signal handler can be attached to a SIGKILL signal (https://man7.org/linux/man-pages/man7/signal.7.html). Once a SIGKILL is sent, the process is terminated immediately. Based on the sagemaker docs, SIGTERM is only sent once with a grace period of 120s.

eitansela commented 7 months ago

You should save checkpoint to /opt/ml/checkpoints after each EPOC, and SageMaker takes care to copy it to checkpoint_s3_uri for you. It is not a matter of speed because if it is a long training job of few hours or few days, why a SIGTERM will help here? If a Spot goes down, you lose few minutes of training and resume back after you have a new Spot, from the last checkpoint.

Ridhamz-nd commented 6 months ago

So you are right in that I will only lose a few minutes of training if I'm training on one node. However, if I am training on p4d/p5 instances which have a > 20% interruption rate in most regions and if I'm doing multi node training, then if one node is reclaimed, the whole job needs to be paused. In this case, there can be too many interrupts.

Also, in general, its preferable to not lose training progress that a job may have made. Currently we try to counter the lose progress issue by checkpointing frequently but that also has a cost (especially for large models) so its much more convenient if we get some sort of a signal that tells us that our job is going to be interrupted. If SIGTERM is that signal based on the docs, then we can save and resume from same point.