NREL / buildstockbatch

Checkpoint batch job on cloud implementations #433

Open nmerket opened 4 months ago

nmerket commented 4 months ago

We run batch simulations on the spot market to save money, but spot jobs can be interrupted. Right now, when that happens we restart the interrupted job from scratch, which causes problems in ComStock, where building models are larger and jobs run longer. There is a way to get a warning before the interruption and checkpoint our work within a job. AWS has a blog post about it; the "inside a container on ECS" section is the most relevant. Basically, we catch the SIGTERM signal using Python's signal library and save our progress to S3. Then, when the job is retried, it checks for that progress and picks up where it left off.
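
A rough sketch of what that could look like, just to make the idea concrete. This is not buildstockbatch code: `CheckpointedJob`, `checkpoint_path`, `building_ids`, and `simulate` are all made-up names, and a local JSON file stands in for the S3 object so the example runs on its own. The handler only sets a flag, and the loop stops and saves after the simulation that is currently running.

```python
import json
import signal
from pathlib import Path


class CheckpointedJob:
    """Hypothetical helper: runs simulations in order and records finished ones."""

    def __init__(self, checkpoint_path):
        # Assumption: a local file stands in for an object in S3.
        self.checkpoint_path = Path(checkpoint_path)
        self.completed = self._load()
        self.interrupted = False

    def _load(self):
        # On a retry, pick up whatever the previous attempt saved.
        if self.checkpoint_path.exists():
            return set(json.loads(self.checkpoint_path.read_text()))
        return set()

    def save(self):
        # In a real job this would be an upload to S3 instead of a file write.
        self.checkpoint_path.write_text(json.dumps(sorted(self.completed)))

    def _on_sigterm(self, signum, frame):
        # Keep the handler tiny: set a flag and let the main loop react.
        self.interrupted = True

    def run(self, building_ids, simulate):
        """Run the remaining simulations; return True if every one is done."""
        signal.signal(signal.SIGTERM, self._on_sigterm)
        for building_id in building_ids:
            if building_id in self.completed:
                continue  # finished in an earlier attempt
            if self.interrupted:
                break  # SIGTERM arrived; stop before starting new work
            simulate(building_id)
            self.completed.add(building_id)
        self.save()
        return all(b in self.completed for b in building_ids)
```

On the first attempt, `CheckpointedJob("progress.json").run(ids, simulate)` would return `False` if a SIGTERM arrived partway through, and the retry would skip everything already listed in the checkpoint. One caveat with this flag-based version: it waits for the in-flight simulation to finish, so if a single simulation can outlast the grace period between SIGTERM and the hard kill, the handler would need to save and exit right away instead.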

cc @asparke2