TraceMachina / nativelink

NativeLink is an open source, high-performance build cache and remote execution server, compatible with Bazel, Buck2, Reclient, and other RBE-compatible build systems. It offers drastically faster builds, reduced test flakiness, and access to specialized hardware.
https://nativelink.com
Apache License 2.0

[Idea] Checkpoint support #224

Open allada opened 1 year ago

allada commented 1 year ago

Today I heard an interesting use case. Sometimes users want to run processes that take a very long time, like training an ML model, and upload resumable checkpoints along the way, so that if the program is restarted it resumes from the last checkpoint.

Specific use case:

  1. Training program takes 3 days to run on a single GPU instance.
  2. The intermediate state can be quite large (100GB+), so uploads are slow.
  3. While the intermediate state is being uploaded, we want to keep the ML model training on the same GPU with the same state.
  4. If the task is terminated, turbo-cache should attempt to resume the process from the last saved state.
  5. A special ActionResult will be uploaded to the AC for the task with a last_state tag in the hash (maybe an environment variable?). This will allow actions to be run against whatever the most recent state in the action cache is, for example to run some heuristics on the latest model being trained (like TensorBoard).
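
To make step 5 a bit more concrete, here is a rough sketch of how a `last_state` tag could yield a separate, stable cache key for "the most recent checkpoint of this action". Everything in it is an assumption for illustration: `action_key` and `latest_state_key` are hypothetical helpers, and the JSON encoding is only a stand-in, since the real Remote Execution API digests serialized protobuf messages rather than JSON.

```python
import hashlib
import json


def action_key(argv, env):
    # Simplified stand-in for an action digest: SHA-256 over a canonical JSON
    # encoding of the command line and environment. This is NOT the real
    # Remote Execution API encoding, which hashes serialized protobufs.
    payload = json.dumps({"argv": argv, "env": env}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def latest_state_key(argv, env):
    # Hypothetical convention: adding a `last_state` environment variable
    # changes the hash, so the "latest checkpoint" entry lives under its own
    # key and never collides with the final ActionResult.
    tagged_env = dict(env, last_state="1")
    return action_key(argv, tagged_env)


argv = ["python", "train.py"]
env = {"CUDA_VISIBLE_DEVICES": "0"}
final_key = action_key(argv, env)
latest_key = latest_state_key(argv, env)
```

Because the tag is part of the hashed input, a follow-up action (a TensorBoard-style analysis, say) could look up `latest_key` without knowing how many checkpoints have been written so far.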

Obviously this would be very difficult to implement correctly. It would be great if we could just snapshot memory state and files, upload them, and allow the process to be resumed, but certain things like GPU drivers present issues. We could do this more easily by sending a special signal to the program, such as SIGUSR1, SIGUSR2, or SIGVTALRM; the program would then do whatever is needed to save the resume-state files to disk and inform the turbo-cache worker process that it is done. TurboCache would then upload the state and the special "latest" ActionCache result.
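
From the program's side, the signal-based handshake might look roughly like the sketch below. It is not a proposed implementation: the file names `checkpoint.json` and `checkpoint.done`, the marker file as the "I am done" notification, and the `on_step` hook (used only to simulate the worker's signal) are all assumptions. It uses SIGUSR1, so it is Unix-only.

```python
import json
import os
import signal
import tempfile

# Hypothetical file names; the proposal does not define an on-disk layout.
CHECKPOINT_FILE = "checkpoint.json"
DONE_MARKER = "checkpoint.done"

_pending = {"checkpoint": False}


def request_checkpoint(signum, frame):
    # Only record the request; the save happens at a safe point in the loop.
    _pending["checkpoint"] = True


def save_checkpoint(state, directory):
    # Write to a temp file and rename it, so a reader never sees a partial file.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, os.path.join(directory, CHECKPOINT_FILE))
    # Assumed notification mechanism: a marker naming the checkpointed step.
    with open(os.path.join(directory, DONE_MARKER), "w") as f:
        f.write(str(state["step"]))


def load_checkpoint(directory):
    path = os.path.join(directory, CHECKPOINT_FILE)
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def train(directory, total_steps, on_step=None):
    signal.signal(signal.SIGUSR1, request_checkpoint)
    state = load_checkpoint(directory)  # resume from the last saved state
    while state["step"] < total_steps:
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # placeholder for real training
        if on_step is not None:
            on_step(state["step"])  # demo hook, e.g. to simulate the worker
        if _pending["checkpoint"]:
            _pending["checkpoint"] = False
            save_checkpoint(state, directory)
    return state
```

In this sketch the worker would send `SIGUSR1` to the process, wait for the marker file to appear, and then upload the checkpoint. A restarted task calls `train` with the same directory and picks up from the saved step instead of step 0.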

This would obviously introduce non-deterministic behavior, but it would be a configured parameter on the worker, so only use cases that specifically request this functionality would be allowed to use it (i.e., opt in to non-determinism).

Projects that do similar stuff: https://github.com/checkpoint-restore/criu

blasten commented 1 year ago

This is really cool. Temporal handles a similar use case too: https://docs.temporal.io/temporal

blasten commented 1 year ago

criu is impressive. 🤯