checkpoint-restore / criu

Checkpoint/Restore tool
criu.org

Stateless GPU workloads #2326

Open rayrapetyan opened 9 months ago

rayrapetyan commented 9 months ago

It's well known that GPU workloads are generally not supported by CRIU, but what about cases where losing the GPU state is not a big deal? For example, a video streaming service doing live transcoding of a headless stream (so the GPU is used only for video encoding) may just crash on restore, be restarted, and continue to work as expected. How hard would it be to add support for these edge cases?

konpal-sharma commented 9 months ago

Well, it depends on various factors, such as:

  1. State preservation: CRIU primarily focuses on checkpointing user-space processes and may not have built-in support for capturing and restoring GPU state.
  2. Device-specific challenges: GPUs from different manufacturers may have unique architectures, drivers, and APIs. Creating a generic solution that works seamlessly across GPU devices could be complex.
  3. Kernel and driver support: GPU support often requires cooperation from the underlying Linux kernel and GPU drivers. Ensuring compatibility with different kernel versions and GPU drivers adds an extra layer of complexity.

P.S. Correct me if I'm wrong!

rayrapetyan commented 9 months ago

In the described scenario, I believe CRIU doesn't need to be concerned with preserving GPU or driver state. Its primary task should be to attach the /dev/dri devices during restoration and let the restored process handle the situation. The process will likely crash, but that is acceptable: it can be restarted, and the container should then resume working as intended.

rst0git commented 9 months ago

@rayrapetyan The following projects implement support for GPU workloads with CRIU:

rst0git commented 9 months ago

> e.g. a video streaming service doing a live transcoding of a headless stream (so GPU is used only for video encoding purposes) may just crash on restore and will be restarted and continue to work as expected?

Note that some video streaming workloads (e.g., restreamer) can be used without GPU acceleration, and checkpoint/restore of GPU state might not be necessary.

adrianreber commented 9 months ago

@rayrapetyan What you suggest can work, but I think the application needs to support it. Assuming CRIU supported ignoring the state of the GPU, the application would need to be able to recover from a disconnected GPU by re-establishing its connection to the GPU. The application would also need to regularly save the state of the GPU to be able to continue after restore. If you just restore the process without any awareness in the application, the application would probably hang or crash.
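A minimal sketch of what "recover from a disconnected GPU" could look like inside an application, assuming the application can detect that its device handle has become invalid. The `DeviceLost` exception and the `open_device` and `work` callables are hypothetical names used only for illustration; they are not part of CRIU or of any GPU API.

```python
import time


class DeviceLost(Exception):
    """Hypothetical error: the GPU handle is no longer valid (e.g. after restore)."""


def run_with_reconnect(open_device, work, max_retries=3, delay_s=0.0):
    """Run work(device); if the device is lost, reopen it and try again.

    open_device: callable returning a fresh device handle (assumption).
    work:        callable taking the handle; raises DeviceLost on a dead handle.
    """
    device = open_device()
    for attempt in range(max_retries + 1):
        try:
            return work(device)
        except DeviceLost:
            if attempt == max_retries:
                raise  # give up and let a supervisor restart the process
            time.sleep(delay_s)
            device = open_device()  # re-establish the connection to the GPU
```

If the retries are exhausted, the exception propagates, which matches the "hang or crash" outcome for an application that cannot recover on its own.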

Saving the state of the GPU regularly to be able to continue after restore means that you need both space and time to save that state. Time matters because if you save the state of the GPU too often, a lot of computation time might be lost waiting for the state to be saved. If you do not save it often enough, a lot of computation on the GPU needs to be done again after restore. There are a lot of papers on optimal checkpoint intervals, and they would also apply to this case.
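To make the trade-off concrete, here is a small sketch using Young's first-order approximation from the checkpointing literature, where the interval is sqrt(2 × save cost × mean time between interruptions). Treating a CRIU restore as the "interruption" is an assumption made for illustration, and the function name and example numbers are hypothetical.

```python
import math


def young_interval(save_cost_s, mean_time_between_interruptions_s):
    """Young's first-order approximation of a good checkpoint interval.

    save_cost_s: time needed to save the GPU state once, in seconds.
    mean_time_between_interruptions_s: average time between events that
        force a fallback to the last saved state, in seconds.
    """
    return math.sqrt(2.0 * save_cost_s * mean_time_between_interruptions_s)


# Example with made-up numbers: saving takes 2 s, one restore per hour.
interval = young_interval(2.0, 3600.0)  # 120.0 seconds
```

Saving much more often than this spends time on saves; saving much less often increases the work repeated after a restore.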

If you do not want to regularly save the state of the GPU, you could create some kind of signalling where CRIU tells the application that it will soon be checkpointed, but that opens the application up to misuse of the signalling mechanism. This has been discussed in connection with Java, but we never figured out a way for CRIU to signal an application about an upcoming checkpoint.

Overall, I think your proposal cannot work without awareness in the checkpointed application.

rayrapetyan commented 9 months ago

Looks like cricket is for Nvidia, and CRIU already supports AMD, but my target GPU is Intel :) (an iGPU built into the CPU). As per my research, it's the most affordable and effective solution for transcoding/encoding workloads. A typical setup inside a container is an application generating a media stream (e.g. a headless X11 server or Wayland compositor) and a GStreamer-based frame capturer/streamer. The latter uses the GPU for encoding (pure CPU encoders require more investment in hardware). It's OK for GStreamer to crash on restore: it will be restarted, and in the worst case a few streamed frames will be lost, which is not a big deal. If only CRIU allowed device mounts to be restored, I'm sure many setups would be able to handle that situation with minimal losses...

So my question is: how hard would it be to reattach devices to the restored container without performing a GPU state restoration? If someone could guide me through the key points, I could try to implement and test it on my side and then create a PR.
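The "crash and get restarted" behaviour described above could be provided by a small supervisor inside the container. The sketch below is one hypothetical way to do it; the function name, the retry limit, and the back-off are assumptions and say nothing about how CRIU would reattach the devices.

```python
import subprocess
import time


def supervise(cmd, max_restarts=5, backoff_s=1.0):
    """Run cmd and restart it whenever it exits with a non-zero status.

    Returns the number of restarts needed before a clean exit.
    Raises RuntimeError if the command keeps failing.
    """
    restarts = 0
    while True:
        returncode = subprocess.call(cmd)
        if returncode == 0:
            return restarts
        if restarts == max_restarts:
            raise RuntimeError("command failed %d times" % (restarts + 1))
        restarts += 1
        time.sleep(backoff_s)  # brief pause before starting the streamer again
```

Here `cmd` would be the streamer's command line as a list of arguments, so a streamer that dies after restore is simply started again with a fresh connection to the GPU.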
