Open rayrapetyan opened 9 months ago
Well it depends on various factors like-
P.S. Correct me if I'm wrong!
In the described scenario, I believe CRIU doesn't need to be concerned about preserving GPU state and/or drivers. Its primary task should be to attach /dev/dri devices during restoration and let the restored process handle the situation. The process will likely crash, but that is acceptable: it can be restarted, and the container should then resume working as intended.
@rayrapetyan The following projects implement support for GPU workloads with CRIU:
e.g. a video streaming service doing a live transcoding of a headless stream (so GPU is used only for video encoding purposes) may just crash on restore and will be restarted and continue to work as expected?
Note that some video streaming workloads (e.g., restreamer) can be used without GPU acceleration, and checkpoint/restore of GPU state might not be necessary.
@rayrapetyan What you suggest can work, but I think the application needs to support it. Assuming CRIU supported ignoring the state of the GPU, the application would need to be able to recover from a disconnected GPU and re-establish its connection to the GPU. The application would also need to regularly save the state of the GPU to be able to continue after restore. If you just restore the process without any awareness in the application, the application would probably hang or crash.
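To make "recover from a disconnected GPU" a bit more concrete, here is a minimal sketch of what application-side recovery could look like. Everything in it is hypothetical: `open_device` and `encode_frame` stand in for whatever the real encoder library provides, and the only assumption is that a lost device surfaces as an `OSError`.

```python
def encode_with_recovery(frames, encode_frame, open_device, max_reopens=3):
    """Encode frames, reopening the device when it disappears.

    open_device() and encode_frame(dev, frame) are hypothetical callables
    supplied by the application; the handle only needs a close() method.
    Returns (encoded_frames, number_of_reopens).
    """
    dev = open_device()
    reopens = 0
    encoded = []
    try:
        for frame in frames:
            while True:
                try:
                    encoded.append(encode_frame(dev, frame))
                    break
                except OSError:
                    # Assumption: a stale handle after restore raises OSError.
                    if reopens >= max_reopens:
                        raise
                    reopens += 1
                    dev.close()
                    dev = open_device()  # re-establish the connection, retry frame
    finally:
        dev.close()
    return encoded, reopens
```

Note that this only reconnects; any encoder state that lived on the GPU is gone and would have to be rebuilt or reloaded by the application.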
Saving the state of the GPU regularly to be able to continue after restore means that you would need both space and time to save the GPU state. Time is important to mention because if you save the state of the GPU too often, a lot of computation time might be lost waiting for the state to be saved. If you do not save it often enough, a lot of computation on the GPU needs to be done again after restore. There are a lot of papers on optimal checkpoint intervals, which would also apply to this case.
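As a rough illustration of that trade-off, one commonly cited first-order estimate from the checkpointing literature is Young's approximation, interval ≈ sqrt(2 · C · M), where C is the time to save one checkpoint and M is the mean time between interruptions. The function and parameter names below are mine, and treating a restore as the "interruption" is an assumption for this sketch.

```python
import math

def young_interval(save_cost_s: float, mean_time_between_interruptions_s: float) -> float:
    """Young's first-order estimate of the checkpoint interval, in seconds."""
    return math.sqrt(2.0 * save_cost_s * mean_time_between_interruptions_s)

# Example: saving takes 2 s and an interruption happens about once per hour.
interval = young_interval(2.0, 3600.0)
print(f"save roughly every {interval:.0f} s")
```

With those example numbers the estimate comes out to about 120 seconds between saves. It is only an approximation and assumes the save cost is small compared to the time between interruptions.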
If you do not want to regularly save the state of the GPU, you could create some kind of signalling where CRIU tells the application that it will soon be checkpointed, but that opens the application up to misuse of that signalling mechanism. This has been discussed in connection with Java, but we never figured out a way for CRIU to signal an application about an upcoming checkpoint.
Overall, I think that without awareness in the checkpointed application, your proposal cannot work.
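For illustration only, the application side of such a signalling scheme might look like the sketch below. It assumes that some external party sends SIGUSR1 shortly before the checkpoint; this is not an existing CRIU feature, and the class name and `save_state` callback are made up for the example.

```python
import signal

class CheckpointNotice:
    """Remembers that a hypothetical 'checkpoint soon' signal arrived."""

    def __init__(self, save_state):
        self.save_state = save_state  # application-provided callback
        self.pending = False

    def install(self, signum=signal.SIGUSR1):
        # Must be called from the main thread (a Python restriction).
        signal.signal(signum, self._on_signal)

    def _on_signal(self, signum, frame):
        # Keep the handler tiny; the real work happens in poll().
        self.pending = True

    def poll(self):
        """Call from the main loop; saves state once per received notice."""
        if not self.pending:
            return False
        self.pending = False
        self.save_state()
        return True
```

The misuse concern is visible here as well: anyone allowed to send the signal can force the application to spend time saving state.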
Looks like cricket is for Nvidia, and CRIU already supports AMD, but my target GPU is Intel :) (an iGPU built into the CPU). As per my research, it's the most affordable and effective solution for transcoding/encoding workloads. A typical setup inside a container is an application generating a media stream (e.g. a headless X11 server or Wayland compositor) and a GStreamer-based frame capturer/streamer. The latter uses the GPU for encoding (pure CPU encoders require more investment in hardware). It's OK for GStreamer to crash on restore: it will be restarted, and in the worst case a few streamed frames will be lost, which is not a big deal. If only CRIU allowed restoration of device mounts, I'm sure many setups would be able to handle that situation with minimal losses...
So my question is: how hard would it be to reattach devices to the restored container without performing a GPU state restoration? If someone could guide me through the key points, I could try to implement and test it on my side and then create a PR.
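The "crash on restore, then get restarted" behaviour described here can be sketched as a small supervisor loop. This is only an illustration: in a real container the restart would more likely come from the init system or the container runtime's restart policy, and the command passed in is whatever launches the streamer.

```python
import subprocess
import time

def supervise(cmd, max_restarts=5, backoff_s=1.0):
    """Run cmd, restarting it after each non-zero exit.

    Returns the number of restarts once cmd exits cleanly, and raises
    RuntimeError if it keeps failing after max_restarts restarts.
    """
    restarts = 0
    while True:
        rc = subprocess.call(cmd)
        if rc == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(f"{cmd!r} still failing after {restarts} restarts")
        restarts += 1
        time.sleep(backoff_s)  # brief pause so a crash loop does not spin
```

A restarted streamer starts from a fresh GPU context, which is why losing the old GPU state would be tolerable in this setup.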
It's well known that GPU workloads in general are not supported by CRIU, but what about cases where losing the GPU state is not a big deal? For example, a video streaming service doing live transcoding of a headless stream (so the GPU is used only for video encoding) may just crash on restore, be restarted, and continue to work as expected. How hard would it be to add support for these edge cases?