GPU checkpointing should work now. I have added an example workload, `sdk/examples/demo/cr.py`, which loads a GPT-2 model using `on_start`. As per the previous PR (#509), you can try either `serve` or `deploy`.
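For reference, here is a minimal sketch of the shape such a workload can take, assuming the beta9 `endpoint` decorator with an `on_start` hook whose return value is exposed via `context.on_start_value`. The decorator arguments, GPU type, and model-loading details are illustrative and may differ from the actual `cr.py`:

```python
# Hypothetical sketch only; the real cr.py may differ.
from beta9 import Image, endpoint


def load_model():
    # Runs once when the container starts; the resulting GPU state is
    # exactly what the checkpoint/restore path has to capture.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").to("cuda")
    return tokenizer, model


@endpoint(
    gpu="T4",  # illustrative GPU type
    image=Image(python_packages=["torch", "transformers"]),
    on_start=load_model,
)
def generate(context, prompt: str = "Hello"):
    tokenizer, model = context.on_start_value
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=20)
    return {"text": tokenizer.decode(outputs[0], skip_special_tokens=True)}
```

Because the model is resident in GPU memory after `on_start` completes, this is a useful smoke test for whether the checkpoint captures and restores GPU state correctly under both `serve` and `deploy`.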
Changes made to make this possible, for now:
- Updated the worker and runner images to Ubuntu 22.04. Our GPU binaries are currently not compatible with older glibc; we are still working on this.
- Updated the worker image to CUDA 12.4.
- Updated the worker image to include cuDNN, which increases the image size. Note that this won't be needed in a future version of cedana, as our architecture will be based directly on the Driver API; that change is already being tested.
Existing bugs:
- The synchronization logic in `sdk/src/beta9/runner/endpoint.py` is failing.
Future improvements:
- C/R managed fully from the worker, with no trace of cedana or CRIU in the runner.
- Driver API update, removing the need to pass the `--cuda` version to cedana, as long as the NVIDIA driver on the host is compatible.