NVIDIA / cuda-checkpoint

CUDA checkpoint and restore utility

cuda-checkpoint toggle program does not respond when restoring #17

Open ToviHe opened 1 week ago

ToviHe commented 1 week ago

I use the image provided by the Llama-factory framework to run the codegeex4-all-9b model. The command is as follows:

docker run --gpus '"device=3"' --ipc=host --ulimit memlock=-1 -itd -p 37860:7860 -p 38000:8000 -v /data/model:/data/model llama-factory:20240710 bash

Once the container is up, I enter it and start the model's API service. The command is as follows:

llamafactory-cli api --model_name_or_path /data/model/codegeex4-all-9b --template codegeex4

The model service starts successfully and serves external requests normally.

With everything ready, I use cuda-checkpoint to try to freeze and thaw the GPU process. The command is as follows (run on the host, not inside the container):

./cuda-checkpoint --toggle --pid 93264

The command executed successfully, and nvidia-smi confirmed that no processes remained on card 3. I then tried to restore the process with the following command:

./cuda-checkpoint --toggle --pid 93264

This time the command blocked and never returned. (screenshot)

At this point, the process log inside the container looks like this: (screenshot)

And the process information on the host looks like this: (screenshot)

Can you help me figure out what is causing this, and what I need to do to make the restore succeed?
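For reference, the freeze/thaw cycle described above can be sketched as a small host-side script. The PID and binary path are placeholders taken from this report (adjust them to your setup), and per the discussion below the tool typically needs to run as root; the same --toggle flag both suspends and restores the process:

```shell
#!/bin/sh
# Sketch of the suspend/resume cycle from the report above.
# PID 93264 and ./cuda-checkpoint are placeholders from this issue.
PID="${1:-93264}"
CKPT="${CKPT:-./cuda-checkpoint}"

toggle() {
    # Print the command, then run it if the binary is present.
    echo "+ $CKPT --toggle --pid $PID"
    if [ -x "$CKPT" ]; then
        "$CKPT" --toggle --pid "$PID"
    fi
}

toggle   # first toggle: suspend, CUDA state leaves the GPU
toggle   # second toggle: resume, state is restored to the GPU
```

Between the two toggles you can confirm the suspend took effect with nvidia-smi, as the reporter did: the process should no longer appear in the compute-apps list.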

jesus-ramos commented 1 week ago

I'll see about trying to repro this to see what could be going wrong.

A couple of things to try: make sure that cuda-checkpoint is run as root, try passing all devices through docker instead of a subset, and, if possible, try an r555 driver. There are some bugs with partial device passthrough, so on r555 you will have to pass all devices through to the container or restores will fail.

ToviHe commented 1 week ago

> I'll see about trying to repro this to see what could be going wrong.
>
> A couple of things to try is make sure that cuda-checkpoint is run as root, try passing all devices through docker instead of a subset, and you can also try an r555 driver if possible. There are some bugs with partial device passthrough so for r555 you will have to pass all devices through to the container or restores will fail.

Thank you for your reply. cuda-checkpoint is already being run as root here. Could you explain concretely how to "pass all devices through docker instead of a subset" and how to try the r555 driver?

jesus-ramos commented 1 week ago

You can use the "--gpus all" flag instead. For the latest driver version, you can either use your distribution's package manager to get the latest available, or download it directly from the NVIDIA website.
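Applied to the original command from this report, the "--gpus all" suggestion would look roughly like this (image name, ports, and mounts are copied from the report above; adjust to your environment):

```shell
# Hypothetical rework of the reporter's docker run, passing every GPU
# through with --gpus all instead of restricting to device=3.
docker run --gpus all --ipc=host --ulimit memlock=-1 -itd \
  -p 37860:7860 -p 38000:8000 \
  -v /data/model:/data/model \
  llama-factory:20240710 bash
```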

ToviHe commented 1 week ago

> You can use the "--gpus all" flag instead, for the latest driver version you can either use your distributions package manager to get the latest available or download directly from the nvidia website.

Our current GPU allocation includes only one card, so we can only specify device=3. As for upgrading the driver, other services are running on this server, so for now we can only go up to version 550. Is there any other way?