ToviHe opened this issue 1 week ago
I'll try to reproduce this to see what could be going wrong.
A couple of things to try: make sure cuda-checkpoint is run as root, pass all devices through Docker instead of a subset, and, if possible, try an r555 driver. There are some bugs with partial device passthrough, so on r555 you will have to pass all devices through to the container or restores will fail.
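For the passthrough part, assuming the container is started with a plain docker run, that means replacing any per-device selection with the all-devices flag, along these lines (the image name and container name below are placeholders):

```bash
# Pass every GPU through to the container instead of a single device.
docker run --gpus all --name my-container my-image
```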
Thank you for your reply. The user running cuda-checkpoint here is root. How exactly do I do the "pass all devices through Docker instead of a subset" and r555 driver parts you mentioned?
You can use the "--gpus all" flag instead. For the latest driver version, you can either use your distribution's package manager to get the latest available or download it directly from the NVIDIA website.
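If it helps, you can confirm which driver version is currently loaded with nvidia-smi:

```bash
# Print the installed NVIDIA driver version (one line per GPU).
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```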
We are only assigned a single card on this server, so device=3 is the only device we can specify. As for upgrading the driver, other services are running on the server, so for the time being we can only go up to version 550. Is there any other way?
I use the image provided by the LLaMA-Factory framework to run the codegeex4-all-9b model. The command is as follows:
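Roughly along these lines; the image tag, container name, and mounted paths below are placeholders, and only device 3 is passed through:

```bash
# Start the LLaMA-Factory container with only GPU 3 passed through.
# Image tag, container name, and host paths are placeholders.
docker run -d \
  --gpus '"device=3"' \
  --name llamafactory \
  -v /path/to/models:/app/models \
  -p 8000:8000 \
  llamafactory:latest
```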
Once the container has started, I enter it and start the model's API service. The command is as follows:
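Something like the following; the model path and template name here are placeholders, not the exact values:

```bash
# Inside the container: expose an OpenAI-style API with LLaMA-Factory.
# Model path and template are placeholders.
API_PORT=8000 llamafactory-cli api \
  --model_name_or_path /app/models/codegeex4-all-9b \
  --template codegeex4
```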
The model service started successfully and can serve external requests normally.
With everything ready, I use cuda-checkpoint to try to freeze and thaw the GPU process. The command is as follows (run on the host, not inside the container):
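Assuming the standard cuda-checkpoint CLI, the freeze is a toggle on the host-namespace PID of the process holding the CUDA context (the pgrep lookup below is a placeholder):

```bash
# PID must be the host-side PID of the model process.
PID=$(pgrep -f llamafactory-cli | head -n1)
cuda-checkpoint --toggle --pid $PID
```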
The command executed successfully, and nvidia-smi showed that no process was occupying card 3 anymore. I then tried to restore the environment with the following command:
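That is, toggling the same PID a second time:

```bash
# A second toggle on the same PID is supposed to restore the CUDA state to the GPU.
cuda-checkpoint --toggle --pid $PID
```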
The restore command blocked and never returned.
At this point, the process log inside the container is as follows:
At this point, the process information on the host is as follows:
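For reference, the state and kernel wait channel of the blocked restore can be captured with something like:

```bash
# D in STAT means uninterruptible sleep; WCHAN shows the kernel function it is blocked in.
ps -o pid,stat,wchan:32,cmd -p $PID
```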
Can you help me figure out what is causing this, and what I need to do for the restore to succeed?