NVIDIA / cuda-checkpoint

CUDA checkpoint and restore utility

pytorch support #4

Open d4l3k opened 6 months ago

d4l3k commented 6 months ago

I just tried this out on PyTorch and it seems to work for the CUDA state, but I'm hitting issues with criu when dumping the parent process. It seems like criu fails when it tries to dump the open NVIDIA driver device files.

Are there any plans to expand support for this with criu for common ML frameworks?

~/D/torch-criu (main)> third-party/cuda-checkpoint/bin/x86_64_Linux/cuda-checkpoint --toggle --pid 125704
~/D/torch-criu (main)> sudo criu dump --shell-job --images-dir demo --tree 125704
Error (criu/files-ext.c:94): Can't dump file 19 of that type [20666] (chr 195:255)
Error (criu/cr-dump.c:1669): Dump files (pid: 125704) failed with -1
Error (criu/cr-dump.c:2093): Dumping FAILED.

After toggling there is no longer an active CUDA process, but the process still seems to hold an NVIDIA device open.

~/D/torch-criu (main) [1]> nvidia-smi
Mon Apr 29 17:11:21 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:06:00.0 Off |                  N/A |
|  0%   32C    P8             16W /  350W |       5MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:07:00.0 Off |                  N/A |
|  0%   34C    P8             21W /  350W |       5MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The file that failed to dump seems to be an NVIDIA device: character-device major number 195 belongs to the nvidia driver.

~/D/torch-criu (main)> grep 195 /proc/devices 
195 nvidia
195 nvidia-modeset
195 nvidiactl
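
For anyone who wants to repeat this lookup programmatically, here is a minimal Python sketch that maps the character-device major number from the criu error (the 195 in "chr 195:255") to the driver names registered in /proc/devices. The function name char_device_names is made up for illustration, and the sample text only mirrors the grep output above (the "mem" and "device-mapper" rows are filler); on a real machine you would read /proc/devices instead.

```python
def char_device_names(proc_devices_text, major):
    """Return driver names registered for a character-device major number.

    Expects text in the layout of /proc/devices: a "Character devices:"
    section followed by a "Block devices:" section.
    """
    names = []
    in_char_section = False
    for line in proc_devices_text.splitlines():
        line = line.strip()
        if line == "Character devices:":
            in_char_section = True
            continue
        if line == "Block devices:":
            in_char_section = False
            continue
        parts = line.split(None, 1)
        if in_char_section and len(parts) == 2 and parts[0].isdigit():
            if int(parts[0]) == major:
                names.append(parts[1])
    return names


# Illustrative sample; replace with open("/proc/devices").read() on Linux.
sample = """Character devices:
  1 mem
195 nvidia
195 nvidia-modeset
195 nvidiactl

Block devices:
254 device-mapper
"""

print(char_device_names(sample, 195))
```

Only character devices are considered, so a block device that happens to share the same major number is not reported.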

Test script

import time
import os
import torch

device = torch.device("cuda")

a = torch.tensor(10, device=device)

print(os.getpid())
time.sleep(1000)

~/D/torch-criu (main)> criu -V
Version: 3.18
GitID: v3.18
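
The two commands from the report can be wrapped in a small helper so the sequence is easy to rerun. This is only a sketch: the function names (toggle_cmd, dump_cmd, checkpoint) are hypothetical, the flags are copied from the invocations above, and the second toggle on failure assumes that a failed criu dump leaves the process running and that toggling again resumes its CUDA state.

```python
import subprocess


def toggle_cmd(cuda_checkpoint, pid):
    # Same invocation as in the report: toggles the CUDA state of one process.
    return [cuda_checkpoint, "--toggle", "--pid", str(pid)]


def dump_cmd(pid, images_dir):
    # Same flags as the failing `criu dump` in the report.
    return ["sudo", "criu", "dump", "--shell-job",
            "--images-dir", images_dir, "--tree", str(pid)]


def checkpoint(cuda_checkpoint, pid, images_dir, run=subprocess.run):
    """Toggle the CUDA state off, then dump the process tree with criu."""
    run(toggle_cmd(cuda_checkpoint, pid), check=True)
    try:
        run(dump_cmd(pid, images_dir), check=True)
    except subprocess.CalledProcessError:
        # Assumption: the process is still alive after a failed dump,
        # so toggle once more to bring its CUDA state back.
        run(toggle_cmd(cuda_checkpoint, pid), check=True)
        raise


# Example (not executed here):
# checkpoint("third-party/cuda-checkpoint/bin/x86_64_Linux/cuda-checkpoint",
#            125704, "demo")
```

The run parameter exists only so the command sequence can be inspected without actually invoking cuda-checkpoint or criu.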
d4l3k commented 6 months ago

The issue seems to be specifically with nvidiactl. Some of the NVIDIA file descriptors aren't released by the toggle (compare the lsof output before and after).

Before

~/D/torch-criu (main)> lsof -p 74739 | rg nvidia
pt_main_t 74739 rice mem       CHR              195,0                940 /dev/nvidia0
pt_main_t 74739 rice mem       CHR            195,255                938 /dev/nvidiactl
pt_main_t 74739 rice mem       CHR              237,0                894 /dev/nvidia-uvm
pt_main_t 74739 rice mem       REG              254,1   2078360  2973095 /usr/lib/libnvidia-ml.so.550.76
pt_main_t 74739 rice mem       CHR              195,1                976 /dev/nvidia1
pt_main_t 74739 rice   8u      CHR            195,255       0t0      938 /dev/nvidiactl
pt_main_t 74739 rice   9u      CHR              237,0       0t0      894 /dev/nvidia-uvm
pt_main_t 74739 rice  10u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  11u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  12u      CHR              195,1       0t0      976 /dev/nvidia1
pt_main_t 74739 rice  13u      CHR              195,1       0t0      976 /dev/nvidia1
pt_main_t 74739 rice  14u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  15u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  16u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  17u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  19u      CHR            195,255       0t0      938 /dev/nvidiactl
pt_main_t 74739 rice  20u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  21u      CHR              195,1       0t0      976 /dev/nvidia1
pt_main_t 74739 rice  22u      CHR              237,0       0t0      894 /dev/nvidia-uvm
pt_main_t 74739 rice  23r      CHR              240,2       0t0      981 /dev/nvidia-caps/nvidia-cap2
pt_main_t 74739 rice  24u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  25u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  26u      CHR              195,1       0t0      976 /dev/nvidia1
pt_main_t 74739 rice  27u      CHR              195,1       0t0      976 /dev/nvidia1
pt_main_t 74739 rice  29u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  30u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  31u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  32u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  33u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  34u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  35u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  36u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  39u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  41u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  43u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  45u      CHR              195,0       0t0      940 /dev/nvidia0

After

~/D/torch-criu (main)> lsof -p 74739 | rg nvidia
pt_main_t 74739 rice mem       REG              254,1   2078360  2973095 /usr/lib/libnvidia-ml.so.550.76
pt_main_t 74739 rice  19u      CHR            195,255       0t0      938 /dev/nvidiactl
pt_main_t 74739 rice  20u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  21u      CHR              195,1       0t0      976 /dev/nvidia1
pt_main_t 74739 rice  22u      CHR              237,0       0t0      894 /dev/nvidia-uvm
pt_main_t 74739 rice  23r      CHR              240,2       0t0      981 /dev/nvidia-caps/nvidia-cap2
pt_main_t 74739 rice  24u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  25u      CHR              195,0       0t0      940 /dev/nvidia0
pt_main_t 74739 rice  26u      CHR              195,1       0t0      976 /dev/nvidia1
pt_main_t 74739 rice  27u      CHR              195,1       0t0      976 /dev/nvidia1
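
The same before/after comparison can be done without lsof by reading /proc/<pid>/fd. The sketch below is an illustration, not part of cuda-checkpoint or criu: nvidia_fds and read_fd_targets are made-up names, the sample dictionary is a hand-copied subset of the "After" listing, and the /dev/pts/3 entry is invented to show that unrelated descriptors are filtered out. Memory mappings (the "mem" rows in lsof) do not appear in /proc/<pid>/fd, so this only covers open descriptors.

```python
import os


def nvidia_fds(fd_targets):
    """Return {fd: path} for descriptors that point at /dev/nvidia* nodes."""
    return {fd: path for fd, path in sorted(fd_targets.items())
            if path.startswith("/dev/nvidia")}


def read_fd_targets(pid):
    """Resolve every symlink in /proc/<pid>/fd (Linux only, needs permission)."""
    fd_dir = f"/proc/{pid}/fd"
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            pass  # descriptor was closed while we were iterating
    return targets


# Subset of the "After" listing; fd 1 is a hypothetical unrelated descriptor.
after = {
    1: "/dev/pts/3",
    19: "/dev/nvidiactl",
    20: "/dev/nvidia0",
    22: "/dev/nvidia-uvm",
    23: "/dev/nvidia-caps/nvidia-cap2",
}
print(nvidia_fds(after))

# On a live system: print(nvidia_fds(read_fd_targets(74739)))
```

Any descriptor this reports after the toggle is one that criu will still encounter when it walks the process's file table.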
d4l3k commented 6 months ago

I've also rebuilt PyTorch from source with CUDA 12.4.1 and cuDNN 8.9.7.29-1 and hit the same issue.

d4l3k commented 6 months ago

This seems to work on JAX, so there's something odd going on with PyTorch.

d4l3k commented 6 months ago

I hacked around this by modifying criu to discard those FDs without errors. It is able to checkpoint now, but I'm not sure how safe it is.

diff --git a/criu/files-ext.c b/criu/files-ext.c
index 95ec8e37c..2a150c546 100644
--- a/criu/files-ext.c
+++ b/criu/files-ext.c
@@ -90,7 +90,9 @@ int dump_unsupp_fd(struct fd_parms *p, int lfd, char *more, char *info, FdinfoEn
     ret = do_dump_gen_file(p, lfd, &ext_dump_ops, e);
     if (ret == 0)
         return 0;
-    if (ret == -ENOTSUP)
+    if (ret == -ENOTSUP) {
         pr_err("Can't dump file %d of that type [%o] (%s %s)\n", p->fd, p->stat.st_mode, more, info);
+        return 0;
+    }
     return -1;
 }
diff --git a/criu/files.c b/criu/files.c
index 3b653e24b..2ea8ac3ef 100644
--- a/criu/files.c
+++ b/criu/files.c
@@ -847,7 +847,7 @@ int collect_fd(int pid, FdinfoEntry *e, struct rst_info *rst_info, bool fake)
     fdesc = find_file_desc(e);
     if (fdesc == NULL) {
         pr_err("No file for fd %d id %#x\n", e->fd, e->id);
-        return -1;
+        return 0;
     }
     if (!collect_fd_to(pid, e, rst_info, fdesc, fake, false))
sgurfinkel commented 6 months ago

This is a known issue and we're working on fixing it!

ZingLix commented 6 months ago

This feature is truly helpful! Could you please share if there's a rough timeline or estimated date for this feature to be implemented?

ethxnp commented 3 months ago

@sgurfinkel any update on this?

thundergolfer commented 2 months ago

@jesus-ramos is there a rough timeline on when PyTorch support will land?

913887524gsd commented 1 month ago

d4l3k's criu patch above (returning 0 instead of failing in dump_unsupp_fd and collect_fd) works fine for fds tied to CUDA devices, but it struggles with PyTorch programs that use pinned memory, which is commonly used to speed up data transfers between host and GPU. It's still a bit far from being fully practical...

gflarity commented 3 weeks ago

Is this still on the roadmap?

sgurfinkel commented 3 weeks ago

Yes, it is!

lianghao208 commented 3 weeks ago

@sgurfinkel Hi, can you tell us when the next version will be released? And will the code be open source in the next release?

thundergolfer commented 1 week ago

Hey @sgurfinkel is this fixed on 565.57.01?

sgurfinkel commented 1 week ago

No, not quite yet!

thundergolfer commented 1 week ago

Thanks for the update anyways :) A Google engineer indicated to us that it may be fixed on the latest driver.

@sgurfinkel would NVIDIA be up for doing a community-focused video meeting for this project? I'm thinking of something similar to what the AWS Firecracker team did for planning NVIDIA GPU support.

We (at modal.com) are very excited about this technology but it's hard to adopt it with little visibility into the system or roadmap :)

sgurfinkel commented 1 week ago

I'll look into the video meeting, but I do otherwise have an update! Single-process PyTorch support is planned for release in early 2025!

d4l3k commented 4 days ago

@sgurfinkel when you say "single-process" does that mean things like NCCL won't be supported?

sgurfinkel commented 4 days ago

Yes, that's right. CUDA IPC support won't be present in the early 2025 release.