d4l3k opened this issue 6 months ago
The issue seems to be specific to nvidiactl: some of the accesses aren't being cleared.
Before
~/D/torch-criu (main)> lsof -p 74739 | rg nvidia
pt_main_t 74739 rice mem CHR 195,0 940 /dev/nvidia0
pt_main_t 74739 rice mem CHR 195,255 938 /dev/nvidiactl
pt_main_t 74739 rice mem CHR 237,0 894 /dev/nvidia-uvm
pt_main_t 74739 rice mem REG 254,1 2078360 2973095 /usr/lib/libnvidia-ml.so.550.76
pt_main_t 74739 rice mem CHR 195,1 976 /dev/nvidia1
pt_main_t 74739 rice 8u CHR 195,255 0t0 938 /dev/nvidiactl
pt_main_t 74739 rice 9u CHR 237,0 0t0 894 /dev/nvidia-uvm
pt_main_t 74739 rice 10u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 11u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 12u CHR 195,1 0t0 976 /dev/nvidia1
pt_main_t 74739 rice 13u CHR 195,1 0t0 976 /dev/nvidia1
pt_main_t 74739 rice 14u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 15u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 16u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 17u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 19u CHR 195,255 0t0 938 /dev/nvidiactl
pt_main_t 74739 rice 20u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 21u CHR 195,1 0t0 976 /dev/nvidia1
pt_main_t 74739 rice 22u CHR 237,0 0t0 894 /dev/nvidia-uvm
pt_main_t 74739 rice 23r CHR 240,2 0t0 981 /dev/nvidia-caps/nvidia-cap2
pt_main_t 74739 rice 24u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 25u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 26u CHR 195,1 0t0 976 /dev/nvidia1
pt_main_t 74739 rice 27u CHR 195,1 0t0 976 /dev/nvidia1
pt_main_t 74739 rice 29u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 30u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 31u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 32u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 33u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 34u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 35u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 36u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 39u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 41u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 43u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 45u CHR 195,0 0t0 940 /dev/nvidia0
After
~/D/torch-criu (main)> lsof -p 74739 | rg nvidia
pt_main_t 74739 rice mem REG 254,1 2078360 2973095 /usr/lib/libnvidia-ml.so.550.76
pt_main_t 74739 rice 19u CHR 195,255 0t0 938 /dev/nvidiactl
pt_main_t 74739 rice 20u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 21u CHR 195,1 0t0 976 /dev/nvidia1
pt_main_t 74739 rice 22u CHR 237,0 0t0 894 /dev/nvidia-uvm
pt_main_t 74739 rice 23r CHR 240,2 0t0 981 /dev/nvidia-caps/nvidia-cap2
pt_main_t 74739 rice 24u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 25u CHR 195,0 0t0 940 /dev/nvidia0
pt_main_t 74739 rice 26u CHR 195,1 0t0 976 /dev/nvidia1
pt_main_t 74739 rice 27u CHR 195,1 0t0 976 /dev/nvidia1
I've also rebuilt PyTorch from source with CUDA 12.4.1 and cuDNN 8.9.7.29-1 and hit the same issue.
This seems to work with JAX, so there's something odd going on specifically with PyTorch.
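For anyone who wants to reproduce this check without lsof, here is a minimal sketch (mine, not from the original report) that lists which /dev/nvidia* file descriptors a process still holds by reading /proc/<pid>/fd. It only covers open fds; the mmap'd "mem" entries in the lsof output would need /proc/<pid>/maps instead, and it assumes you can read the target pid's /proc entries (same user or root).

```python
# Illustrative sketch (not from the report): list open /dev/nvidia* fds for a pid.
import os
import sys

def open_nvidia_fds(pid):
    fd_dir = f"/proc/{pid}/fd"
    for fd in sorted(os.listdir(fd_dir), key=int):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd was closed while we were scanning
        if "/dev/nvidia" in target:
            yield int(fd), target

if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for fd, target in open_nvidia_fds(pid):
        print(f"fd {fd}: {target}")
```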
I hacked around this by modifying criu to discard those FDs without errors -- it is able to checkpoint now, but I'm not sure how safe it is.
diff --git a/criu/files-ext.c b/criu/files-ext.c
index 95ec8e37c..2a150c546 100644
--- a/criu/files-ext.c
+++ b/criu/files-ext.c
@@ -90,7 +90,9 @@ int dump_unsupp_fd(struct fd_parms *p, int lfd, char *more, char *info, FdinfoEn
 	ret = do_dump_gen_file(p, lfd, &ext_dump_ops, e);
 	if (ret == 0)
 		return 0;
-	if (ret == -ENOTSUP)
+	if (ret == -ENOTSUP) {
 		pr_err("Can't dump file %d of that type [%o] (%s %s)\n", p->fd, p->stat.st_mode, more, info);
+		return 0;
+	}
 	return -1;
 }
diff --git a/criu/files.c b/criu/files.c
index 3b653e24b..2ea8ac3ef 100644
--- a/criu/files.c
+++ b/criu/files.c
@@ -847,7 +847,7 @@ int collect_fd(int pid, FdinfoEntry *e, struct rst_info *rst_info, bool fake)
 	fdesc = find_file_desc(e);
 	if (fdesc == NULL) {
 		pr_err("No file for fd %d id %#x\n", e->fd, e->id);
-		return -1;
+		return 0;
 	}
 
 	if (!collect_fd_to(pid, e, rst_info, fdesc, fake, false))
This is a known issue and we're working on fixing it!
This feature would be truly helpful! Could you please share whether there's a rough timeline or estimated date for it to be implemented?
@sgurfinkel any update on this?
@jesus-ramos is there a rough timeline on when PyTorch support will land?
I hacked around this by modifying criu to discard those FDs without errors -- it is able to checkpoint now, but I'm not sure how safe it is.
This works fine for fds tied to CUDA devices, but it struggles with PyTorch programs that use pinned memory, which is commonly used to speed up host-to-device transfers. It's still a bit far from being fully practical...
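For context, here is a minimal sketch of the kind of pinned-memory usage being referred to (my illustration, not code from the thread). Pinned buffers are page-locked and registered with the CUDA driver to enable async copies, which is presumably the extra driver state that trips up the dump. Assumes a CUDA-capable machine.

```python
# Illustrative sketch of typical pinned-memory usage in PyTorch (not from the thread).
import torch

# Page-locked host tensor; enables faster, asynchronous host<->device copies.
host_buf = torch.randn(1 << 20).pin_memory()
dev_buf = host_buf.to("cuda", non_blocking=True)
torch.cuda.synchronize()

# The common training-loop case: a DataLoader with pin_memory=True.
dataset = torch.utils.data.TensorDataset(torch.randn(256, 16))
loader = torch.utils.data.DataLoader(dataset, batch_size=32, pin_memory=True)
for (batch,) in loader:
    batch = batch.to("cuda", non_blocking=True)
torch.cuda.synchronize()
```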
Is this still on the roadmap?
Yes, it is!
@sgurfinkel Hi, can we know when the next version will be released? And will the code be open-sourced in the next release?
Hey @sgurfinkel, is this fixed on 565.57.01?
No, not quite yet!
Thanks for the update anyway :) A Google engineer indicated to us that it may be fixed in the latest driver.
@sgurfinkel would NVIDIA be up for doing a community-focused video meeting for this project? I'm thinking of something similar to what the AWS Firecracker team did for planning NVIDIA GPU support.
We (at modal.com) are very excited about this technology, but it's hard to adopt it with little visibility into the system or roadmap :)
I'll look into the video meeting, but I do otherwise have an update! Single-process PyTorch support is planned to be released in early 2025!
@sgurfinkel when you say "single-process" does that mean things like NCCL won't be supported?
Yes, that's right. CUDA IPC support won't be present in the early 2025 release.
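To make that scope concrete, here is a minimal sketch (my example, not from the thread) of the multi-process pattern that would remain unsupported: passing a CUDA tensor between processes with torch.multiprocessing, which relies on CUDA IPC handles under the hood. Assumes a CUDA-capable machine.

```python
# Illustrative sketch (not from the thread): sharing a CUDA tensor across processes
# with torch.multiprocessing goes through CUDA IPC handles, which is the kind of
# state excluded from the planned single-process support.
import torch
import torch.multiprocessing as mp

def consumer(q):
    t = q.get()  # tensor is rebuilt in this process from a CUDA IPC handle
    print("consumer sees sum:", t.sum().item())

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    q = mp.Queue()
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    shared = torch.ones(8, device="cuda")
    q.put(shared)  # exported via CUDA IPC under the hood
    p.join()       # keep `shared` alive until the consumer is done with it
```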
I just tried this out with PyTorch and it seems to work for the CUDA state, but I'm hitting issues with criu when saving the parent process. It seems like the issue is with saving the nvidia driver state in criu. Are there any plans to expand support for this with criu for common ML frameworks?

There's no longer an active CUDA process after toggling, but the process still seems to have access to an nvidia device. The file that failed to save seems to be nvidia.

Test script