checkpoint-restore / criu

Checkpoint/Restore tool
criu.org
Other
2.79k stars 565 forks source link

Criu Restore failing with sudo #2269

Closed rerrabolu closed 9 months ago

rerrabolu commented 10 months ago

master@hpe-sjc2-03:Tmp$ criu --version Version: 3.18 GitID: v3.18-143-g4c845c885

Restore Cmd: sudo criu restore -vvv -o restore.log --shell-job --link-remap && echo "Restore OK"

The restore is applied to a two process tree involving a GPU (amdgpu plugin). Each process generates 3 IDs for the various VMA associated with each process. The restore goes through the restoration of each ID file. Criu framework encounters an error as given below before invoking the two processes to "resume'

(01.355179) 11592: Further fle=0x7f48c9bdc748, pid=11592 (01.355185) 11592: Error (criu/util.c:246): fd 0 already in use (called at criu/files.c:992) (01.355189) 11592: Error (criu/files.c:1177): Can't get fd=0, pid=11592 (01.355193) 11592: Error (criu/files.c:1215): Unable to open fd=0 id=0x11 (01.375880) 11591: Error (criu/cr-restore.c:1513): 11592 exited, status=1 (01.375984) Error (criu/cr-restore.c:2557): Restoring FAILED.

This error does not occur on another two process workload that generates only one ID file per process. The ID file refers to KFD device. In the failing case the two extra files refer to DRM devices.

Any suggestions on how to understand / debug this issue is appreciated

rerrabolu commented 10 months ago

The file at the root of the error trigger is said to be: tty[8801:17]

avagin commented 10 months ago

Could you attach the full criu log? It should help to figure out the issue.

avagin commented 10 months ago

I figured out where the problem is. I think we need to do something like: https://github.com/avagin/criu/commit/31b4997b628e4a5556332fd12a5b49864e52aab1

Unfortunately, I don't have hardware to test the patch. @fdavid-amd and @dayatsin-amd could you review and finalize the change? It is probably requires some minor fixes.

rerrabolu commented 10 months ago

Please find attached logs from the run that end up on "restore" failing

Let me know if the logs have enough information. I will take a look at the patch provided and see if I can try it out.

dayatsin-amd commented 10 months ago

Thanks for the patch @avagin

From what I remember there was something we had to do to restore the tty devices with the --inherit-fd option: https://criu.org/Inheriting_FDs_on_restore

@rerrabolu is also from AMD, will sync-up with up with him

avagin commented 10 months ago

@dayatsin-amd I looked at find_unused_fd one more time and I think I was trying to fix unrelated issue. We need to fix it too, but I don't think that it is triggered here. This issue looks more like a leaked file descriptor.

In the attached log, I found another error that looks suspicions:

(00.143629)   5116: plugin: `amdgpu_plugin' hook 3 -> 0x7f63b3970e91
(00.143633)   5116: amdgpu_plugin: Sairam: amdgpu_plugin_restore_file(), ID is: 29 Trying to open file: amdgpu-kfd-29.img
(00.143686)   5116: Error (amdgpu_plugin_util.c:116): amdgpu-kfd-29.img: Failed to open for read: No such file or directory
(00.143696)   5116: Sairam: amdgpu_plugin_drm_restore_file() Tid: 5116 Enter the PLUGIN FUNCTION with DRM File ID: 29
(00.143706)   5116: Sairam: amdgpu_plugin_drm_restore_file(), Restoring RenderD amdgpu-drm-29.img
(00.143726)   5116: amdgpu-drm-29.img:Opened file for read with size:133 
(00.143730)   5116: RenderD Image file size:133
(00.143742)   5116: Sairam: amdgpu_plugin_drm_restore_file(), FD sent to deserialize_device_state() is: 13 
(00.143747)   5116: Sairam: deserialize_device_state(), Number of BOs from Checkpoint: 2 
(00.143751)   5116: 

It seems a non-fatal error but you need to check that all file descriptors are closed properly. btw: I can't find amdgpu_plugin_drm_restore_file in the upstream criu.

rerrabolu commented 10 months ago

The error "(00.143686) 5116: Error (amdgpu_plugin_util.c:116): amdgpu-kfd-29.img: Failed to open for read: No such file or directory" can be ignored. It is a local error from trying to open a file whose name is constructed based on "id" passed into restore_file() method.

avagin commented 10 months ago

@rerrabolu could collect a strace log for criu restore?

You will need to run criu like this:

$ strace -fo strace.log -s 1024 criu restore ....

and then attach strace.log here.

rerrabolu commented 10 months ago

Attaching relevant logs as requested:

Dump Command: sudo criu dump -t "$pidVal" -j -vvv -o dump.log --ext-unix-sk --link-remap && echo "Checkpoint OK"

Restore Command: sudo strace -fo strace.log -s 1024 criu restore -vvv -o restore.log --shell-job --link-remap && echo "Restore OK"

Dump Log: ram_criu_dump_setsid_no_r4.log

Strace Log: strace.log

Log with Stderr: ram_criu_restore_inherit_fd-no_r4.log

rerrabolu commented 10 months ago

In addition I tried running the workload using setsid and inherit-fd option without luck

Launch Workload: setsid - - -

Restore Command: sudo criu restore -vvv -o restore.log --shell-job --link-remap --inherit-fd 'fd[1]:tty[8801:17]' && echo "Restore OK"

avagin commented 10 months ago

@rerrabolu oh, criu under strace has been broken. It failed in kerndat_has_ptrace_get_rseq_conf. Sorry for inconvenience. Can I ask you to run "criu restore" under "perf trace". It is similar with strace, but it doesn't depend on ptrace.

$ perf trace -o perf.log --syscalls criu restore ...
rerrabolu commented 10 months ago

Looks like "perf" needs linux-tools package. This will be a problem because the kernel build that has my driver changes is based on 6.0. and is a monolithic build. The apt repository does not show me a package for 6.0 that I could install for Ubuntu 20.04.

I am trying to install 5.15.0-46 and see if it helps. Will update if it works, doubt it

rerrabolu commented 10 months ago

I am going to print more debug data on the file/fd triggering the error. But what if any I can do with this information is not clear.

avagin commented 10 months ago

@rerrabolu I think a root cause of this issue is that we forgot to close one of file descriptors. Then when we started restoring process file descriptors, we found that the 0 fd is already occupied.

You can try to stop criu right after the error:

diff --git a/criu/util.c b/criu/util.c
index db96cf938..c8195a376 100644
--- a/criu/util.c
+++ b/criu/util.c
@@ -245,6 +245,7 @@ int reopen_fd_as_safe(char *file, int line, int new_fd, int old_fd, bool allow_r
                } else if (tmp != new_fd) {
                        close(tmp);
                        pr_err("fd %d already in use (called at %s:%d)\n", new_fd, file, line);
+                       sleep(100);
                        return -1;
                }

and then you will need to look at /proc/pid/fd of the target process to find out what file uses the 0 fd.

avagin commented 10 months ago

btw: I think I fixed the strace issue by https://github.com/checkpoint-restore/criu/pull/2273.

rerrabolu commented 10 months ago

Attaching log from CRIU framework showing relevant FDs that are involved in triggering failure. OpenFD-Restore.txt

dayatsin-amd commented 10 months ago

It is the /dmabuf files in the child process: lrwx------ 1 root root 64 Sep 30 01:16 0 -> /dmabuf:

The is new type of memory mapping that we are trying to support. The plugin is being extended to support this new type.

rerrabolu commented 10 months ago

What is not clear is who is creating these files "/dmabuf". I will update the ticket if we see these files when restoring a process tree that has only one process.

rerrabolu commented 10 months ago

Need some pointers on running "restore" via gdb.

Try to run "catch fork" and "set follow-fork-mode child" does not help. The only breakpoint I am able to hit (amdgpu_plugin_resume_devices_late) is one that is kind late. I am unable to hit the call to amdgpu_plugin_restore_file

dayatsin-amd commented 10 months ago

I think these /dmabuf files are created when you use libdrm to create memory mappings.

From what I remember, when restoring open files, at some point CRIU will reach this line:

tmp = fcntl(old_fd, F_DUPFD, new_fd);

https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/util.c#L238C4-L238C4

And this will re-adjust the value of the file-descriptor to match the values during checkpoint time. It looks like you are not reaching this section when restoring the /dmabuf files, so the /dmabuf files have values 0,1,2 on restore, instead of 5,6,7

rerrabolu commented 10 months ago

I am not sure how is creating the /dmabuf files.

Including below log from a workload that Passes

I introduced a sleep() call towards the end of dump_file() and the FDs of the process don't show any files that are /dmabuf kind

Attaching the log of the same: No-Dmabuf-Files-Passing-Workload.txt

rerrabolu commented 10 months ago

I checked if /dmabuf files show up during restore for some actions taken by the driver. To the extent I could test I don't see them.

Including below log from a workload that Passes

I introduced a sleep() call towards the end of restore_file() and the FDs of the process don't show any files that are /dmabuf kind

Attaching the log of the same: No-Dmabuf-Files-Restore-Passing-Workload.txt

rerrabolu commented 10 months ago

High-level design of the extension of amdgpu_plugin to support amdgpu_bo memory objects

Assume the system has 2 GPU devices. amdgpu_plugin is expected to support checkpointing and restoring of following device file handles /dev/kfd, /dev/dri/renderD128 and /dev/dri/renderD129.

Design prior to the Extension:

With the Extension:

Delta in the Design:

In the old design, upon dumping /dev/kfd file the plugin api amdgpu_plugin_dump_file() will invoke unpause() process. In the new design plugin's implementation will cache the FD of /dev/kfd and invoke unpause() only after /dev/dri/renderD* files have been dumped. I am assuming this is not a problem (checkpointing of a process tree with two or more processes is successful)

Pass/Fail History:

I have two workloads that PASS when the process tree has only one process

Workload 1:

  • Allocate BO on device D1 using KFD Ioctl
  • Import BO (memory) on device D1 using DRM Ioctl
  • Attach VA ranges on imported memory using DRM Ioctl

Workload 2:

  • Allocate BO on device D1 using KFD Ioctl
  • Import BO (memory) on device D2 using DRM Ioctl
  • Attach VA ranges on imported memory using DRM Ioctl

The workload that is FAILING has two processes. Process P1 (parent) creates the BOs using KFD Ioctl and lets process P2 (child) import them into a DRM device

github-actions[bot] commented 9 months ago

A friendly reminder that this issue had no activity for 30 days.

rerrabolu commented 9 months ago

The issue can be closed. A fix in the driver software (closing the Dmabuf FD) seems to solve this issue. The error message threw me off the place I was looking