Closed by rerrabolu 9 months ago
The file that appears to trigger the error is reported as: tty[8801:17]
Could you attach the full criu log? It should help to figure out the issue.
I figured out where the problem is. I think we need to do something like: https://github.com/avagin/criu/commit/31b4997b628e4a5556332fd12a5b49864e52aab1
Unfortunately, I don't have hardware to test the patch. @fdavid-amd and @dayatsin-amd, could you review and finalize the change? It probably requires some minor fixes.
Please find attached the logs from the run in which "restore" ends up failing:
Kfdtest-ProcMaps.txt
ram_criu_dump.log
ram_criu_restore.log
Let me know if the logs have enough information. I will take a look at the patch provided and see if I can try it out.
Thanks for the patch @avagin
From what I remember there was something we had to do to restore the tty devices with the --inherit-fd option: https://criu.org/Inheriting_FDs_on_restore
@rerrabolu is also from AMD; I will sync up with him.
@dayatsin-amd I looked at find_unused_fd one more time and I think I was trying to fix an unrelated issue. We need to fix it too, but I don't think it is triggered here. This issue looks more like a leaked file descriptor.
In the attached log, I found another error that looks suspicious:
(00.143629) 5116: plugin: `amdgpu_plugin' hook 3 -> 0x7f63b3970e91
(00.143633) 5116: amdgpu_plugin: Sairam: amdgpu_plugin_restore_file(), ID is: 29 Trying to open file: amdgpu-kfd-29.img
(00.143686) 5116: Error (amdgpu_plugin_util.c:116): amdgpu-kfd-29.img: Failed to open for read: No such file or directory
(00.143696) 5116: Sairam: amdgpu_plugin_drm_restore_file() Tid: 5116 Enter the PLUGIN FUNCTION with DRM File ID: 29
(00.143706) 5116: Sairam: amdgpu_plugin_drm_restore_file(), Restoring RenderD amdgpu-drm-29.img
(00.143726) 5116: amdgpu-drm-29.img:Opened file for read with size:133
(00.143730) 5116: RenderD Image file size:133
(00.143742) 5116: Sairam: amdgpu_plugin_drm_restore_file(), FD sent to deserialize_device_state() is: 13
(00.143747) 5116: Sairam: deserialize_device_state(), Number of BOs from Checkpoint: 2
(00.143751) 5116:
It seems to be a non-fatal error, but you need to check that all file descriptors are closed properly. btw: I can't find amdgpu_plugin_drm_restore_file in the upstream criu.
The error "(00.143686) 5116: Error (amdgpu_plugin_util.c:116): amdgpu-kfd-29.img: Failed to open for read: No such file or directory" can be ignored. It is a local error from trying to open a file whose name is constructed from the "id" passed into the restore_file() method.
@rerrabolu could you collect an strace log for criu restore?
You will need to run criu like this:
$ strace -fo strace.log -s 1024 criu restore ....
and then attach strace.log here.
Attaching relevant logs as requested:
Dump Command: sudo criu dump -t "$pidVal" -j -vvv -o dump.log --ext-unix-sk --link-remap && echo "Checkpoint OK"
Restore Command: sudo strace -fo strace.log -s 1024 criu restore -vvv -o restore.log --shell-job --link-remap && echo "Restore OK"
Dump Log: ram_criu_dump_setsid_no_r4.log
Strace Log: strace.log
Log with Stderr: ram_criu_restore_inherit_fd-no_r4.log
In addition, I tried running the workload using setsid and the --inherit-fd option, without luck.
Launch Workload: setsid - - -
Restore Command: sudo criu restore -vvv -o restore.log --shell-job --link-remap --inherit-fd 'fd[1]:tty[8801:17]' && echo "Restore OK"
@rerrabolu oh, running criu under strace is broken. It failed in kerndat_has_ptrace_get_rseq_conf. Sorry for the inconvenience. Can I ask you to run "criu restore" under "perf trace"? It is similar to strace, but it doesn't depend on ptrace.
$ perf trace -o perf.log --syscalls criu restore ...
Looks like "perf" needs the linux-tools package. This will be a problem because the kernel build that has my driver changes is based on 6.0 and is a monolithic build. The apt repository does not show a package for 6.0 that I could install on Ubuntu 20.04.
I am trying to install 5.15.0-46 to see if it helps. Will update if it works, though I doubt it.
I am going to print more debug data on the file/fd triggering the error, but it is not clear what, if anything, I can do with this information.
@rerrabolu I think the root cause of this issue is that we forgot to close one of the file descriptors. Then, when we started restoring the process file descriptors, we found that fd 0 was already occupied.
You can try to stop criu right after the error:
diff --git a/criu/util.c b/criu/util.c
index db96cf938..c8195a376 100644
--- a/criu/util.c
+++ b/criu/util.c
@@ -245,6 +245,7 @@ int reopen_fd_as_safe(char *file, int line, int new_fd, int old_fd, bool allow_r
} else if (tmp != new_fd) {
close(tmp);
pr_err("fd %d already in use (called at %s:%d)\n", new_fd, file, line);
+ sleep(100);
return -1;
}
and then you will need to look at /proc/<pid>/fd of the target process to find out what file is using fd 0.
btw: I think I fixed the strace issue by https://github.com/checkpoint-restore/criu/pull/2273.
Attaching a log from the CRIU framework showing the relevant FDs involved in triggering the failure: OpenFD-Restore.txt
It is the /dmabuf files in the child process: lrwx------ 1 root root 64 Sep 30 01:16 0 -> /dmabuf:
This is a new type of memory mapping that we are trying to support. The plugin is being extended to support this new type.
What is not clear is what is creating these "/dmabuf" files. I will update the ticket if we see these files when restoring a process tree that has only one process.
Need some pointers on running "restore" via gdb.
Trying "catch fork" and "set follow-fork-mode child" does not help. The only breakpoint I am able to hit (amdgpu_plugin_resume_devices_late) is one that is kind of late. I am unable to hit the call to amdgpu_plugin_restore_file.
I think these /dmabuf files are created when you use libdrm to create memory mappings.
From what I remember, when restoring open files, at some point CRIU will reach this line:
tmp = fcntl(old_fd, F_DUPFD, new_fd);
https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/util.c#L238C4-L238C4
And this will re-adjust the value of the file descriptor to match the value it had at checkpoint time. It looks like you are not reaching this section when restoring the /dmabuf files, so the /dmabuf files have values 0, 1, 2 on restore instead of 5, 6, 7.
I am not sure what is creating the /dmabuf files.
Including below a log from a workload that passes.
I introduced a sleep() call towards the end of dump_file(), and the FDs of the process don't show any files of the /dmabuf kind.
Attaching the log of the same: No-Dmabuf-Files-Passing-Workload.txt
I checked whether /dmabuf files show up during restore due to actions taken by the driver. To the extent I could test, I don't see them.
Including below a log from a workload that passes.
I introduced a sleep() call towards the end of restore_file(), and the FDs of the process don't show any files of the /dmabuf kind.
Attaching the log of the same: No-Dmabuf-Files-Restore-Passing-Workload.txt
High-level design of the extension of amdgpu_plugin to support amdgpu_bo memory objects
Assume the system has 2 GPU devices. amdgpu_plugin is expected to support checkpointing and restoring of the following device file handles: /dev/kfd, /dev/dri/renderD128, and /dev/dri/renderD129.
Design prior to the Extension:
With the Extension:
Delta in the Design:
In the old design, upon dumping the /dev/kfd file, the plugin API amdgpu_plugin_dump_file() would unpause the process. In the new design, the plugin caches the FD of /dev/kfd and invokes unpause() only after the /dev/dri/renderD* files have been dumped. I am assuming this is not a problem (checkpointing of a process tree with two or more processes is successful).
Pass/Fail History:
I have two workloads that PASS when the process tree has only one process
Workload 1:
- Allocate BO on device D1 using KFD Ioctl
- Import BO (memory) on device D1 using DRM Ioctl
- Attach VA ranges on imported memory using DRM Ioctl
Workload 2:
- Allocate BO on device D1 using KFD Ioctl
- Import BO (memory) on device D2 using DRM Ioctl
- Attach VA ranges on imported memory using DRM Ioctl
The workload that is FAILING has two processes: process P1 (the parent) creates the BOs using the KFD ioctl and lets process P2 (the child) import them into a DRM device.
A friendly reminder that this issue had no activity for 30 days.
The issue can be closed. A fix in the driver software (closing the dmabuf FD) seems to solve this issue. The error message threw me off from where I should have been looking.
master@hpe-sjc2-03:Tmp$ criu --version
Version: 3.18
GitID: v3.18-143-g4c845c885
Restore Cmd: sudo criu restore -vvv -o restore.log --shell-job --link-remap && echo "Restore OK"
The restore is applied to a two-process tree involving a GPU (amdgpu plugin). Each process generates 3 IDs for the various VMAs associated with it. The restore goes through the restoration of each ID file. The CRIU framework encounters the error given below before telling the two processes to "resume":
(01.355179) 11592: Further fle=0x7f48c9bdc748, pid=11592
(01.355185) 11592: Error (criu/util.c:246): fd 0 already in use (called at criu/files.c:992)
(01.355189) 11592: Error (criu/files.c:1177): Can't get fd=0, pid=11592
(01.355193) 11592: Error (criu/files.c:1215): Unable to open fd=0 id=0x11
(01.375880) 11591: Error (criu/cr-restore.c:1513): 11592 exited, status=1
(01.375984) Error (criu/cr-restore.c:2557): Restoring FAILED.
This error does not occur in another two-process workload that generates only one ID file per process. That ID file refers to the KFD device. In the failing case, the two extra files refer to DRM devices.
Any suggestions on how to understand / debug this issue are appreciated.