checkpoint-restore / criu

Checkpoint/Restore tool
criu.org

criu checkpoint container on containerd failed #2310

Closed loheagn closed 7 months ago

loheagn commented 7 months ago

Description

Steps to reproduce the issue:

I followed the instructions described in this page:

  1. Pull the image with `ctr image pull docker.io/library/redis:alpine`.
  2. Run the container with `ctr run -d docker.io/library/redis:alpine redis`.
  3. Checkpoint the container using CRIU with `ctr c checkpoint --rw --task redis checkpoint/redis:cr-1`.
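
For anyone who wants to replay these steps repeatedly (for example, before and after a CRIU upgrade), here is a minimal Python sketch. The function names `repro_commands` and `run_all` are hypothetical helpers, not part of containerd or CRIU; the `ctr` invocations themselves are exactly the three commands listed above.

```python
import subprocess

def repro_commands(image="docker.io/library/redis:alpine",
                   name="redis",
                   checkpoint_ref="checkpoint/redis:cr-1"):
    """Return the three ctr invocations from the steps above as argv lists.

    Hypothetical helper: only the ctr arguments come from the issue.
    """
    return [
        ["ctr", "image", "pull", image],
        ["ctr", "run", "-d", image, name],
        ["ctr", "c", "checkpoint", "--rw", "--task", name, checkpoint_ref],
    ]

def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        # check=True raises CalledProcessError on a non-zero exit status,
        # which is what happens at the checkpoint step in this report.
        subprocess.run(argv, check=True)

if __name__ == "__main__":
    # Only print the commands here; call run_all(repro_commands()) on a
    # host where containerd, runc and CRIU are installed.
    for argv in repro_commands():
        print(" ".join(argv))
```

Running `run_all(repro_commands())` typically needs root, since `ctr` talks to the containerd socket.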

Describe the results you received:

The checkpoint failed with the following error:

ctr: runc did not terminate successfully: exit status 1: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v2.task/default/redis/criu-dump.log: unknown

Describe the results you expected:

The checkpoint should succeed.

Additional information you deem important (e.g. issue happens only occasionally):

CRIU logs and information:

CRIU full dump/restore logs:

```
......
(00.011348) Dumping core (pid: 463959)
(00.011349) ----------------------------------------
(00.011350) Obtaining personality ...
(00.011363) Sent msg to daemon 64 0 0
(00.011364) Wait for ack 64 on daemon socket
pie: 1: __fetched msg: 64 0 0
pie: 1: __sent ack msg: 64 64 0
pie: 1: Daemon waits for command
(00.011388) Fetched ack: 64 64 0
(00.011402) 463959 has 0 sched policy
(00.011404) dumping 0 nice for 463959
(00.011407) dumping /proc/463959/loginuid
(00.011415) dumping /proc/463959/oom_score_adj
(00.011427) Sent msg to daemon 76 0 0
(00.011428) Wait for ack 76 on daemon socket
pie: 1: __fetched msg: 76 0 0
pie: 1: __sent ack msg: 76 76 0
pie: 1: Daemon waits for command
(00.011450) Fetched ack: 76 76 0
(00.011453) cg: Dumping cgroups for 463959
(00.011473) cg: `- New css ID 2
(00.011475) cg: `- [] -> [/default/redis] [0]
(00.011501) cg: adding cgroup /proc/self/fd/20/default/redis
(00.011508) cg: Couldn't open /proc/self/fd/20/default/redis/cgroup.clone_children. This cgroup property may not exist on this kernel
(00.011514) cg: Couldn't open /proc/self/fd/20/default/redis/notify_on_release. This cgroup property may not exist on this kernel
(00.011522) cg: Dumping value from /proc/self/fd/20/default/redis/cgroup.procs
(00.011525) cg: Couldn't open /proc/self/fd/20/default/redis/tasks. This cgroup property may not exist on this kernel
(00.011593) cg: Set 2 is root one
(00.011640) ----------------------------------------
(00.011644) Waiting for 463959 to trap
(00.011652) Daemon 463959 exited trapping
(00.011656) Sent msg to daemon 3 0 0
(00.011660) Force no-breakpoints restore
(00.011670) 463959 was trapped
(00.011672) 463959 (native) is going to execute the syscall 45, required is 15
(00.011680) 463959 was trapped
(00.011681) `- Expecting exit
(00.011688) 463959 was trapped
(00.011690) 463959 (native) is going to execute the syscall 186, required is 15
(00.011697) 463959 was trapped
(00.011698) `- Expecting exit
(00.011705) 463959 was trapped
(00.011707) 463959 (native) is going to execute the syscall 1, required is 15
pie: 1: __fetched msg: 3 0 0
(00.011715) 463959 was trapped
(00.011717) `- Expecting exit
(00.011725) 463959 was trapped
(00.011727) 463959 (native) is going to execute the syscall 186, required is 15
(00.011733) 463959 was trapped
(00.011734) `- Expecting exit
(00.011741) 463959 was trapped
(00.011742) 463959 (native) is going to execute the syscall 186, required is 15
(00.011749) 463959 was trapped
(00.011750) `- Expecting exit
(00.011758) 463959 was trapped
(00.011759) 463959 (native) is going to execute the syscall 1, required is 15
pie: 1: 1: new_sp=0x7f3714ad1fc8 ip 0x7f371507bf63
(00.011767) 463959 was trapped
(00.011769) `- Expecting exit
(00.011775) 463959 was trapped
(00.011777) 463959 (native) is going to execute the syscall 3, required is 15
(00.011787) 463959 was trapped
(00.011788) `- Expecting exit
(00.011795) 463959 was trapped
(00.011797) 463959 (native) is going to execute the syscall 3, required is 15
(00.011804) 463959 was trapped
(00.011805) `- Expecting exit
(00.011812) 463959 was trapped
(00.011814) 463959 (native) is going to execute the syscall 15, required is 15
(00.011821) 463959 was stopped
(00.011824)
(00.011825) Dumping core for thread (pid: 463985)
(00.011826) ----------------------------------------
(00.011827) Dumping general registers for 463985 in native mode
(00.011833) Dumping GP/FPU registers for 463985
(00.011874) Restoring GP/FPU registers for 463985
(00.011875) Error (compel/arch/x86/src/lib/infect.c:401): Can't set FPU registers for 463985: Bad address
(00.011880) Error (compel/src/lib/infect.c:1471): Parasite exited with -1
(00.011882) Error (criu/parasite-syscall.c:202): Can't init thread in parasite 463985
(00.011884) Error (criu/cr-dump.c:854): Can't dump thread for pid 463985
(00.011885) ----------------------------------------
(00.011886) Error (criu/cr-dump.c:1412): Can't dump threads
(00.011986) Unlock network
(00.011988) Running network-unlock scripts
(00.011989) RPC
(00.057078) Unfreezing tasks into 1
(00.057093) Unseizing 463959 into 1
(00.057141) Error (criu/cr-dump.c:1781): Dumping FAILED.
```
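
In a log like this, the failure is easiest to find by pulling out only the `Error (file:line): message` entries. Below is a small Python sketch that does this. It assumes the log file has one entry per line, as CRIU writes it on disk; the regular expression is inferred from the lines shown in this issue and is not an official CRIU log grammar. `extract_errors` and `SAMPLE` are names made up for this illustration.

```python
import re

# Matches entries such as:
# (00.011875) Error (compel/arch/x86/src/lib/infect.c:401): Can't set FPU ...
# Pattern inferred from the log in this issue (an assumption, not a spec).
ERROR_RE = re.compile(
    r"\((?P<ts>\d+\.\d+)\) Error \((?P<src>[^:()]+):(?P<line>\d+)\): (?P<msg>.*)"
)

def extract_errors(log_text):
    """Return (timestamp, source_file, line_number, message) per error entry."""
    errors = []
    for raw in log_text.splitlines():
        m = ERROR_RE.search(raw)
        if m:
            errors.append((m["ts"], m["src"], int(m["line"]), m["msg"].strip()))
    return errors

# A few lines copied from the log above.
SAMPLE = """\
(00.011874) Restoring GP/FPU registers for 463985
(00.011875) Error (compel/arch/x86/src/lib/infect.c:401): Can't set FPU registers for 463985: Bad address
(00.011880) Error (compel/src/lib/infect.c:1471): Parasite exited with -1
(00.011986) Unlock network
"""

if __name__ == "__main__":
    for ts, src, line, msg in extract_errors(SAMPLE):
        print(f"{ts} {src}:{line} {msg}")
```

The first entry returned is usually the root cause; here that is the `Can't set FPU registers` error from `compel/arch/x86/src/lib/infect.c:401`, and the later errors follow from it.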

Output of `criu --version`:

```
Version: 3.16.1
```

Output of `criu check --all`:

```
Looks good.
```

Additional environment details:

containerd version (output of `ctr version`):

Client:
  Version:  v1.7.10
  Revision: 4e1fe7492b9df85914c389d1f15a3ceedbb280ac
  Go version: go1.21.4

Server:
  Version:  v1.7.10
  Revision: 4e1fe7492b9df85914c389d1f15a3ceedbb280ac
  UUID: 66b55944-9270-456a-82a5-83004e9418ce

runc version:

runc version 1.1.10
commit: v1.1.10-0-g18a0cb0f
spec: 1.0.2-dev
go: go1.21.4
libseccomp: 2.5.1

loheagn commented 7 months ago

I upgraded CRIU to version 3.18 and everything works well.

$ criu --version
Version: 3.18
GitID: v3.18-183-g0da1ab257

I'm keeping this issue for people who encounter the same problem.

adrianreber commented 7 months ago

Looking at the parts of the log file you shared I see:

(00.011825) Dumping core for thread (pid: 463985)
(00.011826) ----------------------------------------
(00.011827) Dumping general registers for 463985 in native mode
(00.011833) Dumping GP/FPU registers for 463985
(00.011874) Restoring GP/FPU registers for 463985
(00.011875) Error (compel/arch/x86/src/lib/infect.c:401): Can't set FPU registers for 463985: Bad address
(00.011880) Error (compel/src/lib/infect.c:1471): Parasite exited with -1
(00.011882) Error (criu/parasite-syscall.c:202): Can't init thread in parasite 463985
(00.011884) Error (criu/cr-dump.c:854): Can't dump thread for pid 463985
(00.011885) ----------------------------------------
(00.011886) Error (criu/cr-dump.c:1412): Can't dump threads

Can you share the complete log so that we can see the CPU details? I'm not sure if this is related, but the latest version of CRIU (3.19) has a fix for newer Intel CPUs with larger xsave areas. Please try with 3.19 and include the complete log file.
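
A quick way to check whether an installed CRIU is older than a given release is to parse the `Version:` line of `criu --version`. The Python sketch below does that. `parse_criu_version` and `is_at_least` are hypothetical helper names, and the parsing assumes the `Version: X.Y[.Z]` format shown in this thread.

```python
import re

def parse_criu_version(output):
    """Extract the numeric version tuple from `criu --version` output.

    Assumes a line of the form 'Version: 3.16.1', as seen in this thread.
    """
    m = re.search(r"^Version:\s*(\d+(?:\.\d+)*)", output, re.MULTILINE)
    if m is None:
        raise ValueError("no 'Version:' line found")
    return tuple(int(part) for part in m.group(1).split("."))

def is_at_least(version, minimum):
    """Compare version tuples, padding with zeros so (3, 19) == (3, 19, 0)."""
    width = max(len(version), len(minimum))
    def pad(v):
        return v + (0,) * (width - len(v))
    return pad(version) >= pad(minimum)

if __name__ == "__main__":
    # On a real host the text could come from:
    #   subprocess.run(["criu", "--version"], capture_output=True, text=True).stdout
    reported = parse_criu_version("Version: 3.16.1\n")
    print(reported, is_at_least(reported, (3, 19)))
```

The minimum version is passed in as a parameter because this thread does not pin down which release first contains the relevant fix.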

CC: @0x7f454c46 as git says that you authored the code

adrianreber commented 7 months ago

> I upgraded CRIU to version 3.18 and everything works well.

I just suggested that. :wink:

loheagn commented 7 months ago

@adrianreber Yes! Thanks so much for the reply :)