checkpoint-restore / criu

Checkpoint/Restore tool
criu.org

Checkpoint of a new restored container fails #2379

Closed obsidian0215 closed 3 months ago

obsidian0215 commented 3 months ago

Description

Podman/CRIU fails to checkpoint a container that was restored with `--import` and `--name` (similar to https://github.com/containers/podman/issues/13672). How can I checkpoint the new container?

Steps to reproduce the issue:

  1. Create container: `podman run -d --name looper busybox /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'`
  2. Checkpoint container with `--export`: `podman container checkpoint --export ch1.tar.gz looper`
  3. Restore container checkpoint with `--import` and `--name`: `podman container restore --import ch1.tar.gz --name looper2`
  4. Checkpoint the new container: `podman container checkpoint looper2 --export ch2.tar.gz`
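
The four steps above can be collected into one script. This is only a sketch: the `run` helper, the `reproduce` function, and the `DRY_RUN` variable are hypothetical conveniences added for illustration, not Podman features. The `podman` commands themselves are the ones listed in the steps. By default the script only prints the commands.

```shell
#!/bin/sh
# Reproduction sketch for the steps above.
# DRY_RUN, run() and reproduce() are illustrative helpers, not Podman features.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command; execute it only when DRY_RUN=0.
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

reproduce() {
    # Step 1: start a container that prints a counter once per second.
    run podman run -d --name looper busybox /bin/sh -c \
        'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'
    # Step 2: checkpoint it and export the checkpoint archive.
    run podman container checkpoint --export ch1.tar.gz looper
    # Step 3: restore the archive as a new container named looper2.
    run podman container restore --import ch1.tar.gz --name looper2
    # Step 4: checkpoint the restored container (the step that fails here).
    run podman container checkpoint looper2 --export ch2.tar.gz
}
```

Calling `reproduce` prints the four commands; calling `DRY_RUN=0 reproduce` as root would actually execute them.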

Describe the results you received:

ERRO[0000] container is not destroyed                   
ERRO[0000] criu failed: type NOTIFY errno 0
log file: /var/lib/containers/storage/overlay-containers/868a180ab534c95b938e6cdc481f0df0ee6032ca47399deb5d458fea2628407d/userdata/dump.log 
Error: `/usr/bin/runc checkpoint --image-path /var/lib/containers/storage/overlay-containers/868a180ab534c95b938e6cdc481f0df0ee6032ca47399deb5d458fea2628407d/userdata/checkpoint --work-path /var/lib/containers/storage/overlay-containers/868a180ab534c95b938e6cdc481f0df0ee6032ca47399deb5d458fea2628407d/userdata 868a180ab534c95b938e6cdc481f0df0ee6032ca47399deb5d458fea2628407d` failed: exit status 1
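
The runc message only points at `dump.log`; the actual failure reason is in that file. A small sketch for pulling out the relevant lines, assuming the log marks failures as `Error (file:line): message`, which is the form CRIU uses in the log attached to this issue. The function name `criu_errors` is made up for this example.

```shell
#!/bin/sh
# criu_errors is an illustrative helper name, not a CRIU command.
# It prints only the lines of a CRIU log that contain "Error (".
criu_errors() {
    grep -F 'Error (' "$1"
}
```

Usage: `criu_errors <path to dump.log>`, with the path taken from the `log file:` line of the error output.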

Describe the results you expected: It's expected that the looper2 container creates a new checkpoint in ch2.tar.gz.

Additional information you deem important (e.g. issue happens only occasionally):

output of podman version:

Client:       Podman Engine
Version:      4.3.1
API Version:  4.3.1
Go Version:   go1.19.8
Built:        Thu Jan 1 08:00:00 1970
OS/Arch:      linux/amd64

output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.28.2
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon_2.1.6+ds1-1_amd64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.6, commit: unknown'
  cpuUtilization:
    idlePercent: 18.33
    systemPercent: 9.95
    userPercent: 71.73
  cpus: 8
  distribution:
    codename: bookworm
    distribution: debian
    version: "12"
  eventLogger: journald
  hostname: debian-obsidian
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 6.1.0-18-amd64
  linkmode: dynamic
  logDriver: journald
  memFree: 433684480
  memTotal: 8290725888
  networkBackend: netavark
  ociRuntime:
    name: runc
    package: Unknown
    path: /usr/sbin/runc
    version: |-
      runc version 1.1.12
      commit: v1.1.12-0-g51d5e946
      spec: 1.0.2-dev
      go: go1.20.13
      libseccomp: 2.5.4
  os: linux
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: true
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns_1.2.0-1_amd64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.4
  swapFree: 1005469696
  swapTotal: 1022357504
  uptime: 1h 27m 32.00s (Approximately 0.04 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries: {}
store:
  configFile: /usr/share/containers/storage.conf
  containerStore:
    number: 2
    paused: 0
    running: 0
    stopped: 2
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 40947412992
  graphRootUsed: 14612869120
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 1
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.3.1
  Built: 0
  BuiltTime: Thu Jan  1 08:00:00 1970
  GitCommit: ""
  GoVersion: go1.19.8
  Os: linux
  OsArch: linux/amd64
  Version: 4.3.1

output of uname -a:

Linux debian-obsidian 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux

CRIU logs and information:

CRIU full dump/restore logs:

```
(00.000000) Unable to get $HOME directory, local configuration file will not be used.
(00.000136) Version: 3.17.1 (gitid 0)
(00.000154) Running on debian-obsidian Linux 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64
(00.000160) Would overwrite RPC settings with values from /etc/criu/runc.conf
(00.000208) Loaded kdat cache from /run/criu.kdat
(00.000281) Hugetlb size 2 Mb is supported but cannot get dev's number
(00.000329) Hugetlb size 1024 Mb is supported but cannot get dev's number
(00.000456) ========================================
(00.000468) Dumping processes (pid: 42253)
(00.000473) ========================================
(00.000499) rlimit: RLIMIT_NOFILE unlimited for self
(00.000523) Running pre-dump scripts
(00.000529) RPC
(00.001023) irmap: Searching irmap cache in work dir
(00.001263) No irmap-cache image
(00.001277) irmap: Searching irmap cache in parent
(00.001296) No parent images directory provided
(00.001303) irmap: No irmap cache
(00.001352) cpu: x86_family 6 x86_vendor_id GenuineIntel x86_model_id 13th Gen Intel(R) Core(TM) i7-1370P
(00.001369) cpu: fpu: xfeatures_mask 0x205 xsave_size 2696 xsave_size_max 2696 xsaves_size 840
(00.001401) cpu: fpu: x87 floating point registers xstate_offsets 0 / 0 xstate_sizes 160 / 160
(00.001409) cpu: fpu: AVX registers xstate_offsets 576 / 576 xstate_sizes 256 / 256
(00.001415) cpu: fpu: Protection Keys User registers xstate_offsets 2688 / 832 xstate_sizes 8 / 8
(00.001421) cpu: fpu:1 fxsr:1 xsave:1 xsaveopt:1 xsavec:1 xgetbv1:1 xsaves:1
(00.001789) cg-prop: Parsing controller "cpu"
(00.001806) cg-prop: Strategy "replace"
(00.001815) cg-prop: Property "cpu.shares"
(00.001820) cg-prop: Property "cpu.cfs_period_us"
(00.001826) cg-prop: Property "cpu.cfs_quota_us"
(00.001831) cg-prop: Property "cpu.rt_period_us"
(00.001835) cg-prop: Property "cpu.rt_runtime_us"
(00.001840) cg-prop: Parsing controller "memory"
(00.001845) cg-prop: Strategy "replace"
(00.001849) cg-prop: Property "memory.limit_in_bytes"
(00.001854) cg-prop: Property "memory.memsw.limit_in_bytes"
(00.001858) cg-prop: Property "memory.swappiness"
(00.001863) cg-prop: Property "memory.soft_limit_in_bytes"
(00.001867) cg-prop: Property "memory.move_charge_at_immigrate"
(00.001872) cg-prop: Property "memory.oom_control"
(00.001876) cg-prop: Property "memory.use_hierarchy"
(00.001880) cg-prop: Property "memory.kmem.limit_in_bytes"
(00.001885) cg-prop: Property "memory.kmem.tcp.limit_in_bytes"
(00.001889) cg-prop: Parsing controller "cpuset"
(00.001894) cg-prop: Strategy "replace"
(00.001899) cg-prop: Property "cpuset.cpus"
(00.001903) cg-prop: Property "cpuset.mems"
(00.001907) cg-prop: Property "cpuset.memory_migrate"
(00.001912) cg-prop: Property "cpuset.cpu_exclusive"
(00.001916) cg-prop: Property "cpuset.mem_exclusive"
(00.001920) cg-prop: Property "cpuset.mem_hardwall"
(00.001925) cg-prop: Property "cpuset.memory_spread_page"
(00.001929) cg-prop: Property "cpuset.memory_spread_slab"
(00.001934) cg-prop: Property "cpuset.sched_load_balance"
(00.001938) cg-prop: Property "cpuset.sched_relax_domain_level"
(00.001943) cg-prop: Parsing controller "blkio"
(00.001947) cg-prop: Strategy "replace"
(00.001952) cg-prop: Property "blkio.weight"
(00.001957) cg-prop: Parsing controller "freezer"
(00.001961) cg-prop: Strategy "replace"
(00.001966) cg-prop: Parsing controller "perf_event"
(00.001970) cg-prop: Strategy "replace"
(00.001975) cg-prop: Parsing controller "net_cls"
(00.001980) cg-prop: Strategy "replace"
(00.001984) cg-prop: Property "net_cls.classid"
(00.001988) cg-prop: Parsing controller "net_prio"
(00.001993) cg-prop: Strategy "replace"
(00.001998) cg-prop: Property "net_prio.ifpriomap"
(00.002002) cg-prop: Parsing controller "pids"
(00.002007) cg-prop: Strategy "replace"
(00.002011) cg-prop: Property "pids.max"
(00.002015) cg-prop: Parsing controller "devices"
(00.002020) cg-prop: Strategy "replace"
(00.002024) cg-prop: Property "devices.list"
(00.002106) Preparing image inventory (version 1)
(00.002232) Add pid ns 1 pid 42312
(00.002262) Add net ns 2 pid 42312
(00.002285) Add ipc ns 3 pid 42312
(00.002307) Add uts ns 4 pid 42312
(00.002328) Add time ns 5 pid 42312
(00.002358) Add mnt ns 6 pid 42312
(00.002386) Add user ns 7 pid 42312
(00.002414) Add cgroup ns 8 pid 42312
(00.002421) cg: Dumping cgroups for 42312
(00.002459) cg: `- New css ID 1
(00.002465) cg: `- [] -> [/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-3717053b-cf26-4520-9ad4-9884aa06de13.scope] [0]
(00.002471) cg: Set 1 is criu one
(00.002534) Error (criu/seize.c:911): Neither a cgroupv1 (freezer.state) or cgroupv2 (cgroup.freeze) control file found.
(00.002576) Unlock network
(00.002596) Unfreezing tasks into 1
(00.002607) Unseizing 42253 into 1
(00.002621) Error (compel/src/lib/infect.c:356): Unable to detach from 42253: No such process
(00.002642) Error (criu/cr-dump.c:2053): Dumping FAILED.
```
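
The first error in this log comes from `criu/seize.c:911`: CRIU found neither `freezer.state` (cgroup v1) nor `cgroup.freeze` (cgroup v2) where it looked for a freeze control file. As a hedged sketch, the helpers below report which hierarchy a host uses and whether a given cgroup directory has such a file. The function names are hypothetical, and the `cgroup2fs` check assumes GNU `stat -f` reports that type for a unified v2 mount at `/sys/fs/cgroup`.

```shell
#!/bin/sh
# Illustrative helpers; the names are not from CRIU, runc, or Podman.

# Map the filesystem type of /sys/fs/cgroup to a cgroup version.
# cgroup2fs means a unified v2 hierarchy; anything else is treated as v1/hybrid.
cgroup_version_for_fstype() {
    case "$1" in
        cgroup2fs) echo v2 ;;
        *)         echo v1 ;;
    esac
}

# Report which freeze control file exists in a cgroup directory.
freeze_file_in() {
    if [ -e "$1/cgroup.freeze" ]; then
        echo cgroup.freeze
    elif [ -e "$1/freezer.state" ]; then
        echo freezer.state
    else
        echo none
    fi
}
```

On a live system this could be used as `cgroup_version_for_fstype "$(stat -fc %T /sys/fs/cgroup)"` and `freeze_file_in <cgroup directory of the container>`.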

Output of `criu --version`:

```
Version: 3.17.1
```

Output of `criu check --all`:

```
Looks good.
```

Output of `criu check --all` (CRIU 3.19):

```
Looks good but some kernel features are missing which, depending on your process tree, may cause dump or restore failure.
```

(It didn't print the details of the missing features?)

Additional environment details:

output of kernel config: kernel-config.txt

adrianreber commented 3 months ago

Works for me with Podman 4.9.3 and CRIU 3.19 on Fedora with cgroup v1.

There is a cgroup v2 patch for runc that has not made it into a release yet and that might be necessary on a v2 system (https://github.com/opencontainers/runc/pull/3546).

obsidian0215 commented 3 months ago

I tried it and it worked well with CRIU 3.18 + Podman 3.4.1 + runc 1.1.12 (with cgroup v1).

So for now the workaround is to use cgroup v1. (Will runc 1.2.0 release the patch for v2?)

adrianreber commented 3 months ago

> Will runc 1.2.0 release the patch for v2?

I don't know.

If your problem is solved, please close the ticket.

obsidian0215 commented 3 months ago

OK, thank you.