checkpoint-restore / criu

Checkpoint/Restore tool
criu.org

criu does not restore process start time #2504

Open hanwen-flow opened 3 weeks ago

hanwen-flow commented 3 weeks ago

Description

The process start time (entry 22 in /proc/$PID/stat) is not restored faithfully.

This is a problem because the software I'm trying to checkpoint/restore has a client that uses the PID plus the start time to check that the server hasn't been replaced from under it.
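For illustration, a check of this kind could look like the bash sketch below. It is not the actual client code; the function names `stat_starttime` and `proc_identity` are made up for this example. The one subtle point is that field 2 of /proc/$PID/stat (the command name) is wrapped in parentheses and may itself contain spaces or `)`, so the sketch strips everything up to the last `) ` before counting fields.

```shell
# Hypothetical helpers (names invented for this sketch), bash syntax.

# Print field 22 (start time, in clock ticks since boot) from one line of
# /proc/PID/stat passed as the first argument.
stat_starttime() {
    local line=$1 rest
    local -a fields
    # Drop "pid (comm) ": remove the longest prefix ending in ") ", so a
    # comm containing spaces or ')' cannot shift the field positions.
    rest=${line##*') '}
    read -r -a fields <<< "$rest"
    # The remainder starts at field 3, so field 22 is array index 19.
    echo "${fields[19]}"
}

# Print "PID:STARTTIME", a token that changes if the PID is reused by
# another process (and, per this issue, also after a CRIU restore).
proc_identity() {
    local pid=$1 line
    line=$(cat "/proc/$pid/stat") || return 1
    echo "$pid:$(stat_starttime "$line")"
}
```

A client would record `proc_identity "$server_pid"` once and compare it against a fresh call later; a mismatch is treated as "the server was replaced", which is exactly what goes wrong after a restore.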

$ docker run -d --name looper ubuntu:latest /bin/bash -c 'i=0; p=$BASHPID; while true; do echo -n  "$i "; cat /proc/$p/stat| awk "{print \$22;}"; i=$(expr $i + 1); sleep 1; done'
562b85dcf5cf086495b8f39ee1c18e88083f4c48327d568fad88d09d3f059040

$ docker logs looper
0 138623
1 138623
2 138623
3 138623
4 138623
5 138623

$ docker checkpoint create looper cp1
cp1
$ docker container start --checkpoint cp1  looper
$ docker logs looper
...
14 138623
15 138623
16 138623
17 138623
18 140372
19 140372
20 140372
21 140372
22 140372
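To put the jump in the log above into perspective: field 22 is expressed in clock ticks since boot, so the difference between the two values can be converted to seconds using `getconf CLK_TCK`. The helper name `ticks_to_seconds` below is made up for this sketch, and the value 100 mentioned afterwards is an assumption about the tick rate, not something measured on this machine.

```shell
# Hypothetical helper (name invented for this sketch), bash syntax.
# Convert a tick count to seconds with two decimals, integer arithmetic only.
ticks_to_seconds() {
    local ticks=$1 hz=$2
    local cs=$(( ticks * 100 / hz ))   # centiseconds
    printf '%d.%02d\n' $(( cs / 100 )) $(( cs % 100 ))
}

before=138623   # start time reported before the checkpoint
after=140372    # start time reported after the restore
hz=$(getconf CLK_TCK)

echo "delta: $(( after - before )) ticks"
echo "seconds: $(ticks_to_seconds $(( after - before )) "$hz")"
```

The delta is 1749 ticks. If CLK_TCK is 100 (a common value on Linux, assumed here), that is 17.49 seconds, which is consistent with the restored process being a newly created task whose start time was taken at restore.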

version info:

$ docker version
Client: Docker Engine - Community
 Version:           27.3.1
 API version:       1.47
 Go version:        go1.22.7
 Git commit:        ce12230
 Built:             Fri Sep 20 11:41:00 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.3.1
  API version:      1.47 (minimum version 1.24)
  Go version:       go1.22.7
  Git commit:       41ca978
  Built:            Fri Sep 20 11:41:00 2024
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          1.7.22
  GitCommit:        7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
 runc:
  Version:          1.1.14
  GitCommit:        v1.1.14-0-g2c9f560
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

$ criu --version
Version: 3.18
GitID: v3.18-320-gdfb56eed6

$ uname -a 
Linux hanwen-flow 6.8.0-47-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Oct  2 16:16:55 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

$ sudo criu check --all
sudo: mon_handle_sigchld: waitpid: No child processes
Looks good.

avagin commented 3 weeks ago

It is a known issue. Someone has to introduce a kernel interface for that.

avagin commented 3 weeks ago

@kolyshkin wrote a fun article about that a few years ago: https://medium.com/@kolyshkin/oracle-in-a-docker-container-checkpoint-restore-debug-fun-dda98b7302ed

Snorch commented 2 weeks ago

We can try doing the same as is done in OpenVZ:

https://github.com/OpenVZ/vzkernel/commit/2602bde7a34fff55753e8568527c2260c4939c30
https://github.com/OpenVZ/criu/commit/df8fec2c9250f6c3330eb529eca39d7d2ca5219b

Sadly, it relies heavily on the "ve" container object, which is OpenVZ-specific, but it would probably not be too hard to rework it into a time-namespace-based approach instead.