checkpoint-restore / criu

Checkpoint/Restore tool
criu.org
Other
2.88k stars 583 forks source link

Send-Q isn't 0 after restore #1910

Open mrc1119 opened 2 years ago

mrc1119 commented 2 years ago

Description When I recovered a Redis master shard, it got stuck while synchronizing data with the slave shard. There is no problem in the restore process, but when comparing with the normally started Redis cluster, it is found to be stuck, so I suspect that the problem is caused by CRIU, please help to check whether the problem is related to CRIU.

Steps to reproduce the issue: 1.checkpoint: ./criu dump --ghost-limit=500M -D /export/sine/dumpdata -j -R -t $pid --tcp-close --skip-in-flight --external mnt[/etc/hosts]:hostsmount --external mnt[/etc/hostname]:hostnamemount --external mnt[/etc/resolv.conf]:reslovemount --external mnt[/var/log/mydb]:mydbmount -v4 -o /export/sine/sinedump.log

2.restore: ./criu restore --manage-cgroup=ignore --ghost-limit=500M -D /export/sine/dumpdata -j -d --tcp-close --root /export/Data/docker/vfs/dir/${containerid} --external mnt[hostsmount]:/etc/hosts --external mnt[hostnamemount]:/etc/hostname --external mnt[reslovemount]:/etc/resolv.conf --external mnt[mydbmount]:/var/log/mydb -v4 -o /export/sine/sinerestore.log

3.Master-slave synchronization: Data is transmitted through TCP

Describe the results you received: Data synchronization failed, "11.22.33.44" Indicates the IP address of the slave shard.

Additional information you deem important (e.g. issue happens only occasionally): I used Strace to observe the data synchronization process.

strace logs:

``` 13:24:21.663141 epoll_wait(4, [{EPOLLIN, {u32=7, u64=7}}], 10128, 100) = 1 13:24:21.718779 accept(7, {sa_family=AF_INET, sin_port=htons(11510), sin_addr=inet_addr("11.22.33.44")}, [16]) = 11 13:24:21.718864 fcntl(11, F_GETFL) = 0x2 (flags O_RDWR) 13:24:21.718944 fcntl(11, F_SETFL, O_RDWR|O_NONBLOCK) = 0 13:24:21.719011 setsockopt(11, SOL_TCP, TCP_NODELAY, [1], 4) = 0 13:24:21.719071 epoll_ctl(4, EPOLL_CTL_ADD, 11, {EPOLLIN, {u32=11, u64=11}}) = 0 13:24:21.719132 getpeername(11, {sa_family=AF_INET, sin_port=htons(11510), sin_addr=inet_addr("11.22.33.44")}, [16]) = 0 13:24:21.719194 accept(7, 0x7ffe892470c0, 0x7ffe892470bc) = -1 EAGAIN (Resource temporarily unavailable) 13:24:21.719295 epoll_wait(4, [{EPOLLIN, {u32=11, u64=11}}], 10128, 44) = 1 13:24:21.719373 read(11, "Ping\r\n", 16384) = 6 13:24:21.719438 epoll_ctl(4, EPOLL_CTL_MOD, 11, {EPOLLIN|EPOLLOUT, {u32=11, u64=11}}) = 0 13:24:21.719496 epoll_wait(4, [{EPOLLOUT, {u32=11, u64=11}}], 10128, 44) = 1 13:24:21.719552 write(11, "-NOAUTH Authentication required."..., 34) = 34 13:24:21.719626 epoll_ctl(4, EPOLL_CTL_MOD, 11, {EPOLLIN, {u32=11, u64=11}}) = 0 13:24:21.719682 epoll_wait(4, [{EPOLLIN, {u32=11, u64=11}}], 10128, 44) = 1 13:24:21.719809 read(11, "Auth 2822712401687866775\r\n", 16384) = 26 13:24:21.719874 epoll_ctl(4, EPOLL_CTL_MOD, 11, {EPOLLIN|EPOLLOUT, {u32=11, u64=11}}) = 0 13:24:21.719932 epoll_wait(4, [{EPOLLOUT, {u32=11, u64=11}}], 10128, 44) = 1 13:24:21.719987 write(11, "+OK\r\n", 5) = 5 13:24:21.720055 epoll_ctl(4, EPOLL_CTL_MOD, 11, {EPOLLIN, {u32=11, u64=11}}) = 0 13:24:21.720110 epoll_wait(4, [{EPOLLIN, {u32=11, u64=11}}], 10128, 43) = 1 13:24:21.720220 read(11, "Slots info\r\n", 16384) = 12 13:24:21.720288 epoll_ctl(4, EPOLL_CTL_MOD, 11, {EPOLLIN|EPOLLOUT, {u32=11, u64=11}}) = 0 13:24:21.720345 epoll_wait(4, [{EPOLLOUT, {u32=11, u64=11}}], 10128, 43) = 1 13:24:21.720400 write(11, "$15\r\nslot 8192-16383\r\n", 22) = 22 13:24:21.720467 epoll_ctl(4, EPOLL_CTL_MOD, 11, {EPOLLIN, {u32=11, u64=11}}) = 0 13:24:21.720522 epoll_wait(4, [{EPOLLIN, {u32=11, u64=11}}], 10128, 43) = 1 13:24:21.721125 read(11, "ReplConf listening-port 6591 ver"..., 16384) = 50 13:24:21.721195 epoll_ctl(4, EPOLL_CTL_MOD, 11, {EPOLLIN|EPOLLOUT, {u32=11, u64=11}}) = 0 13:24:21.721264 epoll_wait(4, [{EPOLLOUT, {u32=11, u64=11}}], 10128, 42) = 1 13:24:21.721319 write(11, "+OK\r\n", 5) = 5 13:24:21.721387 epoll_ctl(4, EPOLL_CTL_MOD, 11, {EPOLLIN, {u32=11, u64=11}}) = 0 13:24:21.721442 epoll_wait(4, [{EPOLLIN, {u32=11, u64=11}}], 10128, 42) = 1 13:24:21.721553 read(11, "SyncProto\r\n", 16384) = 11 13:24:21.721614 epoll_ctl(4, EPOLL_CTL_MOD, 11, {EPOLLIN|EPOLLOUT, {u32=11, u64=11}}) = 0 13:24:21.721670 epoll_wait(4, [{EPOLLOUT, {u32=11, u64=11}}], 10128, 42) = 1 13:24:21.721725 write(11, "+OK\r\n", 5) = 5 13:24:21.721793 epoll_ctl(4, EPOLL_CTL_MOD, 11, {EPOLLIN, {u32=11, u64=11}}) = 0 13:24:21.721849 epoll_wait(4, [{EPOLLIN, {u32=11, u64=11}}], 10128, 42) = 1 13:24:21.721953 read(11, "replsync 3269 14eff32f5c6bd489cc"..., 16384) = 66 13:24:21.722042 write(3, "[2022-06-02 13:24:21.722] NOTICE"..., 120) = 120 13:24:21.722136 write(3, "[2022-06-02 13:24:21.722] NOTICE"..., 174) = 174 13:24:21.722206 write(3, "[2022-06-02 13:24:21.722] NOTICE"..., 250) = 250 13:24:21.722285 write(11, "+FULLRESYNC 84d7193e987226acf221"..., 69) = 69 13:24:21.722372 write(3, "[2022-06-02 13:24:21.722] NOTICE"..., 136) = 136 13:24:21.722440 epoll_ctl(4, EPOLL_CTL_MOD, 11, {EPOLLIN, {u32=11, u64=11}}) = 0 13:24:21.722494 epoll_ctl(4, EPOLL_CTL_MOD, 11, {EPOLLIN|EPOLLOUT, {u32=11, u64=11}}) = 0 13:24:21.722556 write(3, "[2022-06-02 13:24:21.722] NOTICE"..., 149) = 149 13:24:21.722626 epoll_wait(4, [{EPOLLOUT, {u32=11, u64=11}}], 10128, 41) = 1 13:24:21.722763 write(11, "*4\r\n$11\r\nsyncThrough\r\n$1\r\n7\r\n$4\r"..., 2453) = 2453 13:24:21.722842 epoll_wait(4, [{EPOLLOUT, {u32=11, u64=11}}], 10128, 41) = 1 13:24:21.722935 write(11, "*7\r\n$8\r\ntransmit\r\n$2\r\nbj\r\n$1\r\n1\r"..., 2250) = 2250 13:24:21.723017 epoll_wait(4, [{EPOLLOUT, {u32=11, u64=11}}], 10128, 40) = 1 13:24:21.723114 write(11, "*7\r\n$8\r\ntransmit\r\n$2\r\nbj\r\n$1\r\n1\r"..., 2650) = 2650 13:24:21.723188 epoll_wait(4, [{EPOLLOUT, {u32=11, u64=11}}], 10128, 40) = 1 13:24:21.723979 futex(0x7feea820738c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7feea8207388, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 13:24:21.724047 futex(0x7feea82073f8, FUTEX_WAKE_PRIVATE, 1) = 1 13:24:21.724125 write(11, "*7\r\n$8\r\ntransmit\r\n$2\r\nbj\r\n$1\r\n1\r"..., 75) = 75 13:24:21.724197 write(11, "\2B\10@p6_{36888}-unlxosfbgcruwohqa"..., 59175) = 35040 13:24:21.724307 write(11, "@p6_{36888}-unlxosfbgcruwohqanut"..., 24135) = -1 EAGAIN (Resource temporarily unavailable) 13:24:21.724366 epoll_wait(4, [{EPOLLOUT, {u32=11, u64=11}}], 10128, 39) = 1 13:24:21.724798 write(11, "@p6_{36888}-unlxosfbgcruwohqanut"..., 24135) = 24135 13:24:21.724908 futex(0x7feea820738c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7feea8207388, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 13:24:21.724969 futex(0x7feea82073f8, FUTEX_WAKE_PRIVATE, 1) = 1 13:24:21.725027 write(11, "\r\n*7\r\n$8\r\ntransmit\r\n$2\r\nbj\r\n$1\r\n"..., 2652) = -1 EAGAIN (Resource temporarily unavailable) 13:24:21.725089 epoll_wait(4, [{EPOLLOUT, {u32=11, u64=11}}], 10128, 38) = 1 13:24:21.734488 write(11, "\r\n*7\r\n$8\r\ntransmit\r\n$2\r\nbj\r\n$1\r\n"..., 5012) = 5012 13:24:21.734572 epoll_wait(4, [{EPOLLOUT, {u32=11, u64=11}}], 10128, 29) = 1 13:24:21.735334 futex(0x7feea820738c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7feea8207388, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 13:24:21.735394 futex(0x7feea82073f8, FUTEX_WAKE_PRIVATE, 1) = 1 13:24:21.735504 write(11, "*7\r\n$8\r\ntransmit\r\n$2\r\nbj\r\n$1\r\n1\r"..., 75) = 75 13:24:21.735579 write(11, "\2B\10@p5_{26913}-ctebptkvciftuiwrl"..., 59175) = 59175 13:24:21.735650 futex(0x7feea820738c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7feea8207388, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 13:24:21.735708 futex(0x7feea82073f8, FUTEX_WAKE_PRIVATE, 1) = 1 13:24:21.735765 write(11, "\r\n", 2) = 2 13:24:21.764197 futex(0x7feea820738c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7feea8207388, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 13:24:21.764265 futex(0x7feea82073f8, FUTEX_WAKE_PRIVATE, 1) = 1 ```

At the same time, I observed through Netstat that the send queue was piling up

netstat logs:

``` # netstat -anp | grep 31161 tcp 0 59177 [serverip]:6642 11.22.33.44:11510 ESTABLISHED 31161/redis-server ```

CRIU logs and information: restore log:https://github.com/mrc1119/criu-log/blob/main/restore-0606.log

Output of `criu --version`:

``` Version: 3.17 ```

Output of `criu check --all`:

``` Warn (criu/net.c:3435): Unable to get tun network namespace Warn (criu/sk-unix.c:224): unix: Unable to open a socket file: Bad address Warn (criu/net.c:3435): Unable to get socket network namespace Warn (criu/kerndat.c:1336): Can't get pidfd Warn (criu/kerndat.c:1453): CRIU was built without libnftables support Warn (criu/kerndat.c:1117): Can't keep kdat cache on non-tempfs Error (criu/cr-check.c:748): Kernel doesn't support PTRACE_O_SUSPEND_SECCOMP Error (criu/cr-check.c:793): Dumping seccomp filters not supported: Input/output error Error (criu/cr-check.c:1022): cgroupns not supported. This is not fatal. Warn (criu/cr-check.c:1242): Do not have API to map vDSO - will use mremap() to restore vDSO Warn (criu/cr-check.c:1231): clone3() with set_tid not supported Error (criu/cr-check.c:1273): Time namespaces are not supported Error (criu/cr-check.c:1283): IFLA_NEW_IFINDEX isn't supported Warn (criu/cr-check.c:1300): Pidfd store requires pidfd_open syscall which is not supported Warn (criu/cr-check.c:1334): Nftables based locking requires libnftables and set concatenations support Warn (criu/cr-check.c:804): ptrace(PTRACE_GET_RSEQ_CONFIGURATION) isn't supported. C/R of processes which are using rseq() won't work. Warn (criu/cr-check.c:1160): compat_cr is not supported. Requires kernel >= v4.12 Looks good but some kernel features are missing which, depending on your process tree, may cause dump or restore failure. ```

And I don't see any problems with tcpdump.

tcpdump from master shard:

image

tcpdump from slave shard:

image

I need your help!

mrc1119 commented 2 years ago

help~ @Snorch @avagin @rst0git @adrianreber

Snorch commented 2 years ago

Not sure if it's related or not... Once in my experience I saw such a hang of retransmiting packet (criu was not involved). We had several Cisco switches in the middle between the node and it's backup storage, and we saw that when we were sending backup data, random packet may stuck with constant retransmits, because when it is was sent from "first" switch to a "core" switch it surely appears in "first" switch send log and never appears in the "core" switch receive log. We decided it was a Cisco switch bug and reported it to them, but not sure if it resolved later or not.

So my advice is to connect source and destination directly and test if it would reproduce to exclude that packet is randomly lost in the middle.

avagin commented 2 years ago

Error (criu/cr-check.c:1273): Time namespaces are not supported

Have you rebooted the host between dump and restore?

avagin commented 2 years ago

Form the tcp dump, we can see that a connection is alive and there are some data that are being transferred. You need to check that other side reads from the tcp socket.

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.