Open kuailelijuan opened 2 years ago
This is because CRIU does not handle unix domain sockets.
CRIU restores unix domain sockets. Could you show lsof -p PID before dump and after restore?
lsof -p 535|grep .java
java 535 admin txt REG 253,47 7734 4197247 /home/export/servers/jdk1.8.0_60/bin/java
java 535 admin mem REG 253,47 225499 41950226 /home/export/servers/jdk1.8.0_60/jre/lib/amd64/libjava.so
java 535 admin 179u unix 0xffff8b12eecf5e80 0t0 3233858441 /tmp/.java_pid204.tmp
/tmp/.java_pid204.tmp is a unix domain socket file. It does not exist on the filesystem and cannot be copied.
@kuailelijuan could you show lsof -p PID after restore?
Here is some lsof -p PID output after restore:
lsof -p 535|grep .java
java 535 admin txt REG 253,47 7734 4197247 /home/export/servers/jdk1.8.0_60/bin/java
java 535 admin mem REG 253,47 225499 41950226 /home/export/servers/jdk1.8.0_60/jre/lib/amd64/libjava.so
java 535 admin 179u unix 0xffff8b12eecf5e80 0t0 3233858441 /tmp/.java_pid204.tmp
I can confirm that the problem is related to /tmp/.java_pid204.tmp. It is a unix domain socket file that cannot be found after restore. I don't know how CRIU restores unix domain sockets once the process is restored.
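One way to cross-check the lsof output is to compare the socket paths the kernel reports as bound with what is actually on the filesystem. The following is only a minimal sketch, assuming the usual /proc/net/unix column layout (address, refcount, protocol, flags, type, state, inode, optional path). The function names and the SAMPLE text are hypothetical; only the address, inode and path in the first sample row are taken from the lsof output above, the other fields are illustrative.

```python
import os


def bound_unix_sockets(proc_net_unix_text):
    """Map socket inode -> bound path from /proc/net/unix-style text."""
    sockets = {}
    for line in proc_net_unix_text.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 7)
        if len(fields) == 8:  # only sockets bound to a name have a path column
            sockets[int(fields[6])] = fields[7]
    return sockets


def missing_socket_files(sockets):
    """Return bound paths that are absent on disk (abstract '@' names skipped)."""
    return sorted(
        path for path in sockets.values()
        if not path.startswith("@") and not os.path.exists(path)
    )


# Illustrative input only: second row is an unbound socket with no path column.
SAMPLE = """Num       RefCount Protocol Flags    Type St Inode Path
ffff8b12eecf5e80: 00000002 00000000 00010000 0001 01 3233858441 /tmp/.java_pid204.tmp
ffff8b12eecf0000: 00000002 00000000 00000000 0002 01 12345
"""

if __name__ == "__main__":
    # On a live system, read the real table from the process's network namespace.
    with open("/proc/net/unix") as f:
        for path in missing_socket_files(bound_unix_sockets(f.read())):
            print("bound in kernel but missing on disk:", path)
```

Note that /proc/net/unix lists every socket in the current network namespace, not only those of one process, so the inode from lsof (3233858441 here) is what ties a row to the Java process.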
@kuailelijuan Could you run ls -l /tmp/.java_pid204.tmp and show its output?
The strace output for jstack after restore will be helpful too: strace -f -s 1024 jstack $PID
ls -lrth /tmp/.java_pid204.tmp
ls: cannot access /tmp/.java_pid204.tmp: No such file or directory
ls -lrth /tmp/.java_pid204
ls: cannot access /tmp/.java_pid204: No such file or directory
strace -f -s 1024 jstack 535
[pid 619255] futex(0x7f7f3c132028, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 619255] futex(0x7f7f3c132054, FUTEX_WAIT_BITSET_PRIVATE, 1, {24517558, 384192418}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 619255] futex(0x7f7f3c132028, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 619255] futex(0x7f7f3c132054, FUTEX_WAIT_BITSET_PRIVATE, 1, {24517558, 434442127}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 619255] futex(0x7f7f3c132028, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 619255] futex(0x7f7f3c132054, FUTEX_WAIT_BITSET_PRIVATE, 1, {24517558, 484649117}, ffffffff <unfinished ...>
[pid 619187] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 619187] futex(0x7f7f3c009a28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 619187] stat("/tmp/.java_pid535", 0x7f7f4412d3b0) = -1 ENOENT (No such file or directory)
[pid 619187] futex(0x7f7f3c009a54, FUTEX_WAIT_BITSET_PRIVATE, 1, {24517558, 671014045}, ffffffff <unfinished ...>
[pid 619255] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
The original process had PID 204. It relies on /tmp/.java_pid204, a unix domain socket created by the JVM that exists in the original container. After restore the process PID is 535, and neither /tmp/.java_pid204 nor /tmp/.java_pid535 exists.
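The strace output shows jstack calling stat() on /tmp/.java_pid535, a path named after the PID it is asked to attach to. A minimal sketch of the same check follows, assuming the /tmp/.java_pid<PID> naming seen in that trace. The function names and the tmpdir parameter are hypothetical, added only for illustration.

```python
import os
import stat


def attach_socket_path(pid, tmpdir="/tmp"):
    """Build the socket path jstack looked for in the strace above."""
    return os.path.join(tmpdir, ".java_pid%d" % pid)


def attach_socket_status(pid, tmpdir="/tmp"):
    """Return 'socket', 'not-a-socket' or 'missing' for the expected path."""
    path = attach_socket_path(pid, tmpdir)
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"  # corresponds to the ENOENT seen in the strace
    return "socket" if stat.S_ISSOCK(mode) else "not-a-socket"


if __name__ == "__main__":
    # PIDs from this thread: 204 before dump, 535 after restore.
    for pid in (204, 535):
        print(attach_socket_path(pid), attach_socket_status(pid))
```

Running this inside the restored container for both PIDs would show whether either path is present, which is the information the ls commands above were collecting by hand.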
Please provide the exact steps you use to dump and restore these processes, and attach the CRIU logs to this issue. Do you restore the processes in a new pid namespace?
If you can provide steps for a minimal reproducer, it will significantly speed up the investigation.
A friendly reminder that this issue had no activity for 30 days.
Description
535: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding
This is because CRIU does not handle unix domain sockets.