checkpoint-restore / criu

Checkpoint/Restore tool
criu.org

Unable to open socket file #1748

Open kuailelijuan opened 2 years ago

kuailelijuan commented 2 years ago

Description

  1. Start any Java process.
  2. Run jstack $PID.
  3. Restore the Java process with criu.
  4. Run jstack $PID again; it fails with the following error:

535: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding

this is because criu does not handle unix domain sockets

avagin commented 2 years ago

> this is because criu does not handle unix domain sockets

CRIU restores unix domain sockets. Could you show lsof -p PID before dump and after restore?

kuailelijuan commented 2 years ago

lsof -p 535|grep .java
java 535 admin txt REG 253,47 7734 4197247 /home/export/servers/jdk1.8.0_60/bin/java
java 535 admin mem REG 253,47 225499 41950226 /home/export/servers/jdk1.8.0_60/jre/lib/amd64/libjava.so
java 535 admin 179u unix 0xffff8b12eecf5e80 0t0 3233858441 /tmp/.java_pid204.tmp

/tmp/.java_pid204.tmp is a unix domain socket file. It does not exist on disk and cannot be copied.

avagin commented 2 years ago

@kuailelijuan could you show lsof -p PID after restore?

kuailelijuan commented 2 years ago

Here is some of the lsof -p PID output after restore:

lsof -p 535|grep .java
java 535 admin txt REG 253,47 7734 4197247 /home/export/servers/jdk1.8.0_60/bin/java
java 535 admin mem REG 253,47 225499 41950226 /home/export/servers/jdk1.8.0_60/jre/lib/amd64/libjava.so
java 535 admin 179u unix 0xffff8b12eecf5e80 0t0 3233858441 /tmp/.java_pid204.tmp

I can confirm that the problem is related to /tmp/.java_pid204.tmp. It is a unix domain socket file, and it cannot be found after restore. I don't know how criu restores unix domain sockets after the process is restored.
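To compare the unix socket entries in the two lsof listings mechanically, a small sketch like the following may help. It assumes the default lsof column layout (COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME); the function name `unix_sockets` is hypothetical and not part of lsof or criu.

```python
def unix_sockets(lsof_text):
    """Return (fd, name) pairs for lsof rows whose TYPE column is 'unix'.

    Assumes the default nine-column lsof layout; rows with fewer
    columns are skipped rather than guessed at.
    """
    found = []
    for line in lsof_text.splitlines():
        # NAME is the 9th column and may contain spaces, so split at most 8 times.
        cols = line.split(None, 8)
        if len(cols) >= 9 and cols[4] == "unix":
            found.append((cols[3], cols[8]))
    return found


# Two rows copied from the lsof output in this thread.
sample = """\
java 535 admin txt REG 253,47 7734 4197247 /home/export/servers/jdk1.8.0_60/bin/java
java 535 admin 179u unix 0xffff8b12eecf5e80 0t0 3233858441 /tmp/.java_pid204.tmp
"""

print(unix_sockets(sample))
```

Running the same function on the listing taken before dump and the one taken after restore, and comparing the two results, shows whether a socket entry changed or disappeared.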

avagin commented 2 years ago

@kuailelijuan Could you run ls -l /tmp/.java_pid204.tmp and show its output?

The strace output for jstack after restore will be helpful too: strace -f -s 1024 jstack $PID

kuailelijuan commented 2 years ago

ls -lrth /tmp/.java_pid204.tmp
ls: cannot access /tmp/.java_pid204.tmp: No such file or directory

ls -lrth /tmp/.java_pid204
ls: cannot access /tmp/.java_pid204: No such file or directory

strace -f -s 1024 jstack 535
[pid 619255] futex(0x7f7f3c132028, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 619255] futex(0x7f7f3c132054, FUTEX_WAIT_BITSET_PRIVATE, 1, {24517558, 384192418}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 619255] futex(0x7f7f3c132028, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 619255] futex(0x7f7f3c132054, FUTEX_WAIT_BITSET_PRIVATE, 1, {24517558, 434442127}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 619255] futex(0x7f7f3c132028, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 619255] futex(0x7f7f3c132054, FUTEX_WAIT_BITSET_PRIVATE, 1, {24517558, 484649117}, ffffffff <unfinished ...>
[pid 619187] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 619187] futex(0x7f7f3c009a28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 619187] stat("/tmp/.java_pid535", 0x7f7f4412d3b0) = -1 ENOENT (No such file or directory)
[pid 619187] futex(0x7f7f3c009a54, FUTEX_WAIT_BITSET_PRIVATE, 1, {24517558, 671014045}, ffffffff <unfinished ...>
[pid 619255] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)

The original process had PID 204. It relies on /tmp/.java_pid204, a unix domain socket that is created by the JVM and exists in the original container. After restore the process PID is 535, and neither /tmp/.java_pid204 nor /tmp/.java_pid535 exists.
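The strace output shows jstack calling stat on /tmp/.java_pid535, a path built from the PID that jstack was given. The sketch below checks that path for a given PID, which makes it easy to test both the old and the new PID inside the restored container. The helper names and the `tmp_dir` parameter are hypothetical; only the path pattern comes from the strace output.

```python
import os
import stat


def attach_socket_path(pid, tmp_dir="/tmp"):
    """Build the path that jstack was seen to stat for a given PID."""
    return os.path.join(tmp_dir, f".java_pid{pid}")


def attach_socket_status(pid, tmp_dir="/tmp"):
    """Return 'missing', 'socket', or 'not-a-socket' for the PID's path."""
    path = attach_socket_path(pid, tmp_dir)
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    return "socket" if stat.S_ISSOCK(mode) else "not-a-socket"


if __name__ == "__main__":
    # PIDs taken from this thread: 204 before dump, 535 after restore.
    for pid in (204, 535):
        print(pid, attach_socket_path(pid), attach_socket_status(pid))
```

If both PIDs report "missing" after restore, that matches the ls and strace results already posted in this thread.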

avagin commented 2 years ago

> original process pid 204, The process pid after restore is 535

Please provide the exact steps you use to dump and restore these processes, and attach the criu logs to this issue. Do you restore the processes in a new pid namespace?

If you can provide steps for a minimal reproducer, it will significantly speed up the investigation.

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.