checkpoint-restore / criu

Checkpoint/Restore tool
criu.org
Other
2.99k stars 599 forks source link

[Python Supervisord] criu restore is successful, But the supervisord program coredump #1477

Open throwbear opened 3 years ago

throwbear commented 3 years ago

I checkpointed and restored a python supervisord program. It seems that restore was successful, but a coredump occurred after running for a while:

At the (gdb) prompt, just run the following command:

root@XXXXXXXX:/# gdb -c supervisord.17.core --args python2.7 GNU gdb (Debian 7.12-6) 7.12.0.20161007-git Copyright (C) 2016 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". Type "show configuration" for configuration details. For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/. Find the GDB manual and other documentation resources online at: http://www.gnu.org/software/gdb/documentation/. For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from python2.7...Reading symbols from /usr/lib/debug/.build-id/0d/a6b8d03dd0c23b9dc6009dd14162bc23cbfb2a.debug...done. done.

warning: core file may not match specified executable file. [New LWP 17] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `/usr/bin/python /usr/bin/supervisord -c /etc/supervisor/supervisord.conf'. Program terminated with signal SIGABRT, Aborted.

0 0x00007fb4b67fffff in raise () from /lib/x86_64-linux-gnu/libc.so.6

(gdb) bt

0 0x00007fb4b67fffff in raise () from /lib/x86_64-linux-gnu/libc.so.6

1 0x00007fb4b680142a in abort () from /lib/x86_64-linux-gnu/libc.so.6

2 0x00007fb4b683dc00 in ?? () from /lib/x86_64-linux-gnu/libc.so.6

3 0x00007fb4b6843fc6 in ?? () from /lib/x86_64-linux-gnu/libc.so.6

4 0x00007fb4b6846491 in ?? () from /lib/x86_64-linux-gnu/libc.so.6

5 0x00007fb4b6847f64 in malloc () from /lib/x86_64-linux-gnu/libc.so.6

6 0x0000562e8e19cc5f in PyList_New () at ../Objects/listobject.c:152

7 0x0000562e8e2021e5 in dict_values.lto_priv () at ../Objects/dictobject.c:1294

8 0x0000562e8e1c7a6f in call_function (oparg=, pp_stack=0x7ffe17a75238) at ../Python/ceval.c:4336

9 PyEval_EvalFrameEx () at ../Python/ceval.c:2989

10 0x0000562e8e1c80df in fast_function (nk=, na=, n=, pp_stack=0x7ffe17a75388, func=) at ../Python/ceval.c:4437

11 call_function (oparg=, pp_stack=0x7ffe17a75388) at ../Python/ceval.c:4372

12 PyEval_EvalFrameEx () at ../Python/ceval.c:2989

13 0x0000562e8e1c80df in fast_function (nk=, na=, n=, pp_stack=0x7ffe17a754d8, func=) at ../Python/ceval.c:4437

14 call_function (oparg=, pp_stack=0x7ffe17a754d8) at ../Python/ceval.c:4372

15 PyEval_EvalFrameEx () at ../Python/ceval.c:2989

16 0x0000562e8e1c80df in fast_function (nk=, na=, n=, pp_stack=0x7ffe17a75628, func=) at ../Python/ceval.c:4437

17 call_function (oparg=, pp_stack=0x7ffe17a75628) at ../Python/ceval.c:4372

18 PyEval_EvalFrameEx () at ../Python/ceval.c:2989

19 0x0000562e8e1c80df in fast_function (nk=, na=, n=, pp_stack=0x7ffe17a75778, func=) at ../Python/ceval.c:4437

20 call_function (oparg=, pp_stack=0x7ffe17a75778) at ../Python/ceval.c:4372

21 PyEval_EvalFrameEx () at ../Python/ceval.c:2989

22 0x0000562e8e1c80df in fast_function (nk=, na=, n=, pp_stack=0x7ffe17a758c8, func=) at ../Python/ceval.c:4437

23 call_function (oparg=, pp_stack=0x7ffe17a758c8) at ../Python/ceval.c:4372

24 PyEval_EvalFrameEx () at ../Python/ceval.c:2989

25 0x0000562e8e1c80df in fast_function (nk=, na=, n=, pp_stack=0x7ffe17a75a18, func=) at ../Python/ceval.c:4437

26 call_function (oparg=, pp_stack=0x7ffe17a75a18) at ../Python/ceval.c:4372

27 PyEval_EvalFrameEx () at ../Python/ceval.c:2989

28 0x0000562e8e1c0c35 in PyEval_EvalCodeEx () at ../Python/ceval.c:3584

29 0x0000562e8e1c83e9 in fast_function (nk=0, na=0, n=, pp_stack=0x7ffe17a75c28, func=) at ../Python/ceval.c:4447

30 call_function (oparg=, pp_stack=0x7ffe17a75c28) at ../Python/ceval.c:4372

31 PyEval_EvalFrameEx () at ../Python/ceval.c:2989

32 0x0000562e8e1c0c35 in PyEval_EvalCodeEx () at ../Python/ceval.c:3584

33 0x0000562e8e1c09d9 in PyEval_EvalCode (co=, globals=, locals=) at ../Python/ceval.c:669

34 0x0000562e8e1f144f in run_mod.lto_priv () at ../Python/pythonrun.c:1376

35 0x0000562e8e1ec262 in PyRun_FileExFlags () at ../Python/pythonrun.c:1362

36 0x0000562e8e1ebdae in PyRun_SimpleFileExFlags () at ../Python/pythonrun.c:948

37 0x0000562e8e19c5f1 in Py_Main () at ../Modules/main.c:640

38 0x00007fb4b67ed2e1 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6

39 0x0000562e8e19be1a in _start ()

(gdb) py-bt Traceback (most recent call first): File "/usr/lib/python2.7/dist-packages/supervisor/process.py", line 784, in get_dispatchers for process in self.processes.values(): File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 141, in get_process_map process_map.update(group.get_dispatchers()) File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 192, in runforever combined_map.update(self.get_process_map()) File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 101, in run self.runforever() File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 84, in main self.run() File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 374, in go d.main() File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 364, in main go(options) File "/usr/bin/supervisord", line 11, in load_entry_point('supervisor==3.3.3', 'console_scripts', 'supervisord')()

adrianreber commented 3 years ago

Hmm, that is a tough one. If an application crashes sometimes after restore it is always hard to tell.

The only idea I currently have is that it might be related to jumps in the clock. There is a ticket at supervisord (https://github.com/Supervisor/supervisor/issues/1043) where jumps in the clock lead to problems. Maybe you are seeing something similar.

Are you moving the checkpoint to another machine? Are you rebooting between checkpoints? Maybe you could try to work with a time namespaces to avoid jumps of the clock. A quick look at the source code of supervisord seems to indicate that it is using CLOCK_MONOTONIC. A time namespace could help.

It also seems you are using a really old version of supervisord. Newer version have changes to deal with clock jumps.

Maybe the clock thing can help you. Maybe it is all completely wrong.

rst0git commented 3 years ago

@throwbear Does the problem occur with Python 3 as well?

throwbear commented 3 years ago

Hmm, that is a tough one. If an application crashes sometimes after restore it is always hard to tell.

The only idea I currently have is that it might be related to jumps in the clock. There is a ticket at supervisord (Supervisor/supervisor#1043) where jumps in the clock lead to problems. Maybe you are seeing something similar.

Are you moving the checkpoint to another machine? Are you rebooting between checkpoints? Maybe you could try to work with a time namespaces to avoid jumps of the clock. A quick look at the source code of supervisord seems to indicate that it is using CLOCK_MONOTONIC. A time namespace could help.

It also seems you are using a really old version of supervisord. Newer version have changes to deal with clock jumps.

Maybe the clock thing can help you. Maybe it is all completely wrong.

thanks a lot. Actually, I was trying to restore the process in another machine. Do you mean that a time difference between two machines leads to the supervisord aborted after executing criu-restore

adrianreber commented 3 years ago

Actually, I was trying to restore the process in another machine.

Good to know.

Do you mean that a time difference between two machines leads to the supervisord aborted after executing criu-restore

Could be. At this point I would say this is not a CRIU problem. You should at least first try the newest version of supervisord with Python 3 to see if the error goes away. If that does not help try it with a time namespace.

Anyway, I am just guessing because I saw that there used to be problems with supervisord and clocks. It still might be something completely different.

throwbear commented 3 years ago

Could be. At this point I would say this is not a CRIU problem. You should at least first try the newest version of supervisord with Python 3 to see if the error goes away. If that does not help try it with a time namespace.

Anyway, I am just guessing because I saw that there used to be problems with supervisord and clocks. It still might be something completely different.

As this issue describe, subprocess started by supervisord would hang for a long time if there is a out-of-sync between machines, BUT it would not cause supervisord coredump badly.

there is a similar issue #https://github.com/Supervisor/supervisor/issues/1430

criu version 3.15

rst0git commented 3 years ago

@throwbear It's not supervisord coredump, its Python, but Python 2.7 has been deprecated. If you could replicate the issue with Python 3 it would be easier to debug.

Actually, I was trying to restore the process in another machine. Do you mean that a time difference between two machines leads to the supervisord aborted after executing criu-restore

You can try to restore supervisord first on the same host. To successfully migrate supervisord between two machines you need to make sure that all libraries (e.g. libc) are exactly the same.

throwbear commented 3 years ago

You can try to restore supervisord first on the same host. To successfully migrate supervisord between two machines you need to make sure that all libraries (e.g. libc) are exactly the same.

Thanks a lot. I have install two machine by the same linux images. AND the supervisord-coredump case DOES NOT happen every time when i execute criu-restore from one machine to another. It happened occasionally

avagin commented 3 years ago

What kernel do you use? Does it support time namespaces?

github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.