Open throwbear opened 3 years ago
Hmm, that is a tough one. If an application crashes sometimes after restore it is always hard to tell.
The only idea I currently have is that it might be related to jumps in the clock. There is a ticket at supervisord (https://github.com/Supervisor/supervisor/issues/1043) where jumps in the clock lead to problems. Maybe you are seeing something similar.
Are you moving the checkpoint to another machine? Are you rebooting between checkpoints? Maybe you could try to work with a time namespaces to avoid jumps of the clock. A quick look at the source code of supervisord seems to indicate that it is using CLOCK_MONOTONIC. A time namespace could help.
It also seems you are using a really old version of supervisord. Newer version have changes to deal with clock jumps.
Maybe the clock thing can help you. Maybe it is all completely wrong.
@throwbear Does the problem occur with Python 3 as well?
Hmm, that is a tough one. If an application crashes sometimes after restore it is always hard to tell.
The only idea I currently have is that it might be related to jumps in the clock. There is a ticket at supervisord (Supervisor/supervisor#1043) where jumps in the clock lead to problems. Maybe you are seeing something similar.
Are you moving the checkpoint to another machine? Are you rebooting between checkpoints? Maybe you could try to work with a time namespaces to avoid jumps of the clock. A quick look at the source code of supervisord seems to indicate that it is using CLOCK_MONOTONIC. A time namespace could help.
It also seems you are using a really old version of supervisord. Newer version have changes to deal with clock jumps.
Maybe the clock thing can help you. Maybe it is all completely wrong.
thanks a lot. Actually, I was trying to restore the process in another machine. Do you mean that a time difference between two machines leads to the supervisord aborted after executing criu-restore
Actually, I was trying to restore the process in another machine.
Good to know.
Do you mean that a time difference between two machines leads to the supervisord aborted after executing criu-restore
Could be. At this point I would say this is not a CRIU problem. You should at least first try the newest version of supervisord with Python 3 to see if the error goes away. If that does not help try it with a time namespace.
Anyway, I am just guessing because I saw that there used to be problems with supervisord and clocks. It still might be something completely different.
Could be. At this point I would say this is not a CRIU problem. You should at least first try the newest version of supervisord with Python 3 to see if the error goes away. If that does not help try it with a time namespace.
Anyway, I am just guessing because I saw that there used to be problems with supervisord and clocks. It still might be something completely different.
As this issue describe, subprocess started by supervisord would hang for a long time if there is a out-of-sync between machines, BUT it would not cause supervisord coredump badly.
there is a similar issue #https://github.com/Supervisor/supervisor/issues/1430
criu version 3.15
@throwbear It's not supervisord coredump, its Python, but Python 2.7 has been deprecated. If you could replicate the issue with Python 3 it would be easier to debug.
Actually, I was trying to restore the process in another machine. Do you mean that a time difference between two machines leads to the supervisord aborted after executing criu-restore
You can try to restore supervisord first on the same host. To successfully migrate supervisord between two machines you need to make sure that all libraries (e.g. libc) are exactly the same.
You can try to restore supervisord first on the same host. To successfully migrate supervisord between two machines you need to make sure that all libraries (e.g. libc) are exactly the same.
Thanks a lot. I have install two machine by the same linux images. AND the supervisord-coredump case DOES NOT happen every time when i execute criu-restore from one machine to another. It happened occasionally
What kernel do you use? Does it support time namespaces?
A friendly reminder that this issue had no activity for 30 days.
I checkpointed and restored a python supervisord program. It seems that restore was successful, but a coredump occurred after running for a while:
At the (gdb) prompt, just run the following command:
root@XXXXXXXX:/# gdb -c supervisord.17.core --args python2.7 GNU gdb (Debian 7.12-6) 7.12.0.20161007-git Copyright (C) 2016 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". Type "show configuration" for configuration details. For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/. Find the GDB manual and other documentation resources online at: http://www.gnu.org/software/gdb/documentation/. For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from python2.7...Reading symbols from /usr/lib/debug/.build-id/0d/a6b8d03dd0c23b9dc6009dd14162bc23cbfb2a.debug...done. done.
warning: core file may not match specified executable file. [New LWP 17] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `/usr/bin/python /usr/bin/supervisord -c /etc/supervisor/supervisord.conf'. Program terminated with signal SIGABRT, Aborted.
0 0x00007fb4b67fffff in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
0 0x00007fb4b67fffff in raise () from /lib/x86_64-linux-gnu/libc.so.6
1 0x00007fb4b680142a in abort () from /lib/x86_64-linux-gnu/libc.so.6
2 0x00007fb4b683dc00 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
3 0x00007fb4b6843fc6 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
4 0x00007fb4b6846491 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
5 0x00007fb4b6847f64 in malloc () from /lib/x86_64-linux-gnu/libc.so.6
6 0x0000562e8e19cc5f in PyList_New () at ../Objects/listobject.c:152
7 0x0000562e8e2021e5 in dict_values.lto_priv () at ../Objects/dictobject.c:1294
8 0x0000562e8e1c7a6f in call_function (oparg=, pp_stack=0x7ffe17a75238) at ../Python/ceval.c:4336
9 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
10 0x0000562e8e1c80df in fast_function (nk=, na=, n=, pp_stack=0x7ffe17a75388, func=) at ../Python/ceval.c:4437
11 call_function (oparg=, pp_stack=0x7ffe17a75388) at ../Python/ceval.c:4372
12 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
13 0x0000562e8e1c80df in fast_function (nk=, na=, n=, pp_stack=0x7ffe17a754d8, func=) at ../Python/ceval.c:4437
14 call_function (oparg=, pp_stack=0x7ffe17a754d8) at ../Python/ceval.c:4372
15 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
16 0x0000562e8e1c80df in fast_function (nk=, na=, n=, pp_stack=0x7ffe17a75628, func=) at ../Python/ceval.c:4437
17 call_function (oparg=, pp_stack=0x7ffe17a75628) at ../Python/ceval.c:4372
18 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
19 0x0000562e8e1c80df in fast_function (nk=, na=, n=, pp_stack=0x7ffe17a75778, func=) at ../Python/ceval.c:4437
20 call_function (oparg=, pp_stack=0x7ffe17a75778) at ../Python/ceval.c:4372
21 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
22 0x0000562e8e1c80df in fast_function (nk=, na=, n=, pp_stack=0x7ffe17a758c8, func=) at ../Python/ceval.c:4437
23 call_function (oparg=, pp_stack=0x7ffe17a758c8) at ../Python/ceval.c:4372
24 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
25 0x0000562e8e1c80df in fast_function (nk=, na=, n=, pp_stack=0x7ffe17a75a18, func=) at ../Python/ceval.c:4437
26 call_function (oparg=, pp_stack=0x7ffe17a75a18) at ../Python/ceval.c:4372
27 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
28 0x0000562e8e1c0c35 in PyEval_EvalCodeEx () at ../Python/ceval.c:3584
29 0x0000562e8e1c83e9 in fast_function (nk=0, na=0, n=, pp_stack=0x7ffe17a75c28, func=) at ../Python/ceval.c:4447
30 call_function (oparg=, pp_stack=0x7ffe17a75c28) at ../Python/ceval.c:4372
31 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
32 0x0000562e8e1c0c35 in PyEval_EvalCodeEx () at ../Python/ceval.c:3584
33 0x0000562e8e1c09d9 in PyEval_EvalCode (co=, globals=, locals=) at ../Python/ceval.c:669
34 0x0000562e8e1f144f in run_mod.lto_priv () at ../Python/pythonrun.c:1376
35 0x0000562e8e1ec262 in PyRun_FileExFlags () at ../Python/pythonrun.c:1362
36 0x0000562e8e1ebdae in PyRun_SimpleFileExFlags () at ../Python/pythonrun.c:948
37 0x0000562e8e19c5f1 in Py_Main () at ../Modules/main.c:640
38 0x00007fb4b67ed2e1 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
39 0x0000562e8e19be1a in _start ()
(gdb) py-bt Traceback (most recent call first): File "/usr/lib/python2.7/dist-packages/supervisor/process.py", line 784, in get_dispatchers for process in self.processes.values(): File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 141, in get_process_map process_map.update(group.get_dispatchers()) File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 192, in runforever combined_map.update(self.get_process_map()) File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 101, in run self.runforever() File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 84, in main self.run() File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 374, in go d.main() File "/usr/lib/python2.7/dist-packages/supervisor/supervisord.py", line 364, in main go(options) File "/usr/bin/supervisord", line 11, in load_entry_point('supervisor==3.3.3', 'console_scripts', 'supervisord')()