Closed paul-conneqtech closed 4 years ago
If you have a hung instance of Supercronic, can you send it the ABRT signal using `kill -ABRT "$PID_OF_SUPERCRONIC"`?
This should cause Supercronic to exit and dump stack traces of all current goroutines to stderr. Can you capture this output and share it here? This should help us identify where it's hung.
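For illustration, here is what sending ABRT looks like, using a throwaway `sleep` as a stand-in for Supercronic (with Supercronic, the Go runtime would also dump the stacks of all goroutines to stderr before exiting; the variable assignment here is just for the demo — normally you'd find the PID with `pidof supercronic` or similar):

```shell
# Start a stand-in process and grab its PID:
sleep 60 &
PID_OF_SUPERCRONIC=$!    # hypothetical; use supercronic's real PID

# Send the abort signal:
kill -ABRT "$PID_OF_SUPERCRONIC"

# The process terminates on signal 6, so its wait status is 128 + 6:
wait "$PID_OF_SUPERCRONIC"
rc=$?
echo "exit status: $rc"
```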
Thanks!
I should mention that my guess here is that Supercronic is waiting on your `sh` process to exit. That said, if that were the case, it should log warnings like `not starting: job is still running since ...`
The next time it hangs I will send the ABRT signal. As an extra test, I added a job that just cats the cpuinfo every 2 seconds, which should speed up the hang if my guess is correct (to about 9 hours).
The log is completely silent after a hang.
Plus, shouldn't the jobs that had logged `job succeeded` just start again in that case, since that is logged after the old process is completely done?
Yes; you're correct that this shouldn't normally cause Supercronic to stop altogether. The process being there is probably meaningful but I can't really make much of it without the dump.
Thanks!
The process hung again, now after 3,811,653 lines. However, the binary doesn't respond to the ABRT signal. I'm not quite sure why, though; I'm using your prebuilt release binary. I tried sending the signal via the Docker daemon, and tried sending it while logged into the container, both as the user it runs as and as the root user. The process doesn't stop.
The process is the Docker `CMD`, so it's PID 1.
Can you use `strace` to see what the process is doing? You should be able to do so by running it as root on your host (i.e. not in the container), using `strace -p "$PID_OF_SUPERCRONIC"`. The PID there should be the PID in the main PID namespace (so, not 1). You can get that PID by running `docker top "$CONTAINER_ID_OF_SUPERCRONIC"`, which will give you the PIDs in the main PID namespace.
Ideally, can you try and keep this hung so we can iterate on it?
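Putting the two steps together — a sketch that parses `docker top`-style output (using the output shown later in this thread as sample input; the column layout varies across Docker versions, so check which column holds the PID in your own output):

```shell
# Sample `docker top` output (here the host-namespace PID is column 1):
docker_top_output='32370 1000       1:48 supercronic /etc/crontabs/conneqtech
32760 1000       0:00 [sh]'

# Pick the PID of the supercronic row:
HOST_PID=$(printf '%s\n' "$docker_top_output" | awk '/supercronic/ { print $1; exit }')
echo "$HOST_PID"

# Then, as root on the host: strace -p "$HOST_PID"
```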
```
32370 1000       1:48 supercronic /etc/crontabs/conneqtech
32760 1000       0:00 [sh]

04335c67b455:/# strace -p 32370
strace: Process 32370 attached
futex(0x600d50, FUTEX_WAIT, 0, NULL
```
Sorry, there are probably multiple threads here. Can you get their PIDs using `ps -T -p "$PID_OF_SUPERCRONIC"`, then run `strace` on all of them (`strace -p PID1 -p PID2 ...`)? The PIDs of the threads are in the `SPID` column in the `ps` output.
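A sketch of how to turn the thread list into an `strace` argument list in one go (using the current shell's PID `$$` as a stand-in for Supercronic's):

```shell
# Stand-in PID; substitute supercronic's host-namespace PID here:
PID_OF_SUPERCRONIC=$$

# `-o spid=` prints just the thread IDs, with no header line;
# awk turns each one into a `-p TID` argument:
STRACE_ARGS=$(ps -T -p "$PID_OF_SUPERCRONIC" -o spid= | awk '{ printf "-p %s ", $1 }')
echo "strace $STRACE_ARGS"
```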
Could you also try and get the logs coming from Supercronic (i.e. all the logs that are not just forwarded from your app) leading up to the hang?
```
root@7129a3f41ea6:/# strace -p 32370 -p 32444 -p 32445 -p 32446 -p 32447 -p 32448 -p 32449 -p 32481 -p 1167 -p 2519 -p 4398 -p 10343 -p 7407
strace: Process 32370 attached
strace: Process 32444 attached
strace: Process 32445 attached
strace: Process 32446 attached
strace: Process 32447 attached
strace: Process 32448 attached
strace: Process 32449 attached
strace: Process 32481 attached
strace: Process 1167 attached
strace: Process 2519 attached
strace: Process 4398 attached
strace: Process 10343 attached
strace: Process 7407 attached
[pid  7407] futex(0xc42002d990, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  4398] futex(0xc42005cb90, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  2519] futex(0xc420364110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  1167] futex(0xc420364b90, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32481] futex(0xc42005c810, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32445] futex(0x61e660, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 32444] write(2, "SIGABRT: abort", 14 <unfinished ...>
[pid 32449] epoll_wait(6,  <unfinished ...>
[pid 32448] futex(0xc42002d610, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32447] futex(0x61e660, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 32446] futex(0x61e660, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 10343] write(2, "\n", 1 <unfinished ...>
[pid 32370] futex(0x600d50, FUTEX_WAIT, 0, NULL
```
It looks like `stderr` is blocked and Supercronic can't write to it:

```
[pid 32444] write(2, "SIGABRT: abort", 14 <unfinished ...>
```

In a Docker container, stderr is a pipe, which Docker itself reads from. If the pipe fills up (IIRC the capacity is 64k on Linux), eventually Supercronic will no longer be able to write to the pipe (and writes will block). This appears to be why you never saw the SIGABRT output: it's blocked.
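You can see this blocking behavior directly with a throwaway FIFO (an anonymous pipe behaves the same way; the 64 KiB figure is the Linux default and can differ on a tuned system):

```shell
# Create a pipe that nobody drains:
fifo=$(mktemp -u)
mkfifo "$fifo"
exec 3<>"$fifo"   # open read+write so open() doesn't block and writes don't SIGPIPE

# Try to write 128 KiB; the writer blocks once the kernel's pipe buffer
# is full, and `timeout` then kills it (timeout's exit status is 124):
timeout 2 dd if=/dev/zero bs=1024 count=128 status=none >&3
rc=$?

exec 3>&-
rm -f "$fifo"
echo "writer exit status: $rc"
```

This is exactly the situation the strace output shows: the `write(2, ...)` calls never return because nothing is reading the other end.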
What version of Docker are you running? Do you see any errors in the Docker log?
The output of `docker version` is:

```
Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.2
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:12:25 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:23:03 2018
  OS/Arch:      linux/amd64
  Experimental: false
```
I see no errors in the Docker log.
Can you try and figure out who has the other end of this pipe and what they're doing (using strace again, presumably)? You can identify the ID of the pipe like so (as root, on the host): `ls -lah "/proc/$PID_OF_SUPERCRONIC/fd"`. The pipe we're interested in is the file named `2` (i.e. `stderr`).
Then, locate the other end using the instructions here: https://superuser.com/questions/401066/find-what-process-is-on-the-other-end-of-a-pipe
Then, figure out what those threads are doing. This will probably be a `docker-containerd-shim` process. There will probably be multiple threads (the lines you find will probably look like this: `/proc/PID/task/TID/fd/12`). Use the `TID` with `strace`.
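A sketch of the lookup, demonstrated on a throwaway pipeline (for Supercronic you would start from `/proc/$PID_OF_SUPERCRONIC/fd/2` instead of `fd/0`, and run it as root on the host so you can see Docker's processes too):

```shell
# A throwaway pipeline whose pipe we'll trace:
sleep 10 | sleep 10 &
READER_PID=$!   # PID of the *last* command in the pipeline

# Its stdin is the read end of the pipe; extract the pipe's inode
# from the "pipe:[INODE]" symlink target:
INODE=$(readlink "/proc/$READER_PID/fd/0" | tr -dc '0-9')

# Find every fd (in any process or thread we may inspect) pointing
# at the same pipe — i.e. both ends:
PIPE_ENDS=$(find /proc -lname "pipe:\[$INODE\]" 2>/dev/null)
echo "$PIPE_ENDS"

kill "$READER_PID" 2>/dev/null
```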
Hmm, maybe it's because we run on RancherOS, but if I try to identify the ID of the pipe, I see `/dev/pts/0`. And that is the same for all the containers on the host.
I did run strace on the threads of the `docker-containerd-shim` associated with the container: https://pastebin.com/PUZy3fjd
So, `/dev/pts/0` there suggests your container was started with a TTY allocated. Unfortunately, I'm not certain how Docker TTYs work, and as a result, I'm not sure where things might be getting hung here.
Are you able to try and launch this container without a TTY?
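For reference, the TTY toggle is the `-t`/`--tty` flag on `docker run`, or the `tty` key in a Compose file — a hypothetical fragment (service and image names made up):

```yaml
services:
  cron:
    image: my-supercronic-image        # hypothetical image name
    command: supercronic /etc/crontabs/conneqtech
    tty: false   # the default; `tty: true` is what allocates the pts device
```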
I'll make a setup where one container is running with a TTY and one without, and see if there is a difference.
Sounds good; thanks, let me know what you find!
I changed our main container to run without a TTY, and it hasn't stalled since.
However, in the test environment where I have one container with a TTY and one without, both run fine as well.
Does anybody have this issue?...
I'm going to close this for now, since there hasn't been activity in a while.
I've been running Supercronic in a Docker / Rancher environment for a couple of months now, and it stalls regularly.
Simplified sketch of my case:
The issue: in `ps aux` I see only supercronic, a `[sh]`, and my current bash processes. The `[sh]` process is not present when Supercronic is running normally.
Some (relevant) version information:
On a separate environment running only 2 jobs every minute, the same happens, only less frequently. So I expect it has something to do with the output logging; the last 6 days produced a total of 2,140,101 lines according to Kibana.