Process seems to hang after 2m output lines

paul-conneqtech commented 6 years ago

I've been running supercronic in a docker / rancher environment for a couple of months now, and it stalls regularly.

Simplified sketch of my case:

It's a single docker instance running 8 cronjobs every minute. And a couple on a @hourly schedule.
At the end of each job a trigger is send to an influx instance, which in turn has a grafana alerting us when not enough jobs are ran.
The docker logs contain 240 lines every minute, this is a combination of supercronics own logs and the output for each job. (our docker logs are pushed to an elk-stack)
On average once or twice a day a single job runs for more than a minute, so the next execution of that job is skipped.

The issue:

After about 6 days, (happend twice since i started logging in grafana and the elk-stack) supercronic just freezes.
The output log shows a couple of jobs with the message 'job succeeded', and a couple jobs in the middle of their own output, which is a multiline message stopped at a random (but at an expected) newline.
If I enter the container and request the processlist ps aux I see only supercronic, a [sh], and my current bash processes. The [sh] process is not present when supercronic is running normally.

Some (relevant) version information:

supercronic: v0.1.6 (ran v0.1.5 before i added the elk logging/alerting and had the same issue)
alpine 3.6 base container

On a separate environment running only 2 jobs every minute the same happens only less frequent. So I expect it has something to do with the output logging, the last 6 days produced a total of 2,140,101 lines accoording to kibana.

krallin commented 6 years ago

If you have a hung instance of Supercronic, can you send it the ABRT signal using kill -ABRT "$PID_OF_SUPERCRONIC"?

This should cause Supercronic to exit and dump stack traces of all current goroutines to stderr. Can you capture this output and share it here? This should help us identify where it's hung.

Thanks!

krallin commented 6 years ago

I should mention that my guess here is that Supercronic is waiting on your sh process to exit. That said, if that were the case, it should log warnings like not starting: job is still running since ...

paul-conneqtech commented 6 years ago

The next time it hangs i will send the ABRT signal, as extra test i added a job that every 2 seconds just cats the cpuinfo. Which should speed up the hang if my guess is correct (to about 9 hours).

The log is completely silent after a hang. Plus shouldn't the jobs that had logged job succeeded just start again in that case, since that is after the old process is completely done

krallin commented 6 years ago

Yes; you're correct that this shouldn't normally cause Supercronic to stop altogether. The process being there is probably meaningful but I can't really make much of it without the dump.

Thanks!

paul-conneqtech commented 6 years ago

The proces hung again, now after 3,811,653 lines. However the binary doesn't respond to the ABRT signal. Not quite sure why tho, I'm using your prebuild release binary, tried sending the signal via the docker daemon, and tried is sending it while logged into the container, as the user it is running at, and the as the root user. The process doesn't stop. The process is the docker CMD so its PID 1.

krallin commented 6 years ago

Can you use strace to see what the process is doing?

You should be able to do so by running it as root on your host (i.e. not in the container), using strace -p "$PID_OF_SUPERCRONIC". The PID there should be the PID in the main PID namespace (so, not 1). You can get that PID by running docker top "$CONTAINER_ID_OF_SUPERCRONIC", which will give you the PIDs in the main PID namespace.

Ideally, can you try and keep this hung so we can iterate on it?

paul-conneqtech commented 6 years ago

32370 1000      1:48 supercronic /etc/crontabs/conneqtech
32760 1000      0:00 [sh]
04335c67b455:/# strace -p 32370
strace: Process 32370 attached
futex(0x600d50, FUTEX_WAIT, 0, NULL

krallin commented 6 years ago

Sorry, there are probably multiple threads here. Can you get their PIDs using ps -T -p "$PID_OF_SUPERCRONIC", then run strace on all of them (strace -p PID1 -p PID2 ...). The PIDs of the threads are in the SPID column in the ps output.

Could you also try and get the logs coming from Supercronic (i.e. all the logs that are not just forwarded from your app) leading up to the hang?

paul-conneqtech commented 6 years ago

root@7129a3f41ea6:/# strace -p 32370 -p 32444 -p 32445 -p 32446 -p 32447 -p 32448 -p 32449 -p 32481 -p  1167 -p  2519 -p  4398 -p 10343 -p  7407
strace: Process 32370 attached
strace: Process 32444 attached
strace: Process 32445 attached
strace: Process 32446 attached
strace: Process 32447 attached
strace: Process 32448 attached
strace: Process 32449 attached
strace: Process 32481 attached
strace: Process 1167 attached
strace: Process 2519 attached
strace: Process 4398 attached
strace: Process 10343 attached
strace: Process 7407 attached
[pid  7407] futex(0xc42002d990, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  4398] futex(0xc42005cb90, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  2519] futex(0xc420364110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  1167] futex(0xc420364b90, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32481] futex(0xc42005c810, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32445] futex(0x61e660, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 32444] write(2, "SIGABRT: abort", 14 <unfinished ...>
[pid 32449] epoll_wait(6,  <unfinished ...>
[pid 32448] futex(0xc42002d610, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32447] futex(0x61e660, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 32446] futex(0x61e660, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 10343] write(2, "\n", 1 <unfinished ...>
[pid 32370] futex(0x600d50, FUTEX_WAIT, 0, NULL

krallin commented 6 years ago

It looks like stderr is blocked and Supercronic can't write to it:

[pid 32444] write(2, "SIGABRT: abort", 14 <unfinished ...>

In a Docker container, stderr is a pipe, which Docker itself reads from. If the pipe fills up (IIRC the capacity is 64k on Linux), eventually Supercronic will no longer be able to write to the pipe (and writes will block). This appears to be why you never saw the SIGABRT output: it's blocked.

What version of Docker are you running? Do you see any errors in the Docker log?

paul-conneqtech commented 6 years ago

The output of docker version is:

Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.2
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:12:25 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:23:03 2018
  OS/Arch:      linux/amd64
  Experimental: false

I see no errors in the docker log

krallin commented 6 years ago

Can you try and figure out who has the other end of this pipe and what they're doing (using strace again, presumably)?

You can identify the ID of the pipe like so (as root, on the host) ls -lah "/proc/$PID_OF_SUPERCRONIC/fd". The pipe we're interested in is the file named 2 (i.e. stderr).

Then, locate the other end using the instructions there: https://superuser.com/questions/401066/find-what-process-is-on-the-other-end-of-a-pipe

Then, figure out what those threads are doing. This will probably be a docker-containerd-shim process. There will probably be multiple threads (the lines you find will probably look like this: /proc/PID/task/TID/fd/12). Use the TID with strace.

paul-conneqtech commented 6 years ago

Hmm.. maybe because we run on RancherOS, but if I try to identify the ID of the pipe, i see /dev/pts/0. And that is the same for all the containers on the host.

I did do a strace on the threads of the docker-containerd-shim associated with the container. https://pastebin.com/PUZy3fjd

krallin commented 6 years ago

So, /dev/pts0 there suggests your container was started with a TTY allocated. Unfortunately, I'm not certain how Docker TTYs work, and as a result, I'm not sure where things might be getting hung here.

Are you able to try and launch this container without a TTY?

paul-conneqtech commented 6 years ago

Ill make a setup where 1 container is running with tty and one without. and see if there is a difference

krallin commented 6 years ago

Sounds good; thanks, let me know what you find!

paul-conneqtech commented 5 years ago

I changed our main container to run without tty, and it hasn't stalled since.

However the test environment where i have 1 container in tty and 1 container without both run fine also.

bl0ck3man commented 4 years ago

Does anybody have this issue?...

krallin commented 4 years ago

I'm going to close this for now, since there hasn't been activity in a while.

aptible / supercronic

Process seems to hang after 2m output lines #36