ConSol-Monitoring / mod-gearman-worker-go

Mod-Gearman Worker rewrite in Golang
GNU General Public License v3.0

Gearman-worker-go crashes/vanishes randomly #16

Closed: infraweavers closed this issue 2 years ago

infraweavers commented 2 years ago

Hiya,

Occasionally (maybe 10-12 times per year) we see the gearman-worker-go process just "end" or vanish.

There's nothing in ~/var/log/mod_gearman/worker.log when it happens, and both dmesg and /var/log/syslog are empty, so it doesn't look like it's segfaulting or anything.

Running OMD 4.40 on Debian 10.

worker.conf:

debug=0
logfile=/omd/sites/default/var/log/gearman/worker.log
config=/omd/sites/default/etc/mod-gearman/port.conf
dupserver=partnerserver:4730
eventhandler=yes
notifications=yes
services=yes
hosts=yes
encryption=yes
keyfile=/omd/sites/default/etc/mod-gearman/secret.key
job_timeout=60
min-worker=100
max-worker=500
idle-timeout=30
max-jobs=1000
max-age=0
spawn-rate=100
fork_on_exec=no
load_limit1=0
load_limit5=0
load_limit15=0
show_error_output=yes
enable_embedded_perl=on
use_embedded_perl_implicitly=off
use_perl_cache=on
p1_file=/omd/sites/default/share/mod_gearman/mod_gearman_p1.pl

We'll probably try bumping debug up to 1 (if we can do so without filling up the disks) to see whether that sheds any light on what is happening, unless there are better/other suggestions.
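Roughly what we have in mind, as a sketch only (the worker config path and the omd restart target are assumptions based on a standard OMD site layout):

sed -i 's/^debug=0/debug=1/' /omd/sites/default/etc/mod-gearman/worker.conf   # exact config path is an assumption
omd restart gearman_worker                                                    # run as the site user
watch -n 60 'du -h /omd/sites/default/var/log/gearman/worker.log'             # logfile path taken from the config above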

sni commented 2 years ago

Never heard of anything like that.

dirtyren commented 2 years ago

Did you check the kernel logs to see if the kernel killed the process for some reason, @infraweavers?
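For example, something along these lines should surface any OOM-killer trace (a sketch; assumes journald is available on the box):

dmesg -T | grep -i -E 'oom|killed process'
journalctl -k -b | grep -i -E 'oom|killed process'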

infraweavers commented 2 years ago

Yeah, there was nothing in dmesg :/ @dirtyren

infraweavers commented 2 years ago

I'm going to close this until it happens again. It hasn't happened since February. We've since upgraded the OMD instances to Debian 11, so it's possible it was something to do with Debian 10, but ¯\_(ツ)_/¯

dvnscr commented 1 year ago

Even on Debian 11 I have never been able to run the OMD gearman_worker without monit (a sketch of that watchdog is below, after the log). It just vanishes without any trace, as if it stopped by itself. The logs don't contain much useful information, just the worker shutting down:

...
[2023-07-23 14:20:11.127][Debug][worker.go:299] finished host          job: handle: H:<job server>:4600970 - host: <edited> (took: 0.000s | exec: exec)
[2023-07-23 14:20:11.183][Debug][worker.go:297] finished service       job: handle: H:<job server>:4514865 - host: <edited> - service: HTTPS (took: 13.000s | exec: exec)
[2023-07-23 14:20:11.268][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.268][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.268][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.268][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.268][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down

It may be related to gearmand itself somehow: gearmand is hanging constantly too, and maybe that causes the workers to misbehave. Still trying to figure it out.
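For context, the monit watchdog mentioned above boils down to something like this one-liner (a sketch only; the process name and the omd restart target are assumptions):

pgrep -f mod_gearman_worker >/dev/null || omd restart gearman_worker   # run e.g. every few minutes from cron or monit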

sni commented 1 year ago

Please update to the latest OMD 5.20. The worker should run smoothly now.

dvnscr commented 1 year ago

Updated. It didn't help. The workers keep vanishing, even more often, and gearmand hangs every 20 minutes or so. The strange part is that gearmand was working for half an hour with 24 worker nodes connected to it; the moment I did omd start gearman_worker on the 25th worker node, gearmand became unresponsive.

The last worker log:

[2023-07-25 12:24:24.213][Warn][resultserver.go:103] failed to send back result, will continue to retry for 2 minutes: %!w(*fmt.wrapError=&{client: dial tcp <edited gearmand server>:4730: connect: connection timed out 0xc000908410})
[2023-07-25 12:24:24.213][Warn][resultserver.go:103] failed to send back result, will continue to retry for 2 minutes: %!w(*fmt.wrapError=&{client: dial tcp <edited gearmand server>:4730: connect: connection timed out 0xc000420230})
[2023-07-25 12:24:24.213][Warn][resultserver.go:103] failed to send back result, will continue to retry for 2 minutes: %!w(*fmt.wrapError=&{client: dial tcp <edited gearmand server>:4730: connect: connection timed out 0xc00087e640})
[2023-07-25 12:25:14.389][Warn][resultserver.go:103] failed to send back result, will continue to retry for 2 minutes: %!w(*fmt.wrapError=&{client: dial tcp <edited gearmand server>:4730: connect: connection timed out 0xc000908820})
[2023-07-25 12:25:14.389][Warn][resultserver.go:103] failed to send back result, will continue to retry for 2 minutes: %!w(*fmt.wrapError=&{client: dial tcp <edited gearmand server>:4730: connect: connection timed out 0xc000024960})
[2023-07-25 12:41:58.951][Info][mainWorker.go:545] 10s timeout hit while waiting for all workers to stop, remaining: 66
[2023-07-25 13:47:21.638][Info][mainWorker.go:545] 10s timeout hit while waiting for all workers to stop, remaining: 2

The times don't correlate much with gearmand.log, though:

ERROR 2023-07-25 13:53:49.000000 [  main ] send() failed, closing connection(Connection timed out) -> libgearman-server/io.cc:404
ERROR 2023-07-25 13:53:49.000000 [  main ] send() failed, closing connection(Connection timed out) -> libgearman-server/io.cc:404
ERROR 2023-07-25 13:53:49.000000 [  main ] send() failed, closing connection(Connection timed out) -> libgearman-server/io.cc:404
....... <repeating>
ERROR 2023-07-25 13:56:12.000000 [  main ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221
ERROR 2023-07-25 13:56:12.000000 [  main ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221
ERROR 2023-07-25 13:56:12.000000 [  main ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221
ERROR 2023-07-25 14:22:05.000000 [  main ] send() failed, closing connection(Connection timed out) -> libgearman-server/io.cc:404
ERROR 2023-07-25 14:22:05.000000 [  main ] send() failed, closing connection(Connection timed out) -> libgearman-server/io.cc:404
ERROR 2023-07-25 14:22:05.000000 [  main ] send() failed, closing connection(Connection timed out) -> libgearman-server/io.cc:404
ERROR 2023-07-25 14:22:05.000000 [  main ] send() failed, closing connection(Connection timed out) -> libgearman-server/io.cc:404
ERROR 2023-07-25 14:22:05.000000 [  main ] send() failed, closing connection(Connection timed out) -> libgearman-server/io.cc:404

neb.log:

[2023-07-25 13:54:08][710354][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to 10.0.1.100:4730 failed -> libgearman/connection.cc:724: pid(710354)
[2023-07-25 13:54:08][710354][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to 10.0.1.100:4730 failed -> libgearman/connection.cc:724: pid(710354)
<..repeating..>
[2023-07-25 13:54:11][710354][INFO ] gearmand submission statistics: jobs:   4771   errors:      11   submit_rate:   80.6/s   avg_submit_duration: 0.012004s   max_submit_duration: 5.005129s
[2023-07-25 14:08:30][887133][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to <edit>:4730 failed -> libgearman/connection.cc:724: pid(887133)
[2023-07-25 14:08:30][887133][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to <edit>:4730 failed -> libgearman/connection.cc:724: pid(887133)
<..repeating..>
[2023-07-25 14:19:42][887133][INFO ] add_job_to_queue() retrying... 0
[2023-07-25 14:19:57][887133][ERROR] sending job to gearmand failed: gearman_wait(GEARMAN_TIMEOUT) timeout reached, 1 servers were poll(), no servers were available, pipe:false -> libgearman/universal.cc:346: pid(887133)
[2023-07-25 14:19:57][887133][ERROR] failed to send service check to gearmand
[2023-07-25 14:20:07][887133][ERROR] failed to send service check to gearmand
[2023-07-25 14:20:17][887133][ERROR] failed to send service check to gearmand
[2023-07-25 14:20:22][887133][INFO ] gearmand submission statistics: jobs:    493   errors:       8   submit_rate:    8.0/s   avg_submit_duration: 0.111987s   max_submit_duration: 5.750770s
[2023-07-25 14:20:27][887133][ERROR] failed to send host check to gearmand
[2023-07-25 14:20:37][887133][ERROR] failed to send service check to gearmand
[2023-07-25 14:20:47][887133][ERROR] failed to send service check to gearmand
[2023-07-25 14:20:57][887133][ERROR] sending job to gearmand failed: gearman_wait(GEARMAN_TIMEOUT) timeout reached, 1 servers were poll(), no servers were available, pipe:false -> libgearman/universal.cc:346: pid(887133) (6 lost jobs so far)
[2023-07-25 14:20:57][887133][ERROR] failed to send host check to gearmand
[2023-07-25 14:21:07][887133][ERROR] failed to send host check to gearmand
[2023-07-25 14:21:17][887133][ERROR] failed to send service check to gearmand
[2023-07-25 14:21:22][887133][INFO ] gearmand submission statistics: jobs:     12   errors:      12   submit_rate:    0.0/s   avg_submit_duration: 5.005104s   max_submit_duration: 5.005132s
[2023-07-25 14:21:27][887133][ERROR] failed to send host check to gearmand
[2023-07-25 14:21:37][887133][ERROR] failed to send host check to gearmand
[2023-07-25 14:21:47][887133][ERROR] failed to send host check to gearmand
[2023-07-25 14:21:57][887133][ERROR] sending job to gearmand failed: gearman_wait(GEARMAN_TIMEOUT) timeout reached, 1 servers were poll(), no servers were available, pipe:false -> libgearman/universal.cc:346: pid(887133) (12 lost jobs so far)
[2023-07-25 14:21:57][887133][ERROR] failed to send host check to gearmand
[2023-07-25 14:22:06][887133][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to <edit>:4730 failed -> libgearman/connection.cc:724: pid(887133)
[2023-07-25 14:22:06][887133][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to <edit>:4730 failed -> libgearman/connection.cc:724: pid(887133)
...

After restarting in debug mode, it hangs almost immediately:

[2023-07-25 14:34:13][922565][DEBUG] finished initializing
[2023-07-25 14:34:15][922565][DEBUG] registered neb callbacks
[2023-07-25 14:34:15][922565][DEBUG] started result_worker thread for queue: check_results
<..repeat..>
[2023-07-25 14:34:20][922565][INFO ] add_job_to_queue() retrying... 0
[2023-07-25 14:34:25][922565][ERROR] sending job to gearmand failed: gearman_wait(GEARMAN_TIMEOUT) timeout reached, 1 servers were poll(), no servers were available, pipe:false -> libgearman/universal.cc:346: pid(922565)
[2023-07-25 14:34:25][922565][ERROR] failed to send host check to gearmand
[2023-07-25 14:34:31][922565][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to <edit>:4730 failed -> libgearman/connection.cc:724: pid(922565)
[2023-07-25 14:34:31][922565][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to <edit>:4730 failed -> libgearman/connection.cc:724: pid(922565)
<.. repeating ..>
[2023-07-25 14:35:03][922565][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to <edit>:4730 failed -> libgearman/connection.cc:724: pid(922565)
[2023-07-25 14:35:03][922565][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to <edit>:4730 failed -> libgearman/connection.cc:724: pid(922565)
[2023-07-25 14:35:04][922565][DEBUG] started result_worker thread for queue: check_results
[2023-07-25 14:35:04][922565][DEBUG] started result_worker thread for queue: check_results
[2023-07-25 14:35:04][922565][DEBUG] started result_worker thread for queue: check_results
<..repeat..>
(50 result workers)

It repeats until it starts accepting jobs again, then after some time it hangs again. Usually the workers on the worker nodes vanish after they can't establish a connection to gearmand for a while. Is it possible that gearmand has a maximum number of connections it can accept? (I have 27 nodes connecting to gearmand, each configured with 50-100 workers; the check count is ~6k hosts / ~80k services. The instability starts around a 500 jobs/s submission rate with around 15k checks waiting in the queue; it is most stable at around 150 jobs/s with almost no jobs waiting in the queue.)

When it hangs and gets restarted, the jobs/s rate can stay very low and it won't stabilize for some time (until it is restarted a few more times together with naemon), so you won't see big numbers in the logs here.

sni commented 1 year ago

From this description it could be either some firewall issue, or gearmand hitting its open-files limit, or something like that.
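A quick way to check the open-files theory on the gearmand host (a sketch):

pid=$(pidof gearmand)
grep 'Max open files' "/proc/$pid/limits"
ls "/proc/$pid/fd" | wc -l    # current number of open descriptors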

dvnscr commented 1 year ago

Unfortunately it is not that simple. If there were a firewall problem, the local gearman worker wouldn't have trouble connecting. Stracing the gearmand process itself while it hangs shows lots of timeouts. For now it has stabilized and has been running smoothly for an hour. I changed --threads from 0 to --threads=10 (with 0 it used two threads, which may have caused the bottleneck, but it's too early to tell).
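For reference, an illustrative invocation with more I/O threads, plus the strace used to watch gearmand while it hangs (the flags other than --threads and the paths are assumptions; OMD normally starts gearmand through its own init script):

gearmand --port=4730 --threads=10 --log-file=/omd/sites/default/var/log/gearman/gearmand.log
strace -f -tt -e trace=network -p "$(pidof gearmand)"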

dvnscr commented 1 year ago

Workers are still vanishing, but at a much slower rate. The logs are empty. Gearmand is still running after the thread increase.

dvnscr commented 1 year ago

Workers are still vanishing, though rarely, but gearmand has been running very stably, without hanging or crashing, for almost a month since increasing --threads to 10.