Closed: infraweavers closed this issue 2 years ago
Never heard of anything like that.
Did you check the kernel logs to see whether the kernel had killed the process for some reason, @infraweavers?
Yeah, there was nothing in dmesg :/ @dirtyren
I'm going to close this until it happens again. It hasn't happened since February. We've since upgraded the OMD instances to Debian 11, so it's possible it was something to do with Debian 10, but ¯\_(ツ)_/¯
Even on Debian 11 I have never been able to run the OMD gearman_worker without monit watching it (a minimal monit sketch is included after the logs below). It just vanishes without any trace, as if it stopped by itself. The logs contain no useful information, just the worker shutting down:
...
[2023-07-23 14:20:11.127][Debug][worker.go:299] finished host job: handle: H:<job server>:4600970 - host: <edited> (took: 0.000s | exec: exec)
[2023-07-23 14:20:11.183][Debug][worker.go:297] finished service job: handle: H:<job server>:4514865 - host: <edited> - service: HTTPS (took: 13.000s | exec: exec)
[2023-07-23 14:20:11.268][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.268][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.268][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.268][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.268][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
[2023-07-23 14:20:11.269][Debug][worker.go:256] worker shutting down
It may be related somehow to gearmand itself, since gearmand is constantly hanging too; maybe that makes the workers misbehave. Still trying to figure it out.
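For reference, the kind of monit watchdog used to keep the worker alive looks roughly like this. It is only a sketch: the site name "monitoring", the process match pattern and the omd path are assumptions and will differ per installation.

# monit sketch: restart the OMD gearman worker if its process disappears
# (site name, match pattern and paths are placeholders, adjust to your setup)
check process gearman_worker matching "mod_gearman_worker"
  start program = "/usr/bin/omd start gearman_worker" as uid "monitoring" and gid "monitoring"
  stop program  = "/usr/bin/omd stop gearman_worker" as uid "monitoring" and gid "monitoring"
  if does not exist for 2 cycles then restart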
Please update to the latest 5.20 OMD. The worker should run smoothly now.
Updated. It didn't help; the workers keep vanishing even more often, and gearmand hangs every 20 minutes or so. The strange thing is that gearmand had been working for half an hour with 24 worker nodes connected to it; the moment I ran omd start gearman_worker on the 25th worker node, gearmand became unresponsive.
The last worker log:
[2023-07-25 12:24:24.213][Warn][resultserver.go:103] failed to send back result, will continue to retry for 2 minutes: %!w(*fmt.wrapError=&{client: dial tcp <edited gearmand server>:4730: connect: connection timed out 0xc000908410})
[2023-07-25 12:24:24.213][Warn][resultserver.go:103] failed to send back result, will continue to retry for 2 minutes: %!w(*fmt.wrapError=&{client: dial tcp <edited gearmand server>:4730: connect: connection timed out 0xc000420230})
[2023-07-25 12:24:24.213][Warn][resultserver.go:103] failed to send back result, will continue to retry for 2 minutes: %!w(*fmt.wrapError=&{client: dial tcp <edited gearmand server>:4730: connect: connection timed out 0xc00087e640})
[2023-07-25 12:25:14.389][Warn][resultserver.go:103] failed to send back result, will continue to retry for 2 minutes: %!w(*fmt.wrapError=&{client: dial tcp <edited gearmand server>:4730: connect: connection timed out 0xc000908820})
[2023-07-25 12:25:14.389][Warn][resultserver.go:103] failed to send back result, will continue to retry for 2 minutes: %!w(*fmt.wrapError=&{client: dial tcp <edited gearmand server>:4730: connect: connection timed out 0xc000024960})
[2023-07-25 12:41:58.951][Info][mainWorker.go:545] 10s timeout hit while waiting for all workers to stop, remaining: 66
[2023-07-25 13:47:21.638][Info][mainWorker.go:545] 10s timeout hit while waiting for all workers to stop, remaining: 2
The timestamps do not correlate much with gearmand.log, though:
ERROR 2023-07-25 13:53:49.000000 [ main ] send() failed, closing connection(Connection timed out) -> libgearman-server/io.cc:404
ERROR 2023-07-25 13:53:49.000000 [ main ] send() failed, closing connection(Connection timed out) -> libgearman-server/io.cc:404
ERROR 2023-07-25 13:53:49.000000 [ main ] send() failed, closing connection(Connection timed out) -> libgearman-server/io.cc:404
....... <repeating>
ERROR 2023-07-25 13:56:12.000000 [ main ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221
ERROR 2023-07-25 13:56:12.000000 [ main ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221
ERROR 2023-07-25 13:56:12.000000 [ main ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:221
ERROR 2023-07-25 14:22:05.000000 [ main ] send() failed, closing connection(Connection timed out) -> libgearman-server/io.cc:404
ERROR 2023-07-25 14:22:05.000000 [ main ] send() failed, closing connection(Connection timed out) -> libgearman-server/io.cc:404
ERROR 2023-07-25 14:22:05.000000 [ main ] send() failed, closing connection(Connection timed out) -> libgearman-server/io.cc:404
ERROR 2023-07-25 14:22:05.000000 [ main ] send() failed, closing connection(Connection timed out) -> libgearman-server/io.cc:404
ERROR 2023-07-25 14:22:05.000000 [ main ] send() failed, closing connection(Connection timed out) -> libgearman-server/io.cc:404
neb.log:
[2023-07-25 13:54:08][710354][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to 10.0.1.100:4730 failed -> libgearman/connection.cc:724: pid(710354)
[2023-07-25 13:54:08][710354][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to 10.0.1.100:4730 failed -> libgearman/connection.cc:724: pid(710354)
<..repeating..>
[2023-07-25 13:54:11][710354][INFO ] gearmand submission statistics: jobs: 4771 errors: 11 submit_rate: 80.6/s avg_submit_duration: 0.012004s max_submit_duration: 5.005129s
[2023-07-25 14:08:30][887133][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to <edit>:4730 failed -> libgearman/connection.cc:724: pid(887133)
[2023-07-25 14:08:30][887133][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to <edit>:4730 failed -> libgearman/connection.cc:724: pid(887133)
<..repeating..>
[2023-07-25 14:19:42][887133][INFO ] add_job_to_queue() retrying... 0
[2023-07-25 14:19:57][887133][ERROR] sending job to gearmand failed: gearman_wait(GEARMAN_TIMEOUT) timeout reached, 1 servers were poll(), no servers were available, pipe:false -> libgearman/universal.cc:346: pid(887133)
[2023-07-25 14:19:57][887133][ERROR] failed to send service check to gearmand
[2023-07-25 14:20:07][887133][ERROR] failed to send service check to gearmand
[2023-07-25 14:20:17][887133][ERROR] failed to send service check to gearmand
[2023-07-25 14:20:22][887133][INFO ] gearmand submission statistics: jobs: 493 errors: 8 submit_rate: 8.0/s avg_submit_duration: 0.111987s max_submit_duration: 5.750770s
[2023-07-25 14:20:27][887133][ERROR] failed to send host check to gearmand
[2023-07-25 14:20:37][887133][ERROR] failed to send service check to gearmand
[2023-07-25 14:20:47][887133][ERROR] failed to send service check to gearmand
[2023-07-25 14:20:57][887133][ERROR] sending job to gearmand failed: gearman_wait(GEARMAN_TIMEOUT) timeout reached, 1 servers were poll(), no servers were available, pipe:false -> libgearman/universal.cc:346: pid(887133) (6 lost jobs so far)
[2023-07-25 14:20:57][887133][ERROR] failed to send host check to gearmand
[2023-07-25 14:21:07][887133][ERROR] failed to send host check to gearmand
[2023-07-25 14:21:17][887133][ERROR] failed to send service check to gearmand
[2023-07-25 14:21:22][887133][INFO ] gearmand submission statistics: jobs: 12 errors: 12 submit_rate: 0.0/s avg_submit_duration: 5.005104s max_submit_duration: 5.005132s
[2023-07-25 14:21:27][887133][ERROR] failed to send host check to gearmand
[2023-07-25 14:21:37][887133][ERROR] failed to send host check to gearmand
[2023-07-25 14:21:47][887133][ERROR] failed to send host check to gearmand
[2023-07-25 14:21:57][887133][ERROR] sending job to gearmand failed: gearman_wait(GEARMAN_TIMEOUT) timeout reached, 1 servers were poll(), no servers were available, pipe:false -> libgearman/universal.cc:346: pid(887133) (12 lost jobs so far)
[2023-07-25 14:21:57][887133][ERROR] failed to send host check to gearmand
[2023-07-25 14:22:06][887133][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to <edit>:4730 failed -> libgearman/connection.cc:724: pid(887133)
[2023-07-25 14:22:06][887133][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to <edit>:4730 failed -> libgearman/connection.cc:724: pid(887133)
...
After restarting in debug mode, it hangs almost immediately:
[2023-07-25 14:34:13][922565][DEBUG] finished initializing
[2023-07-25 14:34:15][922565][DEBUG] registered neb callbacks
[2023-07-25 14:34:15][922565][DEBUG] started result_worker thread for queue: check_results
<..repeat..>
[2023-07-25 14:34:20][922565][INFO ] add_job_to_queue() retrying... 0
[2023-07-25 14:34:25][922565][ERROR] sending job to gearmand failed: gearman_wait(GEARMAN_TIMEOUT) timeout reached, 1 servers were poll(), no servers were available, pipe:false -> libgearman/universal.cc:346: pid(922565)
[2023-07-25 14:34:25][922565][ERROR] failed to send host check to gearmand
[2023-07-25 14:34:31][922565][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to <edit>:4730 failed -> libgearman/connection.cc:724: pid(922565)
[2023-07-25 14:34:31][922565][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to <edit>:4730 failed -> libgearman/connection.cc:724: pid(922565)
<.. repeating ..>
[2023-07-25 14:35:03][922565][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to <edit>:4730 failed -> libgearman/connection.cc:724: pid(922565)
[2023-07-25 14:35:03][922565][ERROR] worker error: flush(GEARMAN_COULD_NOT_CONNECT) Connection to <edit>:4730 failed -> libgearman/connection.cc:724: pid(922565)
[2023-07-25 14:35:04][922565][DEBUG] started result_worker thread for queue: check_results
[2023-07-25 14:35:04][922565][DEBUG] started result_worker thread for queue: check_results
[2023-07-25 14:35:04][922565][DEBUG] started result_worker thread for queue: check_results
<..repeat..>
(50 result workers)
It repeats until it starts accepting jobs again, then after some time it hangs again. Usually the workers on the worker nodes vanish after they cannot establish a connection to gearmand for some time. Is it possible that gearmand has a maximum number of connections it can accept? (I have 27 nodes connecting to gearmand, each configured with 50-100 workers; the check count is ~6k hosts / ~80k services. The instability starts around a submission rate of 500 jobs/s with around 15k checks waiting in the queue; it is most stable at about 150 jobs/s with almost no jobs waiting in the queue.)
When it hangs and gets restarted, the jobs/s rate can stay very low and it won't stabilize for some time (until it is restarted a few more times together with Naemon), so you won't see big numbers in the logs here.
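One way to check whether gearmand is hitting a listen-backlog or connection limit rather than some hard maximum (a sketch; assumes gearmand listens on the default port 4730):

# for a listening socket, Send-Q = configured backlog, Recv-Q = current accept queue
ss -ltn 'sport = :4730'
# non-zero "listen queue overflowed" / "SYNs to LISTEN dropped" counters mean the accept queue filled up
netstat -s | grep -i listen
# gearmand's own -b/--backlog option controls that queue size
gearmand --help | grep -i backlog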
From this description this could be either a firewall issue, or gearmand hitting its open files limit, or something like that.
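To rule out the open files limit, something along these lines can be checked on the gearmand host (a sketch; pidof may return several PIDs if more than one gearmand is running):

pid=$(pidof gearmand)
# soft/hard open-files limit for this process
grep 'open files' /proc/$pid/limits
# file descriptors currently in use
ls /proc/$pid/fd | wc -l
# established connections on the gearmand port (workers, neb module, clients)
ss -tn state established '( sport = :4730 )' | wc -l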
Unfortunately it is not that easy. If it were a firewall problem, the local gearman worker wouldn't have trouble connecting either. Stracing the gearmand process itself while it hangs shows lots of timeouts happening to gearmand itself.
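The strace used here is roughly of this form (a sketch; -f matters so that all of gearmand's threads are followed):

# attach to the running gearmand and log network-related syscalls with timestamps
strace -f -tt -e trace=network -p "$(pidof gearmand)" -o /tmp/gearmand.strace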
For now it has stabilized and has been running smoothly for an hour. I have changed gearmand's --threads from 0 to --threads=10 (with 0 it used two threads, which may have been the bottleneck, but it is too early to tell).
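A quick way to confirm the new thread count actually took effect (a sketch; NLWP counts all of gearmand's threads, so expect a few more than the --threads value):

# the -t/--threads option and its default
gearmand --help | grep -i thread
# NLWP = number of threads of the running process
ps -o nlwp,args -p "$(pidof gearmand)"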
Workers are still vanishing, but at a much slower rate, and their logs are empty. gearmand is still running after the thread increase.
Workers still vanish occasionally, but gearmand has been running very stably, without hanging or crashing, for almost a month since increasing --threads to 10.
Hiya,
Occasionally (maybe 10-12 times per year) we see the gearman-worker-go process just "end" or vanish.
There's nothing in ~/var/log/mod_gearman/worker.log when it happens, and both dmesg and /var/log/syslog are empty, so it doesn't look like it's segfaulting or anything. Running OMD 4.40 on Debian 10.
worker.conf:
I think we're probably going to see if we can bump debug up to 1 without filling up the disks etc., to see if that will shine any light on what is happening, unless there are any better/other suggestions.
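I.e. setting something like this in the worker config (a sketch; the level semantics below are from the classic mod_gearman docs, so worth double-checking against the gearman-worker-go / OMD version in use):

# worker config sketch: 0 = errors only, 1 = debug, 2 = trace, 3 = trace plus gearman output
debug=1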