Gearmand crashes or unreachable

rayzilt commented 2 years ago

We recently upgraded two Gearman Job Servers to Debian 11 and are since then experiencing problems. As far as we debugged the situation we have identified two possible bugs.

Bug one: The gearman-job-server crashes with a exit 1. The debug log shows us that it's always after Comparing queue %u to limit %u for priority %u -> libgearman-server/job.cc:175

   INFO 2022-09-11 13:35:03.561151 [  main ] Accepted connection from 192.168.4.253:58160
  DEBUG 2022-09-11 13:35:03.561209 [     1 ] Received CON wakeup event -> libgearman-server/gearmand_thread.cc:628
  DEBUG 2022-09-11 13:35:03.561249 [     1 ]   192.168.4.253:58160 Watching  POLLIN  -> libgearman-server/gearmand_thread.cc:151
  DEBUG 2022-09-11 13:35:03.561263 [     1 ] Gear connection made: 192.168.4.253:58160 -> libgearman-server/plugins/protocol/gear/protocol.cc:409
  DEBUG 2022-09-11 13:35:03.563331 [     1 ] 192.168.4.253:58160 Ready     POLLIN  -> libgearman-server/gearmand_con.cc:138
  DEBUG 2022-09-11 13:35:03.563376 [     1 ] read 260 bytes -> libgearman-server/io.cc:810
  DEBUG 2022-09-11 13:35:03.563390 [     1 ] Gear unpack -> libgearman-server/plugins/protocol/gear/protocol.cc:117
  DEBUG 2022-09-11 13:35:03.563403 [     1 ] GEAR SUBMIT_JOB_BG -> libgearman-server/plugins/protocol/gear/protocol.cc:285
  DEBUG 2022-09-11 13:35:03.563416 [     1 ] Received GEARMAN_SUBMIT_JOB_BG -> libgearman-server/thread.cc:311
  DEBUG 2022-09-11 13:35:03.563427 [     1 ] PACKET COMMAND: GEARMAN_SUBMIT_JOB_BG -> libgearman-server/server.cc:122
  DEBUG 2022-09-11 13:35:03.563438 [     1 ] Received submission, function:cron_cron_p unique:ad508d1e9bc373ff884df2c0c7c6ee74 with 2 arguments -> libgearman-server/server.cc:252
  DEBUG 2022-09-11 13:35:03.563452 [     1 ] Comparing queue 23 to limit 0 for priority 1 -> libgearman-server/job.cc:175
  DEBUG 2022-09-11 13:35:03.757447 [  main ] THREADS: 1 -> libgearman-server/gearmand.cc:274
   INFO 2022-09-11 13:35:03.757664 [  main ] Initializing MySQL module
   INFO 2022-09-11 13:35:03.764573 [  main ] Initializing Gear on port 4730 with SSL: false
   INFO 2022-09-11 13:35:03.768277 [  main ] Starting up with pid 11770, verbose is set to DEBUG
  DEBUG 2022-09-11 13:35:03.768392 [  main ] Method for libevent: epoll -> libgearman-server/gearmand.cc:376
  DEBUG 2022-09-11 13:35:03.768454 [  main ] Trying to listen on 192.168.4.145:4730 -> libgearman-server/gearmand.cc:658
   INFO 2022-09-11 13:35:03.768506 [  main ] Listening on 192.168.4.145:4730 (11)
  DEBUG 2022-09-11 13:35:03.768521 [  main ] Creating wakeup pipe -> libgearman-server/gearmand.cc:928

Currently we are building a version with some extra debug lines (in job.cc) to investigate this.

Bug two: The gearman-job-server uses 100% cpu and doesn't accept any connections from workers or gearadmin. As far as seen kill 9 is the only way to restart the job server. Debug log shows that this cpu usages starts after Gear connection disconnected:. After that only messages about Accepted connection from are shown

DEBUG 2022-09-11 13:39:29.600015 [     1 ] Sent NO_JOB -> libgearman-server/thread.cc:356
  DEBUG 2022-09-11 13:39:29.600037 [     1 ] Received RUN wakeup event -> libgearman-server/gearmand_thread.cc:633
  DEBUG 2022-09-11 13:39:29.600372 [     1 ] 192.168.4.166:53792 Ready     POLLIN  -> libgearman-server/gearmand_con.cc:138
  DEBUG 2022-09-11 13:39:29.600397 [     1 ] read 12 bytes -> libgearman-server/io.cc:810
  DEBUG 2022-09-11 13:39:29.600406 [     1 ] Gear unpack -> libgearman-server/plugins/protocol/gear/protocol.cc:117
  DEBUG 2022-09-11 13:39:29.600413 [     1 ] GEAR PRE_SLEEP -> libgearman-server/plugins/protocol/gear/protocol.cc:285
  DEBUG 2022-09-11 13:39:29.600421 [     1 ] Received GEARMAN_PRE_SLEEP -> libgearman-server/thread.cc:311
  DEBUG 2022-09-11 13:39:29.600428 [     1 ] PACKET COMMAND: GEARMAN_PRE_SLEEP -> libgearman-server/server.cc:122
  DEBUG 2022-09-11 13:39:29.600438 [     1 ]   192.168.4.166:53792 Watching  POLLIN  -> libgearman-server/gearmand_thread.cc:151
  DEBUG 2022-09-11 13:39:30.380466 [  main ] accept() fd:94 -> libgearman-server/gearmand.cc:886
   INFO 2022-09-11 13:39:30.380664 [  main ] Accepted connection from 192.168.4.143:47734
  DEBUG 2022-09-11 13:39:30.380779 [     1 ] 192.168.4.143:57684 Ready     POLLIN  -> libgearman-server/gearmand_con.cc:138
   INFO 2022-09-11 13:39:30.380832 [     1 ] Peer connection has called close()
   INFO 2022-09-11 13:39:30.380848 [     1 ] Disconnected 192.168.4.143:57684
   INFO 2022-09-11 13:39:30.380923 [     1 ] Gear connection disconnected: -:-
  DEBUG 2022-09-11 13:39:30.576218 [  main ] accept() fd:95 -> libgearman-server/gearmand.cc:886
   INFO 2022-09-11 13:39:30.576297 [  main ] Accepted connection from 192.168.4.143:47740
  DEBUG 2022-09-11 13:39:30.754681 [  main ] accept() fd:96 -> libgearman-server/gearmand.cc:886
   INFO 2022-09-11 13:39:30.754775 [  main ] Accepted connection from 192.168.4.72:46614
  DEBUG 2022-09-11 13:39:31.218241 [  main ] accept() fd:97 -> libgearman-server/gearmand.cc:886
   INFO 2022-09-11 13:39:31.218321 [  main ] Accepted connection from 192.168.4.142:38112
  DEBUG 2022-09-11 13:39:31.218457 [  main ] accept() fd:98 -> libgearman-server/gearmand.cc:886
   INFO 2022-09-11 13:39:31.218474 [  main ] Accepted connection from 192.168.4.165:44088
  DEBUG 2022-09-11 13:39:31.278983 [  main ] accept() fd:99 -> libgearman-server/gearmand.cc:886
   INFO 2022-09-11 13:39:31.279068 [  main ] Accepted connection from 192.168.4.166:48492
  DEBUG 2022-09-11 13:39:31.457952 [  main ] accept() fd:100 -> libgearman-server/gearmand.cc:886
   INFO 2022-09-11 13:39:31.458045 [  main ] Accepted connection from 192.168.4.165:44098

We have used the packages that Debian provide us, but also build our own packages from master. We use PHP and Perl workers. Server version: 1.1.19.1 libgearman: 1.1.19.1 libgearman-client-perl: 2.004.015-1

Maybe this two bugs are connected to each other.

Anybody have seen this behavior before?

esabol commented 2 years ago

Anybody have seen this behavior before?

Bug 1: Nope. But I think it's related to your usage of a persistent storage plugin. I've never seen that "Comparing queue %u to limit %u for priority %u" message in my usage, but I don't use any of the persistent storage plugins. Looking at the code, I'm guessing you're hitting the GEARMAND_JOB_QUEUE_FULL condition? Not sure.

We try to discourage usage of the persistent storage plugins these days. The consensus among the gearmand developers is that they were a bad idea. We still accept PRs for maintaining the persistent storage layers, but we've stopped active development on them. It's better to use a design pattern where job persistence is implemented by workers. It scales better and is generally more robust. (There are two frameworks for implementing such a system, one is called Gearstore and another is called Garavini. You might want to look into them. It's also straighforward to implement your own persistent storage tasks once you understand the design pattern.)

Bug 2: This sounds like issue #232. There's a prospective patch in that issue. Please try it and let us know if it resolves the issue.

Also, if upgrading your OS broke gearmand, I would recommend running your gearmand in a Docker container on whatever OS version worked for you in the past. See PR #327 for the Dockerfiles I use.

rayzilt commented 2 years ago

Thank you for your suggestions.

We possibly identified two workers that seems to trigger both condition. We disabled them and since then gearmand is running stable. We still need to investigate these workers further.

I'm not sure if we hit GEARMAND_JOB_QUEUE_FULL condition, we had some debugging in place but have not yet fully debugged the situation as we think we found the trigger of both situations.

We have put multiple debug log lines in job.cc to find out where exactly gearmand went before going down with exit 1.

Section of the log with the debug line:

  DEBUG 2022-09-12 12:26:00.817004 [     1 ] Comparing queue 95 to limit 0 for priority 1 -> libgearman-server/job.cc:175
  DEBUG 2022-09-12 12:26:00.817011 [     1 ] Line: 186 -> libgearman-server/job.cc:186
  DEBUG 2022-09-12 12:26:01.028396 [  main ] THREADS: 1 -> libgearman-server/gearmand.cc:274
   INFO 2022-09-12 12:26:01.028593 [  main ] Initializing MySQL module

Section of job.cc:

The log indicates that gearmand went to server_job= gearman_server_job_create(server);. That all we have for now.

If we find the issue in those workers, I'll post this in this issue for further reference.

esabol commented 2 years ago

It's not outside the realm of possibility that poorly implemented clients and workers can cause problems with the gearmand server. If that's the case, it's a flaw and any help in tracking down what the problem might be would be appreciated. :smile:

Were there any changes to these workers when you migrated to Debian 11? Or are these new workers?

Also, what OS are you migrating from (where presumably everything worked correctly)?

gearman / gearmand

Gearmand crashes or unreachable #345