rayzilt opened this issue 2 years ago
Bug 1: Nope. But I think it's related to your use of a persistent storage plugin. I've never seen that "Comparing queue %u to limit %u for priority %u" message in my usage, but I don't use any of the persistent storage plugins. Looking at the code, I'm guessing you're hitting the `GEARMAND_JOB_QUEUE_FULL` condition? Not sure.
We try to discourage use of the persistent storage plugins these days. The consensus among the gearmand developers is that they were a bad idea. We still accept PRs for maintaining the persistent storage layers, but we've stopped active development on them. It's better to use a design pattern where job persistence is implemented by workers. It scales better and is generally more robust. (There are two frameworks for implementing such a system, one called Gearstore and another called Garavini. You might want to look into them. It's also straightforward to implement your own persistent storage tasks once you understand the design pattern.)
Bug 2: This sounds like issue #232. There's a prospective patch in that issue. Please try it and let us know if it resolves the issue.
Also, if upgrading your OS broke gearmand, I would recommend running your gearmand in a Docker container on whatever OS version worked for you in the past. See PR #327 for the Dockerfiles I use.
Thank you for your suggestions.
We have possibly identified two workers that seem to trigger both conditions. We disabled them, and gearmand has been running stably since. We still need to investigate these workers further.
I'm not sure if we hit the `GEARMAND_JOB_QUEUE_FULL` condition. We had some debugging in place, but we have not yet fully debugged the situation, as we think we found the trigger of both situations.
We have put multiple debug log lines in job.cc to find out exactly where gearmand went before exiting with status 1.
Section of the log with the debug line:

```
DEBUG 2022-09-12 12:26:00.817004 [ 1 ] Comparing queue 95 to limit 0 for priority 1 -> libgearman-server/job.cc:175
DEBUG 2022-09-12 12:26:00.817011 [ 1 ] Line: 186 -> libgearman-server/job.cc:186
DEBUG 2022-09-12 12:26:01.028396 [ main ] THREADS: 1 -> libgearman-server/gearmand.cc:274
INFO 2022-09-12 12:26:01.028593 [ main ] Initializing MySQL module
```
The log indicates that gearmand reached the line `server_job= gearman_server_job_create(server);` in job.cc.
That's all we have for now. If we find the issue in those workers, I'll post it in this issue for further reference.
It's not outside the realm of possibility that poorly implemented clients and workers can cause problems for the gearmand server. If that's the case, it's a flaw, and any help in tracking down what the problem might be would be appreciated. :smile:
Were there any changes to these workers when you migrated to Debian 11? Or are these new workers?
Also, what OS are you migrating from (where presumably everything worked correctly)?
We recently upgraded two Gearman Job Servers to Debian 11 and have been experiencing problems since then. As far as we have debugged the situation, we have identified two possible bugs.
Bug one: the gearman-job-server crashes with exit status 1. The debug log shows that it's always after:

```
Comparing queue %u to limit %u for priority %u -> libgearman-server/job.cc:175
```

Currently we are building a version with some extra debug lines (in job.cc) to investigate this.
Bug two: the gearman-job-server uses 100% CPU and doesn't accept any connections from workers or gearadmin. As far as we can tell, `kill -9` is the only way to restart the job server. The debug log shows that the CPU usage starts after `Gear connection disconnected:`. After that, only `Accepted connection from` messages are shown.

We have used the packages that Debian provides, but have also built our own packages from master. We use PHP and Perl workers.

Server version: 1.1.19.1
libgearman: 1.1.19.1
libgearman-client-perl: 2.004.015-1
Maybe these two bugs are connected to each other.
Has anybody seen this behavior before?