brianlmoon / net_gearman

A PHP interface for Gearman

Suddenly gearman workers start refusing jobs but works fine after restart. #37

Closed softobizadmins closed 6 years ago

softobizadmins commented 6 years ago

Hi Brian

We are running an application that uses Gearman, and we are facing a strange issue with the workers. We have 6 jobs defined for the workers, and they are normally processed fine. Then, all of a sudden, a couple of the jobs stop being processed (the workers no longer accept requests for them), while the rest of the jobs keep behaving normally. I raised the gearman-manager verbosity to the maximum (-vvvvv) to track down the issue, but I could not find any clue in the gearman-manager logs. After that, I turned on the DEBUG log level for gearmand, but I still could not track down the issue.

Everything works fine after I restart the gearman workers. I have pasted the gearmand and gearman-manager config file below. Please help me to identify the issue. Thanks.

##########################
gearman-manager/config.ini
##########################
[GearmanManager]

; workers can be found in this dir
host=127.0.0.1
worker_dir=/etc/gearman-manager/workers

; Reload workers as new code is available
;auto_update=1

; 3 workers will do all jobs
count=3

; Workers will only live for 25 Minutes.
max_worker_lifetime=600


####################
gearmand.conf
####################
description "Gearmand Server"

start on started mysql
stop on runlevel [016]
kill timeout 3
respawn
exec gearmand \
    --listen=127.0.0.1 \
    --log-file=/var/log1/gearmand/gearmand.log \
    --verbose=DEBUG \
    --job-retries=2 \
    --queue-type=MySQL \
    --mysql-host=localhost \
    --mysql-port=3306 \
    --mysql-user=gearman \
    --mysql-password=gearman \
    --mysql-db=gearman \
    --mysql-table=gearman_01 \
    2>> /var/log/gearmand.log

brianlmoon commented 6 years ago

The first thing I would recommend is to change the config to use dedicated workers for the jobs that are not getting done. If dedicated workers still don't get them done, then something else may be going on.
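A minimal sketch of what a dedicated-worker setup might look like in gearman-manager/config.ini, assuming GearmanManager's `dedicated_count` option and per-function sections. The section name `job2` is only a stand-in for whichever function stops being processed, and the counts are illustrative, not recommended values.

```ini
[GearmanManager]
host=127.0.0.1
worker_dir=/etc/gearman-manager/workers

; 3 workers will do all jobs
count=3

; Assumption: every job also gets at least 1 worker that does only that job
dedicated_count=1

max_worker_lifetime=600

; Assumption: per-function section; job2 is a placeholder for a stalled job
[job2]
; At least 2 workers that do only job2
dedicated_count=2
```

If the stalled job is still not picked up by a worker that handles nothing else, the problem is unlikely to be other jobs starving it.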

softobizadmins commented 6 years ago

Thanks for the reply. Whenever processing suddenly gets stuck for a specific function on the worker server, I see the same pattern in gearman-manager.log at the time of the issue. Please let me know what you think.

[2018-02-05 15:54:09.733015] 14702 PROC Started child 17607 (job1,job2,job3,job4,job5)
[2018-02-05 15:54:09.733114] 14702 DEBUG Registering signals for child
[2018-02-05 15:54:09.733408] 17607 DEBUG Adjusted max run time to 647 seconds (max_worker_lifetime:600 + splay:47)
[2018-02-05 15:54:09.733512] 17607 WORKER Adding server 127.0.0.1
[2018-02-05 15:54:09.733604] 17607 WORKER Adding job job1 ; timeout:
[2018-02-05 15:54:09.733665] 17607 WORKER Adding job job2 ; timeout:
[2018-02-05 15:54:09.733716] 17607 WORKER Adding job job3 ; timeout:
[2018-02-05 15:54:09.733765] 17607 WORKER Adding job job4 ; timeout:
[2018-02-05 15:54:09.733813] 17607 WORKER Adding job job5 ; timeout:
[2018-02-05 15:54:18.765310] 14702 PROC Child 17606 exited with error code of 0 (job1,job2,job3,job4,job5)

brianlmoon commented 6 years ago

Does another worker not get started? What does your worker code look like?

brianlmoon commented 6 years ago

This looks odd:

WORKER Adding job job2 ; timeout:

There is no timeout value there. When no timeout is set, the logs don't add the timeout: label at all, e.g.

21468 WORKER Adding job Reverse_String

I am guessing something in your config is setting a bad timeout and perhaps there is some bug in the code that is treating that as zero. I can't replicate that with the config you showed above.
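One way to test that guess would be to set the timeout explicitly instead of leaving it unset. This is a hedged sketch, assuming GearmanManager's `timeout` option (in seconds) at the global and per-function level; the values 300 and 60 and the choice of `job2` are hypothetical and would need to be adapted to the real jobs.

```ini
[GearmanManager]
host=127.0.0.1
worker_dir=/etc/gearman-manager/workers
count=3
max_worker_lifetime=600

; Assumption: global timeout in seconds applied to all jobs (hypothetical value)
timeout=300

; Assumption: per-function override; job2 is a placeholder job name
[job2]
timeout=60
```

If the "Adding job" log lines still show an empty timeout: label after a restart with explicit values, that would point at the code path rather than the config.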

brianlmoon commented 6 years ago

No new feedback