gearman / gearmand

http://gearman.org/

Stuck Jobs #343

Open steevhise opened 2 years ago

steevhise commented 2 years ago

We've been using Gearman for years and years, but lately, like over the last 6 months or so, we sometimes see jobs get "stuck". They will sit on a queue for hours or days if left alone, but as soon as gearmand is restarted, or one of the workers for that queue is restarted, the job goes out right away... The worker logs show no initial attempts to do the job, so the theory that the worker died while the server still thinks it's working on the job doesn't hold: the server just never hands out the job, apparently.

I've always thought of gearmand as this rock-solid piece of infrastructure that I don't have to worry about, and that the problem is always with our client or worker code, but this seems like something else...

What could cause this? Is it related to #301 ?

esabol commented 2 years ago

Well, are you setting timeouts for any jobs? #301 only applies if you are using timeouts. If you are, then my advice is: don't. If you aren't, then you shouldn't encounter that problem with gearmand. Personally, I have my workers use /usr/bin/timeout to wrap commands that do heavy work and might exceed certain time constraints, and then report an error condition to the client. See man timeout. Just my two cents.
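
Not from the thread, but here is a minimal sketch of that pattern using the pecl-gearman extension (PHP workers show up later in this thread); the function name, the wrapped command, and the 300s limit are all made up for illustration:

<?php
// Sketch only: a worker callback that lets /usr/bin/timeout enforce the limit
// instead of relying on gearmand-level job timeouts. `timeout` exits with
// status 124 when the time limit is hit.
function heavy_work(GearmanJob $job)
{
    $payload = escapeshellarg($job->workload());
    exec("/usr/bin/timeout 300s /usr/local/bin/do-heavy-work $payload", $output, $status);

    if ($status === 124) {
        // Report the timeout back to the client instead of leaving the job hanging.
        $job->sendException('work exceeded the 300s limit');
        $job->sendFail();
        return null;
    }
    return implode("\n", $output);   // normal result goes back as the job result
}

$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('heavy_work', 'heavy_work');
while ($worker->work()) { /* loop until an error */ }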

Other questions needed for clarification:

  1. What version of gearmand are you using? (Always a good idea to report the version of the software you are using!)
  2. Have you changed versions of gearmand at all in "the last 6 months or so"?
  3. Is there anything else you've changed in "the last 6 months or so"? E.g., how your clients or workers are implemented, operating system, etc.
steevhise commented 2 years ago
  • What version of gearmand are you using? (Always a good idea to report the version of the software you are using!)

The latest: 1.1.19.1+ds.

  • Have you changed versions of gearmand at all in "the last 6 months or so"?

No. It's been over 2 years since the latest came out, right? :) Of course, in that time there have been multiple major Debian upgrades, etc., but no, no changes to gearmand.

  • Is there anything else you've changed in "the last 6 months or so"? E.g., how your clients or workers are implemented, operating system, etc.

Now THAT's a good question. Besides the Debian upgrades mentioned above, starting a year ago we began submitting jobs from a new application and added 2 new queues/functions. It's unclear, but it seems the "stuck" jobs are only happening with those 2 new queues. So one theory is that the new client software is constructing the job data in some problematic way. But it's so intermittent that it seems like only part of the explanation...

BTW, the "6 months or so" is so vague because reports of this come from a junior sysadmin with a brain injury and a terrible memory problem, so it's just not clear how long this has been going on. It might have been a year. Hard to say.

More and more I'm thinking it must be a bug in our client code, but I just wanted to check that there's not some known issue with jobs "sticking", like something in the job data that's known to cause gearmand to choke, e.g., "definitely don't ever put a Unicode umlaut in a job" or some such...

Anyway, I'm so happy someone is actively dealing with this repo! Thanks.

esabol commented 2 years ago

Appreciate all the answers, but you didn't answer the most important question of whether you're setting timeouts on any of your jobs. Again, I recommend that you don't since that's a known (and still unresolved) issue.

Is there anything notable in your gearmand.log file when a job gets stuck?

gearmand uses the OS's libevent*.so libraries for important aspects of its operation, so any changes to libevent could impact gearmand. So that's always a possibility, I feel, when OS upgrades are involved.

You could try downgrading to gearmand 1.1.18, I suppose. Frankly, that's the version I've been using in production for ~4 years now, and it's been rock solid. I haven't had any issues, so I never bothered to upgrade my production environment to 1.1.19.1. (I do use 1.1.19.1 in my development environment though and haven't encountered any problems there either.) That said, I don't do non-basic things like setting timeouts on jobs or using any of the (quasi-deprecated) persistence plugins or dynamically scaling the number of workers.

If a worker is disconnected from gearmand (i.e., killed) while a job is in the middle of processing, I think it might be possible for the job to get stuck in that scenario. Can you test that? I recommend avoiding killing workers while they are in the middle of working on a job and only killing workers when they are idle. Sometimes you have to, of course. I'm just saying I would try to avoid that whenever feasible.
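
For what it's worth, a rough sketch (not from this thread) of one way to arrange that with the pecl-gearman and pcntl extensions: trap the signal in the worker and only honour it between jobs. Function names and the 5-second poll interval are illustrative.

<?php
// Sketch only: a worker that exits on SIGTERM, but never in the middle of a job.
pcntl_async_signals(true);
$shutdown = false;
pcntl_signal(SIGTERM, function () use (&$shutdown) { $shutdown = true; });

$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('some_function', 'some_function_impl');   // placeholder callback

// Worker-side I/O timeout in libgearman, not a server-side job timeout: it only
// makes work() return periodically while idle so the shutdown flag gets checked.
$worker->setTimeout(5000);

while (!$shutdown) {
    @$worker->work();
    $rc = $worker->returnCode();
    if ($rc != GEARMAN_SUCCESS && $rc != GEARMAN_TIMEOUT) {
        break;   // real error; let the parent respawn this worker
    }
}
// Reaching here means the worker is idle, so it is safe to exit now.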

esabol commented 2 years ago

Also, just to clarify, are you seeing high CPU load when a job becomes "stuck"? (If so, that could be issue #232.)

Have you implemented any signal handlers in your workers and are you sending signals to your workers? (If so, issue #230 could be related.)

steevhise commented 2 years ago

I don't think we're using timeouts. They are not set by default, right?

You've given us several interesting things to think about, thank you.

A few other answers:

esabol commented 2 years ago

I don't think we're using timeouts. They are not set by default, right?

Correct.

we have a daemon with a parent process that spawns child workers, and kills them when they're idle and have finished a certain number of jobs.

I do the same thing, I think. I have my workers die after processing a certain number of jobs and then I spawn new workers to replace them.

steevhise commented 2 years ago

So upon further digging into our code and adding more logging, I found that part of the problem is that sometimes the worker child was dying quietly and never returning info to the gearman server, so the job never gets "released." And it was dying because of some "garbage" characters in the data it gets from gearman. Oddly, though, now that I've changed that code to catch the error and return properly, the job gets tried a second time and works just fine, so that "garbage" is not inherent to the job data. Is it possible that gearmand is putting it at the end of the data, but only intermittently? The garbage data is: a newline, then the characters "RES", and then another newline, a single quote, and a semicolon. These characters are not visible in the gearman persistent store (we're using SQLite) data for the job.

SpamapS commented 2 years ago

Could you post a full packet dump? RES should always be preceded by a null:

4 byte magic code - This is either "\0REQ" for requests or "\0RES" for responses

See http://gearman.org/protocol/
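
For reference, a small sketch (assumptions mine, not from the thread) of what a well-formed response frame header looks like per that protocol doc: 4-byte magic "\0RES", then a 4-byte big-endian packet type, then a 4-byte big-endian data size.

<?php
// Sketch: parse the 12-byte header of a gearman response frame.
function parse_response_header(string $bytes): array
{
    if (strlen($bytes) < 12) {
        throw new RuntimeException('need at least 12 header bytes');
    }
    if (substr($bytes, 0, 4) !== "\0RES") {
        // A bare "RES" with no leading NUL, like the garbage described above,
        // would land here; that suggests a frame being split or read mid-stream.
        throw new RuntimeException('bad magic: ' . bin2hex(substr($bytes, 0, 4)));
    }
    $type = unpack('N', substr($bytes, 4, 4))[1];  // e.g. 13 = WORK_COMPLETE
    $size = unpack('N', substr($bytes, 8, 4))[1];  // length of the data that follows
    return ['type' => $type, 'size' => $size];
}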


steevhise commented 2 years ago

This is so intermittent that it's difficult to get a packet dump - is there a way to make Gearmand log raw packets it sends?

From our worker logs it looks a lot like sometimes there's a RES at the end of the job data, and sometimes not... But actually, maybe this is an issue with the Perl Gearman::Worker module (our worker code uses that to connect to the server)... Maybe it's not chopping the magic code off the end of the data properly, sometimes?

thanks!

esabol commented 2 years ago

But actually, maybe this is an issue with the Perl Gearman::Worker module (our worker code uses that to connect to the server)... Maybe it's not chopping the magic code off the end of the data properly, sometimes?

That would be difficult for me to believe. Are you using Gearman::Worker version 2.004.015? If you're not, 2.004.013 (from 2018-02-05) fixed a bug in Gearman::Util::read_res_packet that could be related.

That said, all of my workers are implemented using Gearman::Worker (2.004.015), and I've never seen that in many thousands of processed jobs over 4+ years. Then again, I'm using SSL connections and passing strictly ASCII payloads. Is there a possibility some of your jobs have UTF-8 payloads? (Just spit-ballin'.)

steevhise commented 2 years ago

Is there a possibility some of your jobs have UTF-8 payloads? (Just spit-ballin'.)

That is totally possible. But the real mystery is why it would only happen sometimes. If there's bad data, it seems like it would have the same problem each time it tries the job. But the second try always sails through, no problem...

We'll check the Gearman::Worker version, tomorrow. thanks!

azzmin commented 2 years ago

That's a good point about payloads. Not sure if it helps, but I'm a Gearman fan and really happy it's being continued. @esabol is an absolute legend. So I will add my 50 cents. The way to avoid any issues with any payload is to send the payload to file storage and reference it by a GUID. You have to remember what Gearman is for: handling payloads isn't its job; its job is to pass a reference to the payload to use. @steevhise good luck and hope you get it sorted.
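
Not from the thread, but a minimal sketch of that reference-passing pattern with pecl-gearman; the spool path, function name, and ID scheme are all hypothetical:

<?php
// Sketch only: the client stores the payload in a file and submits just its ID.
$id      = bin2hex(random_bytes(16));                 // the "GUID" reference
$payload = json_encode($bigJobData);                  // $bigJobData: whatever your app builds
file_put_contents("/var/spool/myapp/jobs/$id.json", $payload);

$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('process_job', $id);            // tiny, ASCII-only workload

// Worker side: resolve the reference, do the work, tidy up afterwards.
function process_job(GearmanJob $job)
{
    $path = '/var/spool/myapp/jobs/' . basename($job->workload()) . '.json';
    $data = json_decode(file_get_contents($path), true);
    // ... do the actual work with $data ...
    unlink($path);                                    // clean up, as suggested above
}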

esabol commented 2 years ago

That's very kind of you to say, @tomcoxcreative, but I'm standing on the shoulders of giants and @SpamapS is much, much taller than me. To be honest, multithreaded event loop programming isn't my forté. I'm really only good at fixing portability issues and analyzing bug reports. :smile:

@steevhise wrote:

But the real mystery is why it would only happen sometimes. If there's bad data, it seems like it would have the same problem each time it tries the job. But the second try always sails through, no problem...

Yeah, I don't know. Maybe the UTF-8 encoding is invalid and that could screw up the parsing? I'm just trying to find differences in what you're doing and what I'm doing and I know handling Unicode in Perl can be tricky. I recently watched this talk about that from the most recent Perl Conference: https://youtu.be/yH5IyYyvWHU. I base64-encode a lot of my job payloads to avoid having to deal with non-ASCII characters.
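
Not from the thread, but for illustration, that base64 wrapping looks like this in PHP ($client, $job, and $jobData stand in for your own objects):

<?php
// Client side: the workload that goes over the wire is strictly ASCII.
$client->doBackground('some_function', base64_encode(json_encode($jobData)));

// Worker side: undo the wrapping before using the data.
$jobData = json_decode(base64_decode($job->workload()), true);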

steevhise commented 2 years ago

So far, looking at the jobs that have been stuck, I haven't seen any Unicode. One additional trickiness about our use case is that these 2 new queues are supplied jobs by Node.js clients, whereas previously everything was Perl. We're using Abraxas as the Node client library. One other thing different is that these new queues' jobs contain a lot more data, so @tomcoxcreative's point is well taken. Previous queues only put a single ID that gets looked up in a database by the workers; these new ones have lots of unique data encoded as JSON. Maybe that's a mistake, but I didn't see any warnings against that in the docs, IIRC... Thanks everyone for all the help, and I echo Tom's sentiments. I'm really glad there are such talented folks still working on Gearman!

SpamapS commented 2 years ago

So far, looking at the jobs that have been stuck, I haven't seen any Unicode. One additional trickiness about our use case is that these 2 new queues are supplied jobs by Node.js clients, whereas previously everything was Perl. We're using Abraxas as the Node client library.

That the jobs come from newer/different clients could be a clue. Check to make sure that client doesn't set timeouts silently, and also check what REQ commands it is sending. Gearmand shouldn't allow a weird or newish client to break older, stable workers but it's possible that something deep in the protocol isn't mapping properly.

One other thing different is that these new queues' jobs contain a lot more data, so @tomcoxcreative's point is well taken. Previous queues only put a single ID that gets looked up in a database by the workers; these new ones have lots of unique data encoded as JSON. Maybe that's a mistake, but I didn't see any warnings against that in the docs, IIRC...

I think you're ok with 1MB or less of data. It just gets ugly because each bit of data is a contiguous section of memory from the heap and that can get pretty expensive especially if your workload is intermittent.

Thanks everyone for all the help, and I echo Tom's sentiments. I'm really glad there are such talented folks still working on Gearman!

I'm just here to ship releases, something I haven't done lately but hopefully will get going again soon. In the meantime, @esabol is a humble rockstar and I think I speak for both of us when I say we really appreciate the kind words and your feedback as users.


azzmin commented 2 years ago

@steevhise if you are sending JSON in the messages, then it will be very difficult to predict your payload size. As @SpamapS mentions, the memory use for large payloads can end up exceeding what your use case can handle, and probably intermittently, as you are seeing. So if, in front of Gearman, you send the payload to file storage and then read it back in from your Gearman trigger, you shouldn't see these random payload-size issues. You can also tidy up by using the OS's tmp directory or by removing the payload files as part of your trigger.

derde commented 2 years ago

I got stuck jobs very simply:

My test function is called 'diameter':

while true ; do
  gearman -f diameter <<< "mooa" &
  gearman -f diameter <<< "mooa" &
  gearman -f diameter <<< "mooa" &
  gearman -f diameter <<< "mooa" &
  wait
done

It locks up after 2 seconds of normal behaviour answered from random threads. Lock-up behaviour is that queries (e.g. gearadmin --workers) at random either get a stuck thread, or make progress.

The work-around is to run gearmand without threads (gearmand -t 0 or gearmand -t 1).

Update: that was on 1.1.18+ds-3.1ubuntu1. This particular thread lockup is no longer present on 1.1.19.1+ds-2build1.

esabol commented 2 years ago

This particular thread lockup is no longer present on 1.1.19.1+ds-2build1

So we're good then? 😄

My guess as to why you would see this issue with 1.1.18 but not 1.1.19.1 is this commit: https://github.com/gearman/gearmand/commit/45bd95402d71c583c4737496cc0b48930d548c14

steevhise commented 2 years ago

So we're good then? 😄

Well, my case is using 1.1.19.1, so these must be 2 different issues.

And we're still seeing it and still haven't figured out why. Additional weirdness observed: sometimes a stuck job seems to make the whole server get whacked out and hold onto other queues' jobs too, or just not hand them out as fast (different from the function where the first stuck job appeared).

Also, while investigating this, I noticed that sometimes a stuck job could be taken by a worker (with the child thread just hanging, doing nothing) but also still be on the queue. My understanding of the protocol is that this wouldn't happen, that a job taken by a worker would mean the server takes it out of the persistent store... Am I wrong?

esabol commented 2 years ago

"So we're good then?" was directed to @derde. Not dismissing your issue @steevhise !

My understanding of the protocol is that this wouldn't happen, that a job taken by a worker would mean the server takes it out of the persistent store... Am I wrong?

I don't think you're wrong, but I do wonder if the problem you're seeing is a problem with the persistent storage layer and the whole interaction there. As you may know, the persistent storage support in gearmand has been quasi-deprecated in favor of a design pattern where job persistence is implemented by workers. It scales better and is generally more robust. (There are two frameworks for implementing such a system, one is called Gearstore and another is called Garavini. You might want to look into them. It's also straightforward to implement your own persistent storage tasks once you understand the design pattern.) I say "quasi-deprecated" as they are not officially deprecated and we still accept PRs for maintaining the persistent storage layers, but we've stopped active development on them.
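
For what it's worth, here is a very rough, hypothetical sketch of the general shape of that design pattern; it is not how Gearstore or Garavini actually implement it, and every name, table, and function below is made up. The idea is that a storage worker records each request before forwarding it, the real worker reports back when it finishes, and anything still recorded after a crash can be resubmitted by a sweeper.

<?php
// Hypothetical sketch only (not Gearstore/Garavini code).
$db = new PDO('sqlite:/var/lib/myapp/jobs.db');
$db->exec('CREATE TABLE IF NOT EXISTS jobs (uniq TEXT PRIMARY KEY, fn TEXT, payload BLOB)');

$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);

// Clients submit everything through this function instead of the real one.
$worker->addFunction('store_and_submit', function (GearmanJob $job) use ($db) {
    $req = json_decode($job->workload(), true);      // {"uniq": ..., "fn": ..., "payload": ...}
    $db->prepare('INSERT OR REPLACE INTO jobs VALUES (?, ?, ?)')
       ->execute([$req['uniq'], $req['fn'], $req['payload']]);

    $client = new GearmanClient();
    $client->addServer('127.0.0.1', 4730);
    $client->doBackground($req['fn'], $req['payload'], $req['uniq']);
});

// The real worker submits its unique ID here when it has finished the job.
$worker->addFunction('job_done', function (GearmanJob $job) use ($db) {
    $db->prepare('DELETE FROM jobs WHERE uniq = ?')->execute([$job->workload()]);
});

while ($worker->work()) { /* loop */ }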

steevhise commented 2 years ago

For us it's scaled pretty well to just have multiple gearman servers (other than this particular problem), and we have some database things to make sure the same job never gets given to more than one of those. It seems against the basic idea of a queue server to offload storage of the jobs to the workers. I would think that the "authority" that hands out jobs should also be the arbiter of what each job is. Otherwise, if a worker takes a job and screws it up, it just goes away, potentially, and the server never knows what happened. Sorry, going off on a philosophical tangent...

SpamapS commented 2 years ago

Gearman isn't really a queue server; it just gets used that way because background jobs do queue. The state backends were added in part because so many people misused gearman and painted themselves into a corner with background jobs.

However, there are very few guarantees in the protocol. It was designed for efficient job distribution and low latency, not the guarantees typically offered by pure messaging systems; it can't really do "at least once" or "only once".

The protocol does have a way to use multiple gearman servers without assignment from a database, which is to hash the unique ID on the client side. When you submit a job, if the unique ID already exists, no new job is created and the same handle will be returned to the client. This is one of gearman's super powers. You can read more here:

http://fewbar.com/2020/04/gearman-is-misunderstood/
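
Not from the thread, but a small sketch of what that looks like with pecl-gearman; the server list, the hashing choice, and the function name are illustrative. The third argument to doBackground() is the unique ID, and hashing it client-side keeps a given unique on one server without any database coordination:

<?php
// Sketch only: pick the server from the unique ID, then let gearmand coalesce
// duplicate submissions of that unique into the existing job.
$servers = [['10.0.0.1', 4730], ['10.0.0.2', 4730], ['10.0.0.3', 4730]];
$unique  = 'invoice-12345';                          // stable, app-defined ID
$index   = crc32($unique) % count($servers);         // same unique -> same server

$client = new GearmanClient();
$client->addServer($servers[$index][0], $servers[$index][1]);

// If a job with this unique is already queued there, gearmand returns the
// existing handle instead of creating a second job.
$handle = $client->doBackground('process_invoice', json_encode($data), $unique);  // $data: your payload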


SpamapS commented 1 year ago

My understanding of the protocol is that this wouldn't happen, that a job taken by a worker would mean the server takes it out of the persistent store... Am I wrong?

I missed this the first time. Yes you are mistaken.

The point of the persistent store is to hold on to the job until it has been completed in case there is a loss of state in the gearmand itself. When background jobs are assigned to workers, they are marked as such only in memory on the gearmand server. They are only removed from the persistent queue when they are done, which is only when a WORK_FAIL, WORK_COMPLETE, or WORK_EXCEPTION packet has been received from a worker, or the job has exceeded the retry count. And likewise, they are only retrieved from the queue backend if gearmand is starting up, trying to recover its in-memory state as best it can.

HBSpy commented 1 year ago

I'm also hitting this kind of intermittent stuck job.

gearadmin --status | grep update_user
update_user     15  0   1

The worker is idle, but the server does not wake up the worker or send a job to it. GDB shows the worker backtrace as follows:

(gdb) bt
#0  0x00007f479d729dc0 in __poll_nocancel () from /lib64/libc.so.6
#1  0x00007f477fba9c93 in gearman_wait(gearman_universal_st&) () from /lib64/libgearman.so.8
#2  0x00007f477fbac891 in gearman_worker_grab_job () from /lib64/libgearman.so.8
#3  0x00007f477fbad377 in gearman_worker_work () from /lib64/libgearman.so.8
#4  0x00007f477fdcfd7d in zif_gearman_worker_work () from /usr/lib64/php/modules/gearman.so
#5  0x000000000064eadc in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER ()
#6  0x00000000005fcb1b in execute_ex ()

Even when I run the worker with gearman -f update_user -w, it still gets stuck sometimes.

Totally confused...

[root@WRDNMS tasks]# gearmand  -V

gearmand 1.1.20 - https://github.com/gearman/gearmand/issues
[root@WRDNMS tasks]# php --ri gearman

gearman

gearman support => enabled
extension version => 2.0.6
libgearman version => 1.1.19.1
Default TCP Host => localhost
Default TCP Port => 4730


esabol commented 1 year ago

Hello, @HBSpy!

libgearman version => 1.1.19.1

That's out of date. Our latest release is 1.1.20.

Your PHP extension is also very out of date. I recommend using the master branch of https://github.com/php/pecl-networking-gearman. (Don't use the current PECL release, since it's old and does not include many bug fixes that are in the unreleased master branch.)

git clone https://github.com/php/pecl-networking-gearman
cd pecl-networking-gearman
phpize
./configure
make

Also, your backtrace is for the worker. What is actually relevant to us is the server.

If you are up to compiling your own gearmand and libgearman, please try this patch and report your findings here: https://georgi.unixsol.org/programs/gearmand-cpu-spinning.patch

HBSpy commented 1 year ago

Thx, @esabol

With all components updated, including the gearmand-cpu-spinning patch, it still gets stuck.

php --ri gearman

gearman

gearman support => enabled
extension version => 2.1.0
libgearman version => 1.1.20
Default TCP Host => localhost
Default TCP Port => 4730

Every 1.0s: gearadmin --status | grep update_user                               Mon Jul 10 11:01:54 2023

update_user     17  0   1

Other jobs are normal, so attaching gdb or strace to gearmand is quite a mess.

As a temporary workaround I added a setTimeout to the PHP worker, so the worker knows when it has timed out, continues the loop, and receives jobs again.

        $worker = new GearmanWorker();
        $worker->addServer($this->config->gearman->host);
        $worker->setId('user');
        $worker->setTimeout(30 * 1000);
        $worker->addFunction('update_user', [$this, 'update_user_func']);
        do {
            $worker->work();
            $returnCode = $worker->returnCode();
            if ($returnCode == GEARMAN_TIMEOUT) {
                    echo "timeout" . PHP_EOL;
                    continue;
            }
            if ($returnCode != GEARMAN_SUCCESS) {
                    echo "returncode:" . $returnCode . PHP_EOL;
                    break;
            }
        } while(true);
esabol commented 1 year ago

I strongly do not recommend setting a timeout. Timeouts are kind of broken. That's a separate, known issue: #301.

But maybe the PHP setTimeout() is a different thing. If it's implemented at the PHP level, it might be ok to use that. But if it uses the gearmand server to implement the worker timeout, that's problematic.

tsmgeek commented 5 months ago

@esabol the implementation calls gearman_worker_set_timeout from what I can see. Does that mean it's using the GM implementation of it?

https://github.com/wcgallego/pecl-gearman/blob/3cebb6405f2c23baf709972edd7b7be131f21880/php_gearman_worker.c#L232

If it is the case that we can't use it, doesn't that present an issue...

        $gmworker->setTimeout($gminterval);

        gc_disable();

        $strat = new \STS\Backoff\Strategies\LinearStrategy(1);
        $strat_count = 1;

        Signal::init();

        while($gmworker->work() || $gmworker->returnCode() == GEARMAN_TIMEOUT){
            if (Signal::isExit()) {
                $logger->log($this->getScriptName()."|EXIT CALLED via SIGNAL", Zend_Log::INFO);
                $this->doExit(0, $exitCallback);
            }elseif($maxrun>0 && (($this->started->getDuration() / 1000) > $maxrun)){
                $logger->log($this->getScriptName()."|TIMED EXIT", Zend_Log::INFO);
                $this->doExit(3, $exitCallback);
            }

            switch($gmworker->returnCode()){
                case GEARMAN_NO_ACTIVE_FDS:
                    sleep($strat->getWaitTime($strat_count));
                    $strat_count++;
                    break;
                case GEARMAN_TIMEOUT:
                    $strat_count=0;
                    if(is_callable($timeoutCallback)) {
                        $timeoutCallback();
                        gc_collect_cycles();
                    }
                    break;
                case GEARMAN_SUCCESS:
                case GEARMAN_WORK_FAIL:
                case GEARMAN_WORK_EXCEPTION:
                    $strat_count=0;
                    \Zend_Registry::getInstance()->get('db')->closeConnection();

                    // tidy after completion
                    gc_collect_cycles();

                    // single process
                    if (isset($options['single'])) {
                        exit(0);
                    }
                    break;
                default:
                    $logger->log($this->getScriptName()."|retcode: ".$gmworker->returnCode(), Zend_Log::ALERT);
                    if(is_callable($exitCallback)) $exitCallback();
                    break 2;
            }
        }

This is an example of how we use the timeout so it does not hang on ->work() when there are no jobs, and so we can exit when signals are received or when timeouts are set, while also handling other scenarios.

esabol commented 5 months ago

@esabol the implementation calls gearman_worker_set_timeout from what I can see. Does that mean it's using the GM implementation of it?

Sure. But I might have confused this timeout with CAN_DO_TIMEOUT tasks? I don't remember anymore if those are the same things or not. CAN_DO_TIMEOUT is what is broken.

Personally, I do not use any gearmand-level timeouts, and, after having tried to debug issue #301 for many days, I advise everyone else not to use gearmand's timeouts either. What I did instead was implement timeouts for jobs in my workers, rather than relying on gearmand to do it. The worker itself is responsible for enforcing a timeout on the work being done (often by simply using /usr/bin/timeout to execute another program that does the actual work). If the job takes too long to finish, the worker stops working on the job and returns an error code/message to the client. It's the client's responsibility to then interpret the error. It's worked well for us for many years.

Anyway, that's just what I've done to avoid gearmand's problematic implementation of timeouts and/or jobs taking forever. YMMV.

Frankly, I couldn't determine, after reading your post multiple times, if you were reporting a problem with timeouts or reporting that timeouts are working well for you.