brianlmoon / net_gearman

A PHP interface for Gearman
84 stars 46 forks source link

socket_set_block causing close latency #34

Closed caffeinejolt closed 6 years ago

caffeinejolt commented 6 years ago

I have been a long time user of and much older version of this library. Recently I started messing with the most recent and noticed it is much slower when running side by side (same gearman server and everything else). When I stepped through the code, I found that the culprit was socket_set_block() inside of Connection::close

When I comment that out, it runs without the 1-2 second close() delay. Of course, who knows what the collateral damage is from commenting it out :-)

brianlmoon commented 6 years ago

You may end up with a lot of CLOSE_WAIT connections on the gearmand server. What is your use case where the connection close time is a performance issue? Our workers run for an hour, staying connected the whole time.

caffeinejolt commented 6 years ago

My workers do stay connected most of the time. Its the clients (or whatever they are called in the gearman world - the job submitters) that connect/disconnect frequently. The line causes the disconnection to take an extra 1-2 seconds for the clients (end-users see the latency). I see the comment here: http://php.net/manual/en/function.socket-close.php recommending the approach you take and that does make sense. I may just end up sticking with the older version that did not support non-blocking ops since you are right - I could end up with a lot of CLOSE_WAIT connections.

I wish I had more constructive feedback - I have never dealt with non-blocking sockets. Do you see the disconnection lag with/without that line when you run clients?

brianlmoon commented 6 years ago

I did not think about the client side. I do not see that lag. What OS and PHP version are you using?

caffeinejolt commented 6 years ago

Ok - I dug some further and am not sure the issue is specifically socket_set_block any more. However, there is a speed issue compared to the old version I run. Here is the difference in run times with only swapping out the version of net_gearman:

New: real 0m1.161s user 0m0.011s sys 0m0.005s

Old (uses blocking sockets): real 0m0.150s user 0m0.015s sys 0m0.006s

I am not sure what is causing the slowdown - but it seems to be somewhere in Connection::close - here is some profiling I ran: https://drive.google.com/file/d/0B02yOk4TFZH9RHd5SGQ2bTAxSDg/view?usp=sharing

So when I remove socket_set_block, it spends no time in the Connection::close do/while loop whereas when socket_set_block is used, it spends 88% of total time in the Connection::close do/while loop. So the major slowdown is here in Connection::close:

        do {
            $buf = @socket_read($this->socket, 8192);
        } while (strlen($buf) > 0);

When I remove those three lines - it runs much faster:

real 0m0.255s user 0m0.014s sys 0m0.006s

So narrowing things down a bit more!

caffeinejolt commented 6 years ago

PHP 5.3.8

caffeinejolt commented 6 years ago

Linux 2.6

brianlmoon commented 6 years ago

PHP 5.3.8 is quite old. It has been end of life for over 3 years. I think I have not used it for over 5 years. This code was definitely not built or tested against it. I should update the composer requirements to indicate a minimum version of 5.6.

If you are stuck on PHP 5.3, you could try removing the @ on $buf = @socket_read($this->socket, 8192); and see if there are errors not getting caught there. You could also put a timer in that loop to see if it's looping and reading data off (which you should understand why that is happening) or if the socket_read is taking 1 second.

Are you passing in a timeout of 1 second perhaps when you create the connection?

caffeinejolt commented 6 years ago

Totally valid point on the PHP version - I will update to latest 5.6.x and test this out again and get back to you. Thanks again.

bdeshong commented 6 years ago

Can you explain to use why Net_Gearman_Connection::close() calls socket_set_block()?

We're using a two-second timeout on our end, and we're finding that, upon close() call, we're just blocked waiting to read from the socket for two seconds.

See the code in question here.

If we comment out the call to socket_set_block(), the close() method completes super fast. Even the 250ms or 1000ms blocking time makes a difference to us when we're trying to submit jobs as quickly as possible.

What was the thinking behind setting the socket to blocking? Or might it make sense to, at instantiation time or somehow, letting the caller indicate that they don't want to block when closing?

Thanks in advance!

brianlmoon commented 6 years ago

I will be honest, all of those changes were in response to an outage we had where workers, not clients were having connection issues that caused gearmand to fill up with CLOSE_WAIT connections. We use a 200ms timeout on the client side, so we likely never feel this problem. Looking at our internal repo where all those changes were originally made, the close function evolved over time and it's possible that not all of those tactics are required to get a clean shtudown. I am going to do some testing.

brianlmoon commented 6 years ago

@bdeshong Can you give the https://github.com/brianlmoon/net_gearman/tree/timeout_close_refactor branch a try and see if that resolves your issue? Setting the socket to blocking before closing is still the right choice. Otherwise, you don't really know if it closes and you can end up with a gearmand that runs out of file descriptors because of a ton of CLOSE_WAIT connections. The issue appears to be that the read loop was waiting on the timeout which is silly. If we can close the connection clean, then we don't need to try to read the buffer. If we are going to read the buffer, there is no reason to wait for whatever timeout was set for normal operations. We use a much lower timeout now when reading the buffer. But, in my tests, I never hit it. 99.9999% of the time the connection will close fine and the buffer read is never needed.

Further reading:

bdeshong commented 6 years ago

@brianlmoon Ah, this all makes sense, thanks for the explanation. We'll give this other branch a try in dev this morning. Stay tuned...

j03k64 commented 6 years ago

@brianlmoon should we open a pull request so we can discuss that timeout_close_refactor branch? It doesn't work out of the box (with at least 7.1) since you are calling askForWork with no parameters and the parameters on askForWork in Worker.php aren't optional.

This branch definitely makes the connection close much faster on MacOS.

Still doing some testing, but here is what I have: https://github.com/brianlmoon/net_gearman/compare/timeout_close_refactor...tave:upstream-changes-20180624?expand=1

j03k64 commented 6 years ago

From testing my fork of your branch (with a couple minor updates to fix the method signature) it looks like it performs much, much better.

Also, I've been running that if ($err && $err != 11 && $err != 35) { check in Client.php in production for quite a while so I am confident about that. I can't get through a test suite without having that there on MacOS. It seems like it was already being handled in a different block of code in Connection.php where both error codes 11 and 35 are ignored.

bdeshong commented 6 years ago

@brianlmoon / @jpirkey: Overall, this branch performs much better for us. However, we definitely see the same issue that @jpirkey described here. FWIW, we're running on PHP 7.2.6.

We believe that @jpirkey's change fixes the issue, so we'd love to see his changes accepted.

For the sake of keeping moving on our end and not being directly dependent on a dev branch on your end, we're going to fork from this branch, then apply @jpirkey's change that's not macOS-specific.

But with your change + @jpirkey's, we're a 👍on this.

j03k64 commented 6 years ago

@bdeshong fwiw, we've been running that error code handling in prod since Oct 13, 2017 with no issues. I've just been running off my own fork: https://github.com/tave/net_gearman/commit/b4c7eb31df9719d847e30b74a8d2168b7d7b2e74

I don't think that change is MacOS specific, it is just more prevalent on MacOS. image

source

So it should be fine to ignore these error codes.

brianlmoon commented 6 years ago

Let’s get that merged. Sounds like a winner.

bdeshong commented 6 years ago

Looks like @brianlmoon's changes from February are good. @brianlmoon, do you plan to open a PR for those or just merge this branch of yours to master?

@jpirkey, do you plan to open a PR from your forked changes to get them integrated back into the project, or did you specifically want to rely on your fork? I'd love to not have to rely on a fork on the ShootProof side, but also don't want to rush this for the sake of calling it done. 😄

I'm not the author of any of this code, but I'm happy to wrangle various commits from forks into a single PR if you all need me to!

j03k64 commented 6 years ago

I created a pull request against his branch (https://github.com/brianlmoon/net_gearman/pull/40).

I don't care to run my own fork either so assuming we are all good with this, I don't really have a preference on how it gets done. And while I have the ability to merge, I'd still prefer to defer to @brianlmoon on that.

brianlmoon commented 6 years ago

I will get this all merged in.

j03k64 commented 6 years ago

Made one last minor change to fix a variable name in Connection->close: https://github.com/brianlmoon/net_gearman/pull/40/files#diff-7f309b838cae721a520e158e201fb4d5L424

brianlmoon commented 6 years ago

Release 2.0.0 with these changes and more. https://github.com/brianlmoon/net_gearman/releases/tag/2.0.0