basecamp / intermission

intermission helps you perform zero down time application maintenance
365 stars 20 forks source link

Behaviour when clients disconnect #5

Open sirupsen opened 9 years ago

sirupsen commented 9 years ago

Hi, as far as I can see there's no logic to deal with clients disconnecting prematurely. Due to how the queueing mechanism is implemented, this means that if a single client disconnects all clients that came after it would loop until they time out. If you want to remove them from the queue, you'd need to use something like ngx.on_abort or the log_by_lua handler.

However, even if code ran that removed them from the queue when the client disconnected—the queue would now be in a strange state where you'd have "gaps" in the queue. This means the paused_key .. id - 1 logic would be true for more than just one entry (the front of the queue) but also for elements in the middle of the queue. So e.g. if you have 1000 clients connecting, and 500 of them disconnect and they all happen to be spaced two apart—your worst case is to let 500 clients in at once. Unless I'm missing something and the full wait loop is run even for a disconnecting client (which would raise other issues).

Additionally another potential issue I can imagine is that because you're sleeping for some amount of time, popping a large queue can take a long time—a worst case of <queue size> * <sleep time>, which could easily exceed a minute with a queue of more than 10,000 requests, and likely start rejecting requests.

What I am contemplating doing for us is to not have a queue. I don't actually care if customers get into the platform in order, all I care about is that I know exactly how many requests I let in per second at a time (leaky bucket algorithm). The current implementation of intermission.lua doesn't seem to do that, because of the first problem I described with disconnecting clients actually increasing the thundering herd.

I'm also curious about memory and CPU usage with a large amount of requests per second, is this something you've benchmarked?

@anoldguy @tweibley

sirupsen commented 9 years ago

\cc @dhh since you recently talked about this, so it sounds like it's still in use (but has bugs as highlighted above).

At Shopify we're looking at creating a layer on top of the newly released lua-resty-limit-traffic. Let me know if there's interest in that from your side.

tweibley commented 9 years ago

Hey @Sirupsen the issue you describe is probably legit. In practice I don't believe we have ever seen it in production -- we don't pause more than a few seconds so the likelihood of a disconnect is low.

Regarding ordering we wanted to preserve that -- say you have multiple posts that change the order of a todo list's items... you basically want fifo so that ordering is preserved.

(Also, I think it's likely that the version of intermission we are running in production internally is different than what we have here in the repo -- I'll check in to that and push stuff back into this repo if need be.)

We too are in the process of evaluating lua-resty-limit-traffic but not for request pausing.

jasonm commented 8 years ago

Hey @Sirupsen I'm curious if you decided to go with intermission, built something atop lua-resty-limit-traffic, or something else? :smile:

csfrancis commented 7 years ago

We built a version of this for an internal production engineering hack days recently at Shopify.

We took a very different approach. ngx.semaphore was introduced in lua-nginx-module v0.10.0. Each nginx worker creates its own local semaphore instance with a count initialized to 0. When pausing is enabled, incoming requests will wait on the semaphore. This causes the Lua light thread to effectively pause, until the semaphore is posted. When pausing is disabled, a timer can be spawned to unblock the paused requests by posting to the semaphore until its count increases to 0. By sleeping in the timer function, you can perform a slow drain of all blocked requests.

This approach has the following advantages:

Another little trick you can do is use FFI to call ngx_http_lua_rd_check_broken_connection. This function allows you to check if the client connection was closed, and if so abort the request. By performing this check after returning from the semaphore wait, you can short circuit the request. Otherwise, the request will needlessly go to the upstream.

nickjj commented 3 years ago

Has there been any updates that's used internally at Basecamp to fix this issue, or a Shopify open source variant that bypasses the bug by using a different strategy? Trying to implement a fix or the semaphore solution is above my pay grade.

I'd like to use intermission as part of my normal deployment pipeline on single server deploys to achieve zero dropped requests during deploys. The back-end would likely be down for 5-10 seconds on each deploy. I'm kind of reluctant to roll this out if 1 client disconnect during intermission can negatively affect everyone in the queue by having them all receive a timeout.

svilen-ivanov commented 1 year ago

@nickjj I've implemented an improved version this project here. It solves some of the issues mentioned by @sirupsen and removes the ordering requirement. In my opinion, you can't maintain such contract in a web environment (considering the example here) if a client performs requests without waiting for the previous (potentially held in-flight) request to finish. In that case out-of-order execution could happen anytime, even during normal operation of the application