danielniccoli commented 11 years ago

Original author: stefanta...@gmail.com (June 07, 2009 11:57:28)

The HTTP reverse proxy handler should not fail when only one of many information sources balanced with round-robin can't be accessed or time-outs. Instead it should repeat the request to the next available source. This would allow for zero-downtime when restarting the information sources one at a time.

Original issue: http://code.google.com/p/cherokee/issues/detail?id=489

danielniccoli commented 11 years ago

From ste...@konink.de on June 07, 2009 20:11:21 We have discussed this before in a more general way for example for fastcgi nodes; I agree with you on this point, it should be checked on the next available one.

danielniccoli commented 11 years ago

From alobbs on June 12, 2009 08:27:51 Yeah, I agree with you guys.

I just would like to add a little note for the record though. Once an information source has failed it's disabled for a reasonable amount of time. This means that it will only return a single error during the elapse of ~5 mins.

Again, it ought not to return that error before trying with the rest of the sources; you are right.

danielniccoli commented 11 years ago

From stefanta...@gmail.com on August 31, 2009 18:06:56 I propose the attached patch to fix this problem and provide zero-downtime while restarting multiple sources under the same balancer one by one.

The test environment is a reverse-proxy handler with a round-robin balancer managing 4 cherrypy servers (a custom Django setup). An ab benchmark (with concurrency above 4) is started and the sources are restarted one by one before ab finishes its requests. The end result is that all the requests are completed (HTTP 200) with a slightly higher time per request than the case where the sources are left running. Zero downtime achieved!

The patch does 2 things:

it changes cherokee_connection_open_request() so instead of just passing along the value returned by cherokee_handler_init() it forces a ret_eagain if the value is not ret_ok and the connection timeout is not expired. This is done so the handler can pick a good source and try to complete the request again. The weak point is that the default connection timeout there is 600s. We should use server!timeout for a more sane behavior.
the balancer's disable timeout is now a configuration variable instead of a hard-coded define. The config file entry looks like "vserver!10!rule!600!handler!balancer!balancer_disable_timeout = 10", has been implemented in the admin and if it's not available in the config file the old default of BAL_DISABLE_TIMEOUT (300s) is used. The only problem here might be the lack of po file entries but if you like the patch you'll change the strings anyway.

This patch should also solve the issues #288, #503 and #504 (Github: #502) .

danielniccoli commented 11 years ago

From ste...@konink.de on August 31, 2009 20:29:46 If it is what you write, I already like it...

danielniccoli commented 11 years ago

From alobbs on September 04, 2009 08:56:25 I've just reviewed the patch. The new timeout parameter is useful, it'll make it to upstream. Good stuff!

However, I can't quite understand this chunk (neither the rational behind it):

if(ret != ret_ok && cherokee_bogonow_now < conn->timeout) { usleep(1000); return ret_eagain;

}

danielniccoli commented 11 years ago

From stefanta...@gmail.com on September 18, 2009 21:44:33 It's an ugly hack to force a retry when an information source is not available (because it's being restarted). I need to rework this portion of the patch to limit the cases in which it retries to connect and have config entries for the maximum number of retries and the interval between them.

danielniccoli commented 11 years ago

From stefanta...@gmail.com on September 29, 2009 16:27:10 I've split the patch so you can integrate just the balancer_disable_timeout config parameter for now. The connection retrying part is not directly related to it anyway.

cherokee / webserver

sub-optimal handling of information source restart in a round-robin reverse proxy configuration #490

}