Add option to retry on ECONNRESET or connection close

Clever / thrift-pool

A module that wraps thrift interfaces in connection pooling logic to make them more resilient.

Apache License 2.0

22 stars 15 forks source link

Add option to retry on ECONNRESET or connection close #7

Closed springheeledjak closed 8 years ago

springheeledjak commented 9 years ago

Both Node and external load balancers have a habit of being temperamental when it comes to long-lived sockets, so I've added a pool_options.retries option to attempt to retry any function call if the socket is closed or reset mid-call.

I believe there's also a minor race condition between the close handler and the validate method passed to the pool (namely around the value of connection.__ended), but this works around it without needing expensive/complicated locking or the like.

cozmo commented 9 years ago

Thanks for the PR! I'll take a look :smile:

cozmo commented 9 years ago

@springheeledjak What are your thoughts around putting this in the library vs the caller? It seems to add a lot of complexity to the library, and in general we (Clever) have tended towards retry logic being the responsibility of the consumer. You can see this with some of our client libraries like https://github.com/Clever/clever-js. In fact internally we have put retry logic in some callers of thrift-pool.

I don't have a strong opinion here, but am hesitant to add a lot of complexity to a library which is already relatively opaque in how it works.

springheeledjak commented 9 years ago

@cozmo Hm. One one hand I'm all for that kind of separation of concerns, but on the other I feel that, at least to a degree, the job of any kind of connection pool should be to provide reliable connections above all else. Given the asynchronous nature of Node sockets and the tendency of ELBs and other load balancers to reset/close connections, I feel like in this case a "reliable connection" must necessarily include the ability to handle those issues without disrupting the caller.

Another significant argument in favor is the avoidance of too many layers of abstraction: having a retry layer, on top of a pooling layer (or vice versa), on top of the base Thrift client seems a bit deep of a stack for an RPC client you're using everywhere. To be fair there's a solid counterargument to be made regarding treating all of these pieces as composable/stackable middleware, though refactoring this library to plug into a middleware framework would be a not-insignificant engineering effort...

azylman commented 9 years ago

Maybe the right way to fix this would be to fork node-generic-pool so that the "validate" function (what validates a connection before returning it in acquire) took a callback, and then having "validate" try to ping the connection. I think that would make a lot of this logic simpler, and better get at the intended behavior. Thoughts?

springheeledjak commented 8 years ago

@azylman hmm, I like that. It's kinda going in the same direction as stackable middleware, but perhaps less complicated (and certainly more expedient).

That said, I stand by my assertion that robustness against unexpected socket closure is an essential part of providing a reliable connection, and providing reliable connections is really the only point I can think of to even use a pool. Without that, you're forcing every single caller to implement retry logic, which feels like an awful lot of unnecessary wheel-reinvention. Especially since 99% of callers are going to want to not drop the call if it's just an ECONNRESET.

azylman commented 8 years ago

It looks like as of thrift 0.9.2 there's an option max_attempts that you can pass to thrift to make it reconnect on connection failures: https://issues.apache.org/jira/browse/THRIFT-2058. As far as I can tell there's no documentation about it, but it exists.

Might be worth trying that and seeing if it works for you?

azylman commented 8 years ago

generic-pool added a validateAsync method now, so we could do this