clj-commons / aleph

Asynchronous streaming communication for Clojure - web server, web client, and raw TCP/UDP
http://aleph.io
MIT License
2.54k stars 241 forks source link

[client] Allow to limit maximum time of a reusing client connection #565

Closed alexander-yakushev closed 1 year ago

alexander-yakushev commented 3 years ago

It is sometimes useful to be able to limit the lifetime of a reusable TCP connection in the client pool. This is the most naive implementation with minimal changes. Since http-connection already tracks the socket creation time, it can also check the current time against the deadline and signal upstream (to the request function) via :keep-alive? field that the connection should be scrapped. Jitter is also added to prevent reconnection avalanches. This feature has been validated in production.

@ztellman @kachayev What do you think about this? Is such functionality needed in the client, or the server should control the connection lifetimes itself? If the functionality is useful, is this implementation OK, or something more straightforward (and involving more changes) is warranted?

kachayev commented 3 years ago

@alexander-yakushev Thanks for the PR! I'm trying to think through potential use cases... I honestly never seen something like this being utilized. It's helpful in practice to limit the lifetime of idle connections (and there's a dedicated idle handler to make this happen). But if the connection is actively used... what would be a good reason to close it?

alexander-yakushev commented 3 years ago

I might have a poor understanding of how the following works, and there could be an easier solution, but this is the problem we had: a particular AWS ALB was suffocating under the request load (because of some experimental routing policies we enabled) and spawning new instances, but the client service didn't initiate new connections because the old ones were reused and were technically intact (although had degraded latency). Trying to limit the connection lifetime on the target service (the one balanced by ALB) I presume would not work since keep-alive through a balancer involves two separate TCP sockets (client<->balancer and balancer<->server), if I understand it correctly.

So, I guess it was an Amazon problem after all – it should start shedding the connections to its instance once it is overloaded. But for some reason it doesn't, so we had to employ this fix, and I thought it might be helpful in some other usecases as well.

kachayev commented 3 years ago

Oh, I see. The problem here is not an initial connection but rather what happens on the other side of a load balancer. I would say that the correct solution for HTTP/1.1 is a proper handling on ELB side (hence HTTP/1.1 was initially designed for such infrastructure blocks, unlike let's say HTTP/2+).

I guess if you have access to the connection, you can schedule http-client/close-connection when necessary. But that's not something that's exposed explicitly. And additional Netty handler on the pipeline might be quite tricky to align with all operations that Aleph is doing... Let me think a bit.

alexander-yakushev commented 2 years ago

Rebased.

arnaudgeiser commented 2 years ago

I've added some tests regarding this feature. However, I'm wondering if that's you expect from this. The connection won't be disposed once it's elapsed but the next call after it's elapsed (which could happen a few minutes/hours/days after).

arnaudgeiser commented 2 years ago

Even without answer from @alexander-yakushev, I would merge this PR into aleph. Yes it can be improved, yes it's not perfect but still, it solves an issue that some people might have without having to shuffle a lot of code.

I attach great importance to this:

This feature has been validated in production.

However, before merging it I would slightly adapt the comments to reflect what the code is really doing.

KingMob commented 2 years ago

I agree that the max-lifetime docstring should be updated. It should inform people that this will kill in-use connections, and that they probably would prefer to use the idle-timeout option instead.

However, I don't know of anyone needing this besides @alexander-yakushev, so I'd prefer to wait for him. I'd hate for him to come back and say, "Yeah, we changed our version of this code a year ago for reasons XYZ, I meant to update the PR."

arnaudgeiser commented 1 year ago

Since we don't have news regarding this feature and it's not a high demand one, I will close this PR. Do not hesitate to re-open.