netty: configurable server graceful shutdowns

pyr commented 2 years ago

The recently closed #641 and previous #639 were both attempting to improve the server shutdown procedure.

The intent of both these PRs is to make it easily viable to provide the classic blue-green deployment mode where:

Servers stop listening for incoming requests
In-flight requests are still processed for a configurable grace period
All resources are then cleaned up

After more digging (my Netty-foo being a bit rusty of late :-)) an approach relying on Netty's ChannelGroup functionality seems to be the right mechanism.

Of course, not everyone will want to keep a set of all active connections, including its memory and latency impact.

This revised PR modifies the previous approach in the following way:

Allow for hooking into the shutdown process by way of the shutdown-hook parameter (accepted by http/start-server,`tcp/start-server, and netty/start-server).
Provide a maximum time interval for shutdown-hook to run in, defaulting to 15 seconds
~Walk back configurability of shutdownGracefully parameters as they do not give a reliable way to control shutdown.~

In addition to this, a few utilities are provided to handle channel groups, as well as a ready made channel handler to populate a channel group, since this will be the most common use of shutdown-hook.

If this PR goes through, I am happy to open a subsequent one with updated docs with a walk-through on how to achieve these stop semantics.

(useful disclosure, @arnaudgeiser and I work together so take his approval seal with a grain of salt :smile:)

arnaudgeiser commented 2 years ago

Funny... I just added a commit on top of this PR and the tests ran....

pyr commented 2 years ago

@KingMob I agree with you, I still think hooking into the process could be important so I left the possibility open but now defaulted to creating a sensible hook.

The parameters need rewording and the docstrings need to be updated, but before going further I want to know if we agree on the flow:

by default, behave as before
if a shutdown-hook-timeout is provided but no hook, track requests and let them be processed before shutting down within the supplied interval
if both shutdown-hook-timeout and shutdown-hook are provided, run the provided hook

I suppose the parameters could be named: graceful-shutdown-period and graceful-shutdown-hook, would this work?

KingMob commented 2 years ago

For future reference, please don't force-push in the middle of reviewing a large PR. I personally love to rewrite history, but it has to be done carefully.

It makes it impossible to see only new changes since the last review. I only force push when a) nobody's started reviewing, b) everybody's done reviewing, c) all the code needs to be re-reviewed, or d) nobody's going to review the changes.

Having to scan over and/or re-read everything is a pain.

KingMob commented 2 years ago

Let's take a step back and think about the usage and the API.

Will users ever want/need NON-graceful shutdown? (I.e., giving channels time to wrap up)

If not, we should wrap their hook in one that automatically takes care of the graceful details, and eliminate fns they can mess with, either by inlining them, declaring them private, or at a bare minimum, apply ^:no-doc metadata. In this scenario, everyone always wants a graceful shutdown.

If they do need immediate shutdown, can we still use the graceful shutdown method (channel group, pipeline xform, etc), but tell users to set the timeout to 0 if they want it closed ASAP? It's intuitive and doesn't require two code paths.

If we are maintaining a separate code path, should there be a default, which should it be, and what should that look like?

:graceful-shutdown? (maybe defaulting to true)
:immediate-shutdown? (maybe defaulting to false)
:wait-for-channels-during-shutdown?
etc.

I would prefer to avoid separate code paths if avoidable. Conditionally making channel groups, altering the pipeline, and maintaining different hooks is really unappealing to me if we don't have to. My preference is to make it all graceful, and let people set the timeout to 0 if they're in a hurry.

pyr commented 2 years ago

Will users ever want/need NON-graceful shutdown? (I.e., giving channels time to wrap up)

Yes, high traffic proxies will not want to pay the memory and latency cost of maintaining the channel set.

pyr commented 2 years ago

If we are maintaining a separate code path, should there be a default, which should it be, and what should that look like?

:graceful-shutdown? (maybe defaulting to true)

:immediate-shutdown? (maybe defaulting to false)

:wait-for-channels-during-shutdown?

etc

Figuring this out was the intent of my last comment.

My initial hunch was that we cannot change the default behavior since aleph updates are usually relatively silent. If you feel this is warranted and provided we make it clear in the release notes, why not. I leave the judgment of what you feel is OK to you. If the behavior change is OK, I would indeed make it the default as it is the sanest thing to do.

The ideal option from my PoV:

Graceful shutdown by default with a 15 second timeout interval
No waiting on the event loop group (i.e: (.shutdownGracefully 0 0 TimeUnit/SECONDS))

Control-hooks:

graceful-shutdown-fn: change the hook if something else, additional
graceful-shutdown?:
graceful-shutdown-timeout: timeout period for the channel group close future, defaults to 15 seconds
group-shutdown-timeout: control the parameters to .shutdownGracefully group, quiet period would be (min 2 timeout) . Defaulting to 0.

If we want to preserve the existing behavior, graceful-shutdown? would need to be false by default, and group-shutdown-timeout set to 15.

So in essence, I guess the two good couple of defaults most people would want to use are:

[false 15]
[true 0]

KingMob commented 2 years ago

Yes, high traffic proxies will not want to pay the memory and latency cost of maintaining the channel set.

Hmmm, is there a way we can quantify this to know for sure? I would think a ChannelGroup introduces minimal memory overhead. It's two ConcurrentHashMaps per server that hold channel pointers, and adds/removes them at the beginning/end of their lives. Essentially some long arrays roughly the highest number of simultaneous channels with a little overhead.

Likewise, is the insertion time really that much overhead? (Removal time is after the response has been sent, so it shouldn't matter for latency.) I just timed calls to DefaultChannelGroup's .add(), the core operation, and it takes about 70 nanoseconds per call on my machine. That's ~15 million calls/s; even considering lock contention, you'd need a really high load before this becomes a problem, no?

If it's really crucial, we can implement two paths, but this seems like a premature optimization to me. Convince me otherwise?

KingMob commented 2 years ago

It's getting late in my TZ, so I have to go, but on one point:

we cannot change the default behavior since aleph updates are usually relatively silent.

You're right, this is usually true, but I don't think we've ever promised any behavior in regards to shutdown, so I think we have some extra leeway here.

arnaudgeiser commented 2 years ago

So, if we agree to change the current shutdown behavior, we can pivot only around shutdown-timeout which will default to 15 seconds. We don't even have to introduce the tracking-channel-handler, we can modify the default one. [1] Let's wait for a feature request to make it configurable if it has to (extract that behaviour inside a named ChannelHandler that can be removed with pipeline-transform).

We can remove both the shutdown-hook and the shutdown-hook-timeout. It solves a problem we don't have currently and that could be part of another PR if there is a need for it.

So I totally concur with this:

The ideal option from my PoV: Graceful shutdown by default with a 15 second timeout interval No waiting on the event loop group (i.e: (.shutdownGracefully 0 0 TimeUnit/SECONDS))

I think that's a compromise that suits everyone, right?

[1] : https://github.com/clj-commons/aleph/blob/master/src/aleph/http/server.clj#L353

DerGuteMoritz commented 2 years ago

Will users ever want/need NON-graceful shutdown? (I.e., giving channels time to wrap up)

Just to throw my 2 cents into the hat: I wanted this in the past for use in integration test suites. When there are a lot of tests which start a server at the beginning and stop it again at the end, you want to await their proper shutdown so that different tests don't collide (e.g. because they always bind to the same listening port). Having each test wait for at least 2 seconds adds up quickly then. This is a case where stopping the server as god intended doesn't work either :smile:

arnaudgeiser commented 2 years ago

With the approach that has been discussed, it will still be supported @DerGuteMoritz.

pyr commented 2 years ago

If it's really crucial, we can implement two paths, but this seems like a premature optimization to me. Convince me otherwise?

I think in most cases you are right, and for the odd case where it isn't, pipeline-transform would allow removing the handler from the pipeline. I will adapt the approach to always create the handler on the pipeline.

You're right, this is usually true, but I don't think we've ever promised any behavior in regards to shutdown, so I think we have some extra leeway here.

OK, this definitely makes thing easier.

Then I would do the following:

Make the default be to gracefully shutdown, and then shut the executor down immediately, (the equivalent of the [true 0] option above.
Provide graceful-shutdown-timeout and eventloop-shutdown-timeout:
- graceful-shutdown-timeout defaulting to 15 seconds. A value of 0 would disable waiting on the channel-group.
- eventloop-shutdown-timeout defaulting to 0 seconds.

The last question remaining is you would be interested in modifying the shutdown-hook or not. I'm fine with removing it

If we agree on the above I can go forward and adapt the PR, option names, and docs accordingly, let me know!

KingMob commented 2 years ago

Graceful by default sounds good to me. I think we have a solution.

Thanks for reminding me about test suites. Yeah, we definitely want a way to kill the server ASAP in that situation.

Having graceful-shutdown-timeout and eventloop-shutdown-timeout is OK, but the naming is confusing in regards to their relationship. Can we come up with better names? Failing that, we need some good docstrings.

I don't think we need to remove a user-supplied shutdown-hook option, as long as the user doesn't have to orchestrate "graceful-ness", and the docs are clear on when their hook is called. OTOH, if nobody needs the hook, we can skip it until someone asks for it. I leave that up to you.

arnaudgeiser commented 2 years ago

I don't think we need to remove a user-supplied shutdown-hook option, as long as the user doesn't have to orchestrate "graceful-ness", and the docs are clear on when their hook is called. OTOH, if nobody needs the hook, we can skip it until someone asks for it. I leave that up to you.

Agreed, but depending on where the shutdown-hook is placed, it should not be an aleph responsibility, but the responsibility of your favorite dependency management library (component, integrant, mount etc...)

Whether the shutdown-hook is placed between [1] and [2], it definitely has this place on Aleph
Whether the shutdown-hook is placed between [2] and [3], it should be the responsibility of a dependency management library

[1] : Stop listening to incoming requests [2] : Run the shutdown hook within the supplied timeout interval when provided. This can be used to finish processing in-flight requests for instance. [3] : At this stage, stop the EventLoopGroup

pyr commented 2 years ago

Agreed, but depending on where the shutdown-hook is placed, it should not be an aleph responsibility, but the responsibility of your favorite dependency management library (component, integrant, mount etc...)

I agree with this, and it would speak against the current place where on-close gets ran. on-close is necessary internally but maybe it shouldn't have been exposed since it can be fully orchestrated outside of aleph.

If we were to consider changing the placement of the hook point I think it would make sense. Otherwise, after having given it more thought I think it's too many knobs to tweak for library consumers.

Having graceful-shutdown-timeout and eventloop-shutdown-timeout is OK, but the naming is confusing in regards to their relationship. Can we come up with better names? Failing that, we need some good docstrings.

I agree with the poor naming, thinking about it, the knob that most people will want to tweak is the graceful one, so I would go with shutdown-timeout for this one (the existing one hasn't made it to a release yet).

After a bit of thought and looking at the feedback here, it seems is that the default parameters to EventLoopGroup::shutdownGracefully will work out for everyone except in odd scenarios like test ones, maybe we can go with fast-eventloop-shutdown? being a boolean.

So to sum it up one possible strategy would be:

shutdown-timeout: Controlling the amount of time the channel group
fast-eventloop-shutdown?: Would set the parameters of EventLoopGroup::shutdownGracefully to 0, 0, TimeUnit/SECONDS. Defaults to false

If everyone wants to consider the change of location of the hook I would make the current on-close parameter internal and use the same name for the hook that can run between server channel closing and EventLoopGroup shutdown.

arnaudgeiser commented 2 years ago

I agree with this, and it would speak against the current place where on-close gets ran. on-close is necessary internally but maybe it shouldn't have been exposed since it can be fully orchestrated outside of aleph.

Exactly, it should never have been exposed. But what has been done cannot be undone... :(

I agree with the poor naming, thinking about it, the knob that most people will want to tweak is the graceful one, so I would go with shutdown-timeout for this one (the existing one hasn't made it to a release yet).

Fine for me!

fast-eventloop-shutdown?: Would set the parameters of EventLoopGroup::shutdownGracefully to 0, 0, TimeUnit/SECONDS. Defaults to false

What about :

(if (zero? shutdown-timeout)
  (.shutdownGracefully group 0 0)
  (.shutdownGracefully group))

If everyone wants to consider the change of location of the hook I would make the current on-close parameter internal and use the same name for the hook that can run between server channel closing and EventLoopGroup shutdown.

To me, that's the way to go (if we are talking about putting the on-close parameter between [1] and [2] - what I understood).

[1] : Stop listening to incoming requests [2] : Run the shutdown hook within the supplied timeout interval when provided. This can be used to finish processing in-flight requests for instance. [3] : At this stage, stop the EventLoopGroup

pyr commented 2 years ago

Your proposal kills the opportunity to have fast test shutdowns unfortunately. For the rest we agree.

arnaudgeiser commented 2 years ago

And why not (.shutdownGracefully group 0 0) in all cases? Whether waiting for active connections is the default, it's acceptable to have this behavior as non-parameterizable, no?

That's what you proposed at some point:

Graceful shutdown by default with a 15 second timeout interval No waiting on the event loop group (i.e: (.shutdownGracefully 0 0 TimeUnit/SECONDS))

What did we discover in the meantime that Invalided this?

pyr commented 2 years ago

I adapted to the proposed approach, taking into account @arnaudgeiser's comment

pyr commented 2 years ago

Approving but would be nice to do the adaptations on the other tests to make them fast again. Furthermore, the title of the PR has to be changed to reflect what has been done.

I will do now that the direction seems to work for everyone, once I get a second approval I will also do a last force push to have proper commit messages. (But later today)

pyr commented 2 years ago

All comments have been addressed except for @KingMob's regarding shutdown-quiet-period which I think doesn't apply. Once we find an agreement on this last one I'll rework the commit messages and push a cleaned rebased set of commits.

@arnaudgeiser tests are indeed way faster now :-)

Thanks for the thorough reviews everyone!

pyr commented 2 years ago

@KingMob The code is ready for review, I will leave the rebase + force-push for after your review/approval

pyr commented 2 years ago

OK, from my perspective it's now clean and ready to go. A side question while I have your attention: the fact that external contributions do not get their tests ran is a bit inconvenient, would you be open to a PR that runs tests from Github actions which wouldn't need any particular setup? This could be done in addition to the Circle CI one but at least provide a functional test run for contributions such as this PR

clj-commons / aleph

netty: configurable server graceful shutdowns #642