Closed arnaudgeiser closed 2 years ago
Oh this is great! The default timeouts are particularly egregious when running integration tests. This actually made me remove the netty/wait-for-close
call at some point. Glad to hear that there is a way to parameterize this.
While those parameters are now configurable, they will work as expected when you are trying to decrease them, not increase them.
Here is an example:
(defn handler [{:keys [body]}]
(let [res (s/reduce (fn [acc _]
(prn acc)
(Thread/sleep 1000)
(inc acc))
0 body)]
{:body (str @res)}))
(def server (http/start-server
#'handler
{:port 9999
:raw-stream? true
:shutdown-timeout 200
:shutdown-quiet-period 100}))
Then upload a huge file
curl localhost:9999 --data-binary @/tmp/100GiB
Wait five seconds
1
2
3
4
5
Stop the server
(.close server)
It's stopped after 5 seconds (instead of 200 you could expect)
6
7
8
9
10
Connection is closed client side
curl: (55) Send failure: Broken pipe
But this is the expected behaviour of Netty according to this comment : https://github.com/netty/netty/pull/3706#issuecomment-97715011
cc @pyr
Closing the issue as it has been addressed by #639
While those parameters are now configurable, they will work as expected when you are trying to decrease them, not increase them.
Hm when reading your example scenario, it appears that it's rather the other way around, i.e. decreasing isn't possible? :thinking:
(I will come with more examples and some pinpoints on Netty later today/in the evening)
Oh this is great! The default timeouts are particularly egregious when running integration tests. This actually made me remove the netty/wait-for-close call at some point. Glad to hear that there is a way to parameterize this.
With the PR that just got merged, you will be able to solve your problem by specifying a shutdown-timeout
(which defaults to 15 seconds). This parameter will guarantee your EventLoopGroup
will be stopped (netty/wait-for-close
) after at most X seconds.
This totally works the way you expect.
By default
(def server (http/start-server
#'handler
{:port 9999
:raw-stream? true}))
(do (.close server)
(netty/wait-for-close server))
Even without any traffic, it might take a few seconds for netty/wait-for-close
to return.
With 0 second
(def server (http/start-server
#'handler
{:port 9999
:raw-stream? true
:shutdown-quiet-period 0 ;; shutdown-quiet-period should be less than shutdown-timeout
:shutdown-timeout 0}))
(do (.close server)
(netty/wait-for-close server))
It will be closed instantly.
With 200 second
(def server (http/start-server
#'handler
{:port 9999
:raw-stream? true
:shutdown-quiet-period 200 ;; shutdown-quiet-period should be less than shutdown-timeout
:shutdown-timeout 200}))
(do (.close server)
(netty/wait-for-close server))
This will guarantee the netty/wait-for-close
will return after 200 seconds.
All good!
Now, the issue is with the shutdown-quiet-period
which, according the documentation :
Unlike [shutdown()](https://javadoc.io/static/io.netty/netty-all/4.1.8.Final/io/netty/util/concurrent/EventExecutorGroup.html#shutdown()), graceful shutdown ensures that no tasks are submitted for 'the quiet period' (usually a couple seconds) before it shuts itself down. If a task is submitted during the quiet period, it is guaranteed to be accepted and the quiet period will start over.
This is actually confusing... and can be translated in different ways. One way is what just got merged :
optional period in seconds for which new connections will still be serviced after a scheduled shutdown via
java.io.Closeable#close
. Defaults to 2 seconds.
But according to https://github.com/netty/netty/pull/3706#issuecomment-97715011, this understanding is totally inaccurate.
The EventLoopGroup
will continue to accept tasks! But all the EventExecutor
will be closed as soon as shutdownGracefully
has been called. [1]
@pyr shout on the issue to understand what should be done to support what we could call a graceful shutdown (continue serving current requests) [2] If we have no answer from Norman, we should probably have a look at the other Netty based HTTP servers on the Java ecosystem (Vertx, Micronaut, Undertow, Akka?)
I will reopen the issue, we should at least revisit the docstring associated to the shutdown-quiet-period
.
[1] : https://github.com/netty/netty/blob/4.1/transport-classes-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java#L420 [2] https://github.com/netty/netty/pull/3706#issuecomment-1303066857
Quickly, here are the pointers for VertX [1] and Micronaut [2].
[1] : https://github.com/eclipse-vertx/vert.x/blob/11ab144725e551be591118831377101015e1392c/src/main/java/io/vertx/core/net/impl/VertxEventLoopGroup.java#L100-L107 [2] : https://github.com/micronaut-projects/micronaut-core/blob/231cd469962c88f7b457b5b09e539b5655e568e5/http-server-netty/src/main/java/io/micronaut/http/server/netty/NettyHttpServer.java#L589-L619
GRPC is also a likely good inspiration, see https://github.com/grpc/grpc-java/blob/master/netty/src/main/java/io/grpc/netty/NettyServer.java#L344-L371
The call to ServerListener::shutdown
in GRPC seems indicate it's the right inspiration: https://github.com/grpc/grpc-java/blob/master/core/src/main/java/io/grpc/internal/ServerListener.java#L33-L37
Anyway, this issue can be closed either whether:
shutdown-quiet-period
to match Netty behaviourI can't see any reason why aleph users would expect 1. I would really strive for 2.
Just for completeness, there is one situation where it's working the way you would expect it => if you don't leave the EventLoop
.
One obvious way is to pass executor :none
:
(defn handler [{:keys [body]}]
(let [res (s/reduce (fn [acc v]
(prn (.getName (Thread/currentThread)))
(Thread/sleep 1000) ;; blocking the EventLoop !!!
(inc acc)) 0 body)]
(d/chain' res (fn [n]
{:body (str n)}))))
(def server (http/start-server #'handler
{:raw-stream? true
:port 9999
:executor :none}))
(do (.close server)
(netty/wait-for-close server))
Output
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
The other way is to pass a custom executor which defines :onto? false
(defn handler [{:keys [body]}]
(let [res (s/reduce (fn [acc v]
(prn (.getName (Thread/currentThread)))
(Thread/sleep 1000) ;; blocking the EventLoop !!!
(inc acc)) 0 body)]
(d/chain' res (fn [n]
{:body (str n)}))))
(def executor (flow/utilization-executor 0.9 512
{:onto? false}))
(def server (http/start-server #'handler
{:raw-stream? true
:port 9999
:executor executor}))
(do (.close server)
(netty/wait-for-close server))
Output
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
Thinking loudly...
Maybe we should track the active Channels
by using a ChannelGroup
[1].
Before calling .shutdownGracefully
, we wait for the ChannelGroup
to be empty as it removes
closed channels automatically.
Simplify shutdown process with ChannelGroup If both ServerChannels and non-ServerChannels exist in the same ChannelGroup, any requested I/O operations on the group are performed for the ServerChannels first and then for the others. This rule is very useful when you shut down a server in one shot:
This seems to be exactly what we need.
Someone wrote a guide about how to write a chat with Netty which broadly describes the idea. [2]
WDYT?
[1] : https://netty.io/4.0/api/io/netty/channel/group/ChannelGroup.html [2] : https://github.com/pyr/net/blob/a2120bf61b6fd480707ea6b217545b3d640288d0/examples/guides/chat.md?plain=1#L32-L34
I have a beginning of something there with this approach. [1] According to the first tests, it behaves as expected.
https://github.com/clj-commons/aleph/pull/641/files?diff=unified&w=1
With the PR that just got merged, you will be able to solve your problem by specifying a
shutdown-timeout
(which defaults to 15 seconds). This parameter will guarantee yourEventLoopGroup
will be stopped (netty/wait-for-close
) after at most X seconds.
Just for the record: I can confirm that setting both new options to 0 works as adverstised :+1: Thanks!
I haven't fully wrapped my head around the remainig issue but looks like you guys are well underway :smile:
looks like you guys are well underway
Don't look at me, I stop my servers with Ctrl+C, the way god intended 😁
Addressed by : https://github.com/clj-commons/aleph/pull/642
Currently, shutdownGracefully cannot be configured and we rely on Netty's default behavior which will accept new connections during 2 seconds and stop the ~
EventExecutor
~ (EDIT: wrongEventLoopGroup
) after 15 seconds. [1]Some workloads need this behavior to be tweakable.
netty.clj
so people can rely on them [2]start-server
accept those new options (please refactor, add another arity function with the parameter as a map) [2][1] : https://github.com/netty/netty/blob/b61d7d40f40e3a797e8a60cd567f849a9799c771/common/src/main/java/io/netty/util/concurrent/AbstractEventExecutor.java#L40-L41 [2] : https://github.com/clj-commons/aleph/blob/master/src/aleph/netty.clj#L1282