clj-commons / aleph

Asynchronous streaming communication for Clojure - web server, web client, and raw TCP/UDP
http://aleph.io
MIT License
2.54k stars 241 forks source link

Accept shutdownGracefully parameters #638

Closed arnaudgeiser closed 2 years ago

arnaudgeiser commented 2 years ago

Currently, shutdownGracefully cannot be configured and we rely on Netty's default behavior which will accept new connections during 2 seconds and stop the ~EventExecutor~ (EDIT: wrong EventLoopGroup) after 15 seconds. [1]

Some workloads need this behavior to be tweakable.

[1] : https://github.com/netty/netty/blob/b61d7d40f40e3a797e8a60cd567f849a9799c771/common/src/main/java/io/netty/util/concurrent/AbstractEventExecutor.java#L40-L41 [2] : https://github.com/clj-commons/aleph/blob/master/src/aleph/netty.clj#L1282

DerGuteMoritz commented 2 years ago

Oh this is great! The default timeouts are particularly egregious when running integration tests. This actually made me remove the netty/wait-for-close call at some point. Glad to hear that there is a way to parameterize this.

arnaudgeiser commented 2 years ago

While those parameters are now configurable, they will work as expected when you are trying to decrease them, not increase them.

Here is an example:

(defn handler [{:keys [body]}]
  (let [res (s/reduce (fn [acc _]
                        (prn acc)
                        (Thread/sleep 1000)
                        (inc acc))
                      0 body)]
    {:body (str @res)}))

(def server (http/start-server
              #'handler
              {:port 9999
               :raw-stream? true
               :shutdown-timeout 200
               :shutdown-quiet-period 100}))

Then upload a huge file

curl localhost:9999 --data-binary @/tmp/100GiB

Wait five seconds

1
2
3
4
5

Stop the server

(.close server)

It's stopped after 5 seconds (instead of 200 you could expect)

6
7
8
9
10

Connection is closed client side

curl: (55) Send failure: Broken pipe

But this is the expected behaviour of Netty according to this comment : https://github.com/netty/netty/pull/3706#issuecomment-97715011

cc @pyr

Closing the issue as it has been addressed by #639

DerGuteMoritz commented 2 years ago

While those parameters are now configurable, they will work as expected when you are trying to decrease them, not increase them.

Hm when reading your example scenario, it appears that it's rather the other way around, i.e. decreasing isn't possible? :thinking:

arnaudgeiser commented 2 years ago

(I will come with more examples and some pinpoints on Netty later today/in the evening)

arnaudgeiser commented 2 years ago

Oh this is great! The default timeouts are particularly egregious when running integration tests. This actually made me remove the netty/wait-for-close call at some point. Glad to hear that there is a way to parameterize this.

With the PR that just got merged, you will be able to solve your problem by specifying a shutdown-timeout (which defaults to 15 seconds). This parameter will guarantee your EventLoopGroup will be stopped (netty/wait-for-close) after at most X seconds.

This totally works the way you expect.

By default

(def server (http/start-server
              #'handler
              {:port 9999
               :raw-stream? true}))

(do (.close server)
    (netty/wait-for-close server))

Even without any traffic, it might take a few seconds for netty/wait-for-close to return.

With 0 second

(def server (http/start-server
              #'handler
              {:port 9999
               :raw-stream? true
               :shutdown-quiet-period 0 ;; shutdown-quiet-period should be less than shutdown-timeout
               :shutdown-timeout 0}))

(do (.close server)
    (netty/wait-for-close server))

It will be closed instantly.

With 200 second

(def server (http/start-server
              #'handler
              {:port 9999
               :raw-stream? true
               :shutdown-quiet-period 200 ;; shutdown-quiet-period should be less than shutdown-timeout
               :shutdown-timeout 200}))

(do (.close server)
    (netty/wait-for-close server))

This will guarantee the netty/wait-for-close will return after 200 seconds. All good!


Now, the issue is with the shutdown-quiet-period which, according the documentation :

Unlike [shutdown()](https://javadoc.io/static/io.netty/netty-all/4.1.8.Final/io/netty/util/concurrent/EventExecutorGroup.html#shutdown()), graceful shutdown ensures that no tasks are submitted for 'the quiet period' (usually a couple seconds) before it shuts itself down. If a task is submitted during the quiet period, it is guaranteed to be accepted and the quiet period will start over.

This is actually confusing... and can be translated in different ways. One way is what just got merged :

optional period in seconds for which new connections will still be serviced after a scheduled shutdown via java.io.Closeable#close. Defaults to 2 seconds.

But according to https://github.com/netty/netty/pull/3706#issuecomment-97715011, this understanding is totally inaccurate.

The EventLoopGroup will continue to accept tasks! But all the EventExecutor will be closed as soon as shutdownGracefully has been called. [1]

@pyr shout on the issue to understand what should be done to support what we could call a graceful shutdown (continue serving current requests) [2] If we have no answer from Norman, we should probably have a look at the other Netty based HTTP servers on the Java ecosystem (Vertx, Micronaut, Undertow, Akka?)

I will reopen the issue, we should at least revisit the docstring associated to the shutdown-quiet-period.

[1] : https://github.com/netty/netty/blob/4.1/transport-classes-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java#L420 [2] https://github.com/netty/netty/pull/3706#issuecomment-1303066857

arnaudgeiser commented 2 years ago

Quickly, here are the pointers for VertX [1] and Micronaut [2].

[1] : https://github.com/eclipse-vertx/vert.x/blob/11ab144725e551be591118831377101015e1392c/src/main/java/io/vertx/core/net/impl/VertxEventLoopGroup.java#L100-L107 [2] : https://github.com/micronaut-projects/micronaut-core/blob/231cd469962c88f7b457b5b09e539b5655e568e5/http-server-netty/src/main/java/io/micronaut/http/server/netty/NettyHttpServer.java#L589-L619

pyr commented 2 years ago

GRPC is also a likely good inspiration, see https://github.com/grpc/grpc-java/blob/master/netty/src/main/java/io/grpc/netty/NettyServer.java#L344-L371

pyr commented 2 years ago

The call to ServerListener::shutdown in GRPC seems indicate it's the right inspiration: https://github.com/grpc/grpc-java/blob/master/core/src/main/java/io/grpc/internal/ServerListener.java#L33-L37

arnaudgeiser commented 2 years ago

Anyway, this issue can be closed either whether:

  1. We change the docstring of shutdown-quiet-period to match Netty behaviour
  2. We implement exactly what the docstring mentions

I can't see any reason why aleph users would expect 1. I would really strive for 2.

arnaudgeiser commented 2 years ago

Just for completeness, there is one situation where it's working the way you would expect it => if you don't leave the EventLoop.

One obvious way is to pass executor :none :

(defn handler [{:keys [body]}]
  (let [res (s/reduce (fn [acc v]
                        (prn (.getName (Thread/currentThread)))
                        (Thread/sleep 1000) ;; blocking the EventLoop !!!
                        (inc acc)) 0 body)]
    (d/chain' res (fn [n]
                    {:body (str n)}))))

(def server (http/start-server #'handler
                               {:raw-stream? true
                                :port 9999
                                :executor :none}))

(do (.close server)
    (netty/wait-for-close server))

Output

"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"

The other way is to pass a custom executor which defines :onto? false

(defn handler [{:keys [body]}]
  (let [res (s/reduce (fn [acc v]
                        (prn (.getName (Thread/currentThread)))
                        (Thread/sleep 1000) ;; blocking the EventLoop !!!
                        (inc acc)) 0 body)]
    (d/chain' res (fn [n]
                    {:body (str n)}))))

(def executor (flow/utilization-executor 0.9 512
                     {:onto? false}))

(def server (http/start-server #'handler
                               {:raw-stream? true
                                :port 9999
                                :executor executor}))

(do (.close server)
    (netty/wait-for-close server))

Output

"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
"aleph-netty-server-event-pool-2"
arnaudgeiser commented 2 years ago

Thinking loudly...

Maybe we should track the active Channels by using a ChannelGroup [1]. Before calling .shutdownGracefully, we wait for the ChannelGroup to be empty as it removes closed channels automatically.

Simplify shutdown process with ChannelGroup If both ServerChannels and non-ServerChannels exist in the same ChannelGroup, any requested I/O operations on the group are performed for the ServerChannels first and then for the others. This rule is very useful when you shut down a server in one shot:

This seems to be exactly what we need.

Someone wrote a guide about how to write a chat with Netty which broadly describes the idea. [2]

WDYT?

[1] : https://netty.io/4.0/api/io/netty/channel/group/ChannelGroup.html [2] : https://github.com/pyr/net/blob/a2120bf61b6fd480707ea6b217545b3d640288d0/examples/guides/chat.md?plain=1#L32-L34

arnaudgeiser commented 2 years ago

I have a beginning of something there with this approach. [1] According to the first tests, it behaves as expected.

https://github.com/clj-commons/aleph/pull/641/files?diff=unified&w=1

DerGuteMoritz commented 2 years ago

With the PR that just got merged, you will be able to solve your problem by specifying a shutdown-timeout (which defaults to 15 seconds). This parameter will guarantee your EventLoopGroup will be stopped (netty/wait-for-close) after at most X seconds.

Just for the record: I can confirm that setting both new options to 0 works as adverstised :+1: Thanks!

I haven't fully wrapped my head around the remainig issue but looks like you guys are well underway :smile:

KingMob commented 2 years ago

looks like you guys are well underway

Don't look at me, I stop my servers with Ctrl+C, the way god intended 😁

arnaudgeiser commented 2 years ago

Addressed by : https://github.com/clj-commons/aleph/pull/642