dunglas / mercure

🪽 An open, easy, fast, reliable and battery-efficient solution for real-time communications
https://mercure.rocks
GNU Affero General Public License v3.0
3.98k stars 297 forks source link

Shutting down Mercure server takes indefinitely #986

Open bobvandevijver opened 4 days ago

bobvandevijver commented 4 days ago

I was looking into a very slow shutdown of the Mercure instance I am running on a Ubuntu server, using a systemd unit.

Unit file ``` [Unit] Description=Mercure server After=network.target [Service] Type=simple EnvironmentFile=/opt/application/mercure/env ExecStart=/opt/application/mercure/mercure run Restart=always RestartSec=5 RuntimeMaxSec=7d WorkingDirectory=/opt/application/mercure User=www-data [Install] WantedBy=multi-user.target ```
Logs ``` Nov 21 20:15:40 linux645 mercure[1197]: {"level":"info","ts":1732216540.629172,"msg":"shutting down apps, then terminating","signal":"SIGTERM"} Nov 21 20:15:40 linux645 mercure[1197]: {"level":"warn","ts":1732216540.6292357,"msg":"exiting; byeee!! �<9F><91><8B>","signal":"SIGTERM"} Nov 21 20:15:40 linux645 mercure[1197]: {"level":"info","ts":1732216540.6292503,"logger":"http","msg":"servers shutting down with eternal grace period"} Nov 21 20:15:40 linux645 systemd[1]: Stopping mercure.service - Mercure server... Nov 21 20:16:19 linux645 mercure[1197]: {"level":"info","ts":1732216579.8925092,"logger":"http.handlers.mercure","msg":"Subscriber disconnected","subscriber":{"id":"urn:uuid:adce74f5-1f27-4912-91b6-83cd4442f0a> Nov 21 20:16:19 linux645 mercure[1197]: {"level":"info","ts":1732216579.89257,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_ip":"127.0.0.1","remote_port":"55950","client_ip":"127.0> Nov 21 20:16:31 linux645 mercure[1197]: {"level":"info","ts":1732216591.1713917,"logger":"http.handlers.mercure","msg":"Subscriber disconnected","subscriber":{"id":"urn:uuid:5db2a139-2a36-4f4d-a7e9-e4b1ed7d6fb> Nov 21 20:16:31 linux645 mercure[1197]: {"level":"info","ts":1732216591.17145,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_ip":"127.0.0.1","remote_port":"37544","client_ip":"127.0> Nov 21 20:17:10 linux645 systemd[1]: mercure.service: State 'stop-sigterm' timed out. Killing. Nov 21 20:17:10 linux645 systemd[1]: mercure.service: Killing process 1197 (mercure) with signal SIGKILL. Nov 21 20:17:10 linux645 systemd[1]: mercure.service: Failed to kill control group /system.slice/mercure.service, ignoring: Invalid argument Nov 21 20:17:10 linux645 systemd[1]: mercure.service: Main process exited, code=killed, status=9/KILL Nov 21 20:17:10 linux645 systemd[1]: mercure.service: Failed with result 'timeout'. ```

From the logs it became clear that the default shutdown action (SIGTERM), while received perfectly fine by caddy/mercure, didn't really do anything except for denying new connections. After a 90 seconds, systemd would have had enough, and use SIGKILL to get rid of the process.

One line caught my attention:

servers shutting down with eternal grace period

This is something that is described in the Caddy manual under the grace_period option and allows the clients to complete their request before the server is actually shutdown. However, with SSE and long running connections, this does not really make sense as those clients will keep the connection option.

I have now added an additional env variable to my configuration:

GLOBAL_OPTIONS=grace_period 1s

And immediately mercure shutdowns fast, although with some errors in the log:

Logs with 1s `grace_period` ``` Nov 21 20:41:43 linux645 mercure[46014]: {"level":"info","ts":1732218103.2779043,"msg":"shutting down apps, then terminating","signal":"SIGTERM"} Nov 21 20:41:43 linux645 mercure[46014]: {"level":"warn","ts":1732218103.2779415,"msg":"exiting; byeee!! 👋","signal":"SIGTERM"} Nov 21 20:41:43 linux645 mercure[46014]: {"level":"info","ts":1732218103.277979,"logger":"http","msg":"servers shutting down; grace period initiated","duration":1} Nov 21 20:41:43 linux645 systemd[1]: Stopping nl-mercure.service - Mercure server... Nov 21 20:41:44 linux645 mercure[46014]: {"level":"error","ts":1732218104.278682,"logger":"http","msg":"server shutdown","error":"context deadline exceeded","addresses":[":3000"]} Nov 21 20:41:44 linux645 mercure[46014]: {"level":"warn","ts":1732218104.2788424,"logger":"http.handlers.mercure","msg":"Failed to remove subscriber on shutdown","subscriber":{},"error":"hub: read/write on closed Transport"} Nov 21 20:41:44 linux645 mercure[46014]: {"level":"info","ts":1732218104.2788796,"logger":"http.handlers.mercure","msg":"LocalSubscriber disconnected","subscriber":{}} Nov 21 20:41:44 linux645 mercure[46014]: {"level":"warn","ts":1732218104.2788115,"logger":"http.handlers.mercure","msg":"Failed to remove subscriber on shutdown","subscriber":{},"error":"hub: read/write on closed Transport"} Nov 21 20:41:44 linux645 mercure[46014]: {"level":"info","ts":1732218104.27891,"logger":"http.handlers.mercure","msg":"LocalSubscriber disconnected","subscriber":{}} Nov 21 20:41:44 linux645 mercure[46014]: {"level":"info","ts":1732218104.27891,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_ip":"127.0.0.1","remote_port":"46232","client_ip":"127.0.0.1","proto":"HTTP/1.1","method":"GET","host":"localhost:3000","uri":"/.well-known/mercure?lastEventID=urn%3Auuid%3A74003c94-cdc2-4fdc-8948-d4db509702ff&topic=%2A","headers":{REDACTED}},"bytes_read":0,"user_id":"","duration":38.964192544,"size":407,"status":200,"resp_headers":{REDACTED}} Nov 21 20:41:44 linux645 mercure[46014]: {"level":"info","ts":1732218104.2789543,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_ip":"127.0.0.1","remote_port":"53440","client_ip":"127.0.0.1","proto":"HTTP/1.1","method":"GET","host":"localhost:3000","uri":"/.well-known/mercure?lastEventID=urn%3Auuid%3Add133490-ed99-4d32-968a-40ba36492738&topic=%2A","headers":{REDACTED},"bytes_read":0,"user_id":"","duration":41.440676106,"size":4,"status":200,"resp_headers":{REDACTED}} Nov 21 20:41:44 linux645 mercure[46014]: {"level":"info","ts":1732218104.279367,"logger":"admin","msg":"stopped previous server","address":"localhost:2019"} Nov 21 20:41:44 linux645 mercure[46014]: {"level":"info","ts":1732218104.2793803,"msg":"shutdown complete","signal":"SIGTERM","exit_code":0} Nov 21 20:41:44 linux645 systemd[1]: mercure.service: Deactivated successfully. Nov 21 20:41:44 linux645 systemd[1]: Stopped mercure.service - Mercure server. ```

As I did not find any note about this in the documentation, I am wondering what the best course of action is. Is this be something that should only be added to the documentation, or is this maybe even something that we can can configure by default in the Caddyfile that is bundled with the tar.gz releases?

dunglas commented 4 days ago

Good catch. IMHO, we should detect that Caddy is shutting down (we may use the context for that) and cleanly close the SSE connection on our side. This should be done in the Caddy module.

In the meantime, your workaround looks good and we could at least update the default config.

dunglas commented 6 hours ago

Thanks for the PR @bobvandevijver.

I dived into this issue, and it looks like the current behavior is "expected". When I try to reproduce locally, the server closes all connections and finally terminates:

{"level":"info","ts":1732559992.2463958,"msg":"shutting down apps, then terminating","signal":"SIGTERM"}
{"level":"warn","ts":1732559992.2464871,"msg":"exiting; byeee!! 👋","signal":"SIGTERM"}
{"level":"info","ts":1732559992.2465448,"logger":"http","msg":"servers shutting down with eternal grace period"}
{"level":"debug","ts":1732559992.247116,"logger":"http.handlers.mercure","msg":"Connection closed by the client","subscriber":{"id":"urn:uuid:e1d0b3df-f2a8-43c9-b19a-b82b59737c44","last_event_id":"","remote_addr":"127.0.0.1:51060","topic_selectors":["https://example.com/my-private-topic","{scheme}://{+host}/demo/books/{id}.jsonld","/.well-known/mercure/subscriptions{/topic}{/subscriber}"],"topics":["https://example.com/my-private-topic"]}}
{"level":"info","ts":1732559992.256612,"logger":"http.handlers.mercure","msg":"LocalSubscriber disconnected","subscriber":{"id":"urn:uuid:e1d0b3df-f2a8-43c9-b19a-b82b59737c44","last_event_id":"","remote_addr":"127.0.0.1:51060","topic_selectors":["https://example.com/my-private-topic","{scheme}://{+host}/demo/books/{id}.jsonld","/.well-known/mercure/subscriptions{/topic}{/subscriber}"],"topics":["https://example.com/my-private-topic"]}}
{"level":"info","ts":1732559992.256854,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_ip":"127.0.0.1","remote_port":"51060","client_ip":"127.0.0.1","proto":"HTTP/3.0","method":"GET","host":"localhost","uri":"/.well-known/mercure?topic=https%3A%2F%2Fexample.com%2Fmy-private-topic","headers":{"Sec-Gpc":["1"],"Priority":["u=0"],"Accept":["text/event-stream"],"Alt-Used":["localhost"],"Dnt":["1"],"Cookie":["REDACTED"],"Sec-Fetch-Site":["same-origin"],"Authorization":["REDACTED"],"Referer":["https://localhost/.well-known/mercure/ui/"],"Sec-Fetch-Dest":["empty"],"Sec-Fetch-Mode":["cors"],"Pragma":["no-cache"],"Cache-Control":["no-cache"],"User-Agent":["Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:132.0) Gecko/20100101 Firefox/132.0"],"Accept-Language":["fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3"],"Accept-Encoding":["gzip, deflate, br, zstd"]},"tls":{"resumed":false,"version":772,"cipher_suite":4865,"proto":"h3","server_name":"localhost"}},"bytes_read":0,"user_id":"","duration":40.463308167,"size":4,"status":200,"resp_headers":{"Expire":["0"],"Server":["Caddy"],"Content-Type":["text/event-stream"],"Pragma":["no-cache"],"Content-Security-Policy":["default-src 'self' mercure.rocks cdn.jsdelivr.net"],"Connection":["keep-alive"],"Cache-Control":["private, no-cache, no-store, must-revalidate, max-age=0"],"X-Accel-Buffering":["no"],"Date":["Mon, 25 Nov 2024 18:39:11 GMT"],"X-Frame-Options":["DENY"],"X-Content-Type-Options":["nosniff"],"X-Xss-Protection":["1; mode=block"]}}
{"level":"info","ts":1732559992.25741,"logger":"admin","msg":"stopped previous server","address":"localhost:2019"}
{"level":"info","ts":1732559992.257416,"msg":"shutdown complete","signal":"SIGTERM","exit_code":0}

I think that the issue you're getting is because some of your connections with the clients are "stuck", so the underlying goroutine is stuck and https://github.com/dunglas/mercure/blob/main/subscribe.go#L125 is never called.

Could you show me your timeout configs (if any)? Setting Mercure's write_timeout may not be enough. Setting write here in addition (or in place) of the grace can help: https://caddyserver.com/docs/caddyfile/options#timeouts

bobvandevijver commented 4 hours ago

I am using the default Caddy file with only some configuration for the server name, transport and keys.

SERVER_NAME=':3000'
MERCURE_PUBLISHER_JWT_KEY=something
MERCURE_SUBSCRIBER_JWT_KEY=something
MERCURE_EXTRA_DIRECTIVES='transport bolt {
  path mercure.db
  size 10000
}'

However, I am using a nginx proxy (based on https://mercure.rocks/docs/hub/nginx) to be able to host the hub on the same port and domain as the main Symfony application. Client IP detection doesn't work currently (but that is why I just responded to #546).

        location /.well-known/mercure {
                proxy_pass http://localhost:3000;
                proxy_read_timeout 24h;
                proxy_http_version 1.1;
                proxy_set_header Connection "";
                proxy_pass_header Upgrade;

                # Remove untrusted headers
                proxy_set_header Forwarded "";
                proxy_set_header X-Real-IP "";
                proxy_set_header X-Forwarded-Scheme "";

                ## Be sure to set USE_FORWARDED_HEADERS=1 to allow the hub to use those headers ##
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_set_header X-Forwarded-Host $host;
                proxy_set_header X-Forwarded-Proto $scheme;
        }

Maybe that is relevant? I can consistently reproduce the issue with a 10s timeout window, where a freshly created connection isn't dropped (also visible in the browser console) when the restart is requested. Only after 10s the onerror on the EventSource is triggered.