caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com
Apache License 2.0

Active websocket connections are still closed on config reload #6420

Closed · daweedm closed 6 days ago

daweedm commented 6 days ago

1. Environment

1a. Operating system and version

Docker version 20.10.21, build baeda1f

# Dockerfile
FROM caddy:2.6.2
RUN apk add curl
HEALTHCHECK --interval=10s --timeout=3s CMD curl --fail http://127.0.0.1:2016/config/ || exit 1

1b. Caddy version (run caddy version or paste commit SHA)

v2.8.4 h1:q3pe0wpBj1OcHFZ3n/1nl4V4bxBrYoSoab7rL9BMYNk=

2. Description

2a. What happens (briefly explain what is wrong)

Websocket connections are still closed on config reload despite #5471. We added the stream_close_delay flag to our config, but once we add a new handle using the API, we notice that all websocket clients get disconnected for a short time. This is systematic: we made a script that triggers the API call, and we clearly see the disconnect in every websocket client's logs.

2d. Workaround(s)

None

2e. Relevant links

#5471

3. Tutorial (minimal steps to reproduce the bug)

  1. Connect a client to the Caddy web server over a websocket
  2. Update the Caddy config using the admin API (an example call is sketched after this list)
  3. The client is disconnected from the websocket
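
For step 2, the call we script is along these lines (a rough sketch, not our exact payload; it assumes the admin endpoint on 127.0.0.1:2016 from the healthcheck above and the usual apps/http/servers path, and the @id and dial address here are made up):

# Append a hypothetical route to srv0 via the admin API (POST on an array path appends).
curl -X POST http://127.0.0.1:2016/config/apps/http/servers/srv0/routes \
  -H "Content-Type: application/json" \
  -d '{
        "@id": "srv0-example",
        "handle": [
          {
            "handler": "reverse_proxy",
            "stream_close_delay": "60s",
            "upstreams": [ { "dial": "example.internal:8080" } ]
          }
        ]
      }'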

Example configuration:

{
  "srv0": {
    "automatic_https": {
      "disable": true
    },
    "errors": {
      "@id": "srv0-error",
      "routes": [
        {
          "handle": [
            {
              "handler": "rewrite",
              "uri": "{http.matchers.file.relative}"
            },
            {
              "handler": "file_server",
              "root": "/var/www/html"
            }
          ],
          "match": [
            {
              "file": {
                "root": "/var/www/html",
                "try_files": [
                  "{http.request.uri.path}",
                  "index.html",
                  "=404"
                ]
              }
            }
          ]
        }
      ]
    },
    "listen": [
      ":443"
    ],
    "routes": [
      {
        "handle": [
          {
            "handler": "subroute",
            "routes": [
              {
                "handle": [
                  {
                    "body": "port 443",
                    "handler": "static_response"
                  }
                ]
              }
            ]
          }
        ],
        "match": [
          {
            "host": [
              "our-domain.com"
            ]
          }
        ],
        "terminal": true
      },
      {
        "@id": "srv0-u982998-p443",
        "handle": [
          {
            "handler": "reverse_proxy",
            "stream_close_delay": "60s",
            "transport": {
              "protocol": "http",
              "tls": {
                "server_name": "u982998.our-domain.com"
              }
            },
            "upstreams": [
              {
                "dial": "our-dial.com:30190"
              }
            ]
          },
          {
            "handler": "reverse_proxy",
            "stream_close_delay": "60s",
            "transport": {
              "protocol": "http",
              "tls": {
                "server_name": "*.u982998.our-domain.com"
              }
            },
            "upstreams": [
              {
                "dial": "our-dial.com:30190"
              }
            ]
          }
        ],
# etc ...
mholt commented 6 days ago

Thanks for the issue. Do you have a minimal reproducer though? Because the web socket delay timeout works for me...

daweedm commented 6 days ago

Thanks for your quick response. The issue we encounter is websockets being closed on configuration reload, but after re-reading #5471, I may have misunderstood the purpose of the stream_close_delay option.

Does this option ensure that websockets are not closed at all on config reload (which is what I'm looking for), or that they are always closed, just after a given delay?

francislavoie commented 6 days ago

stream_close_delay defers the closing of websockets until the end of the delay, starting from the config reload. The point is to allow connections to "die off" naturally but still set a limit for how long they'll stick around.

This is important because otherwise, a connection that stays open forever will cause a reference to the old config to stay in memory and never get cleaned up until all the connections are closed.

Setting it is basically saying "yes I understand memory usage may increase because old configs will stick around and not get garbage collected until the delay is over". Just set it to something long enough that your connections can reasonably die off naturally, but not so long that memory usage from config reloads becomes a concern.
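
Concretely, that just means picking a bigger value than your current 60s, e.g. something in this spirit (illustrative only; the 30m is an arbitrary number, and the dial address is simply copied from your config above):

{
  "handler": "reverse_proxy",
  "stream_close_delay": "30m",
  "upstreams": [
    { "dial": "our-dial.com:30190" }
  ]
}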

But either way, your apps should be set up such that a disconnect will trigger a reconnect attempt. Not having client-side reconnections is a bug.

daweedm commented 6 days ago

Our websocket clients already have a reconnection mechanism implemented. The problem is that the reconnections happen too frequently.

In our case, we use a Caddy server as a central point of access to several web servers that are unrelated to each other. We have ~100 servers, each of which triggers the registration of a new reverse proxy rule through the Caddy admin API. This means that every reverse proxy rule serves as a gateway to a completely independent web server, which is used by websocket clients that only communicate with their related web server.

Some of the ~100 servers have an unreliable network, which causes the config to change frequently: each time a machine disconnects, we update the Caddy config to remove the reverse proxy rule related to the machine that disconnected, and vice versa, when a machine connects to the network, we update the Caddy config to add the reverse proxy rule for that machine. We are required to remove the reverse proxy rule when a server disconnects because it allows us to show a custom error page informing the related users that their server is offline (that's the purpose of the "@id": "srv0-error" rule in the config above).
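
For reference, the removal we do when a machine goes offline is essentially a single admin API call like this (sketched here for clarity, reusing the @id from the config above and the admin port from our healthcheck):

# Delete the reverse proxy route of the machine that just disconnected, by its @id.
curl -X DELETE http://127.0.0.1:2016/id/srv0-u982998-p443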

Each configuration change in Caddy leads to websockets closing. So in our case, the servers with unreliable network connections directly impact the other, independent web servers and their web clients that use websockets: each time a server disconnects, it triggers a Caddy config reload, and this cuts the websocket connection for every other user with a stable network.

I think that Caddy should only free the resources (including websockets) that are affected by changes made in the configuration. It's normal to close/clean up sockets that are no longer part of the config, but it seems less relevant to close sockets or any other resource that remains unchanged in the new configuration during a reload (as opposed to a restart, for which this behavior would be justified).

francislavoie commented 6 days ago

Then what you should do is avoid reloading the config so often. You can implement a dynamic upstreams module, which would let you change the list of upstreams without changing the config.
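
For example, with the standard A-record source it looks roughly like this (a sketch only; the DNS name, port, and refresh interval are placeholders you'd swap for your own):

{
  "handler": "reverse_proxy",
  "dynamic_upstreams": {
    "source": "a",
    "name": "backends.internal.example.com",
    "port": "30190",
    "refresh": "30s"
  }
}

That way the set of backends is resolved at request time instead of being baked into the config, so upstream membership changes don't require a config reload.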

I think that caddy should only free the resources (including websockets) that are affected by changes made in the configuration.

Way easier said than done. We can't do that. We don't diff configs, the next config has no clue what was in the previous config.

It's normal to close/clean up sockets that are no longer part of the config, but it seems less relevant to close sockets or any other resource that remains unchanged in the new configuration during a reload (as opposed to a restart, for which this behavior would be justified).

The connection is still active, and therefore is still literally inside a code path that has a reference to the config. It doesn't "return" from that scope until the connection is closed, so the reference stays, and therefore garbage collection can't happen.

francislavoie commented 6 days ago

Anyway I don't see a bug here, just sounds like a misunderstanding of the feature.

mholt commented 6 days ago

I'm also willing to work on a more proper websocket solution, but it might not be simple. We'd just need a sponsorship to do it. @daweedm Is this something your company would be willing to sponsor?

daweedm commented 5 days ago

Way easier said than done. We can't do that. We don't diff configs, the next config has no clue what was in the previous config.

I'm aware that open source projects like Caddy have limited resources and there is a lot of other work you have to focus on. I just wanted to share an issue we've encountered in our enterprise-grade application, where this kind of "detail" is a reason for us to switch to another software solution. I think it's always good for open source projects to be challenged with advanced use cases like ours, since it may help the project evolve, especially when this kind of topology, with many nodes on unreliable networks, is very popular and has many applications nowadays (load balancers, fail-overs, etc.).

Anyway I don't see a bug here, just sounds like a misunderstanding of the feature.

It's not a bug, it's a missing feature (: Anyway, thanks for your help!

@mholt Thanks for your proposal. TBH it naturally depends on the cost. If it exceeds the cost we would bear to switch to e.g. the paid version of nginx, which already has this feature, it will be more difficult for us to choose to sponsor that development.

francislavoie commented 5 days ago

It's not a bug, it's a missing feature (:

I disagree, the feature is there. Just use a longer delay than 60 seconds.

Or like I said, use dynamic upstreams (use one of the existing modules or write your own) which would let you avoid config reloads most of the time, probably.

daweedm commented 5 days ago

It's not a bug, it's a missing feature (:

I disagree, the feature is there. Just use a longer delay than 60 seconds.

We can't set something like 1 day/week or more without risking a memory leak and running out of CPU/RAM resources.

As you said, you don't diff between config changes, and that's what I was pointing to as a missing feature. This is a design choice you made, which is indeed easier to implement (as it's a hard reset). But going with the "diff approach" and cleaning up only the resources that are no longer needed is objectively better software design in terms of performance, and moreover that's how other well-known reverse proxies work. Thanks for your help, and I wish you a nice day.

daweedm commented 5 days ago

It's not a bug, it's a missing feature (:

I disagree, the feature is there. Just use a longer delay than 60 seconds.

Or like I said, use dynamic upstreams (use one of the existing modules or write your own) which would let you avoid config reloads most of the time, probably.

We'll check the dynamic upstreams, thanks

mholt commented 4 days ago

@daweedm Can you post a sample config for nginx that does what you need?

How many nginx instances do/would you be running?

The cost mainly depends on the implementation. I'll try to figure out an approach and generate an estimate for you.