Closed: daweedm closed this issue 6 days ago
Thanks for the issue. Do you have a minimal reproducer though? Because the web socket delay timeout works for me...
Thanks for your quick response.
The issue we encounter is websockets being closed on configuration reload, but after rereading #5471, I may not have understood the purpose of the stream_close_delay option.
Does this option ensure that websockets are not closed at all on config reload (which is what I'm looking for), or that they are always closed, just after a given delay?
stream_close_delay defers the closing of websockets until the end of the delay, starting from the config reload. The point is to allow connections to "die off" naturally but still set a limit for how long they'll stick around.
This is important because otherwise, a connection that stays open forever will cause a reference to the old config to stay in memory and never get cleaned up until all the connections are closed.
Setting it is basically saying "yes I understand memory usage may increase because old configs will stick around and not get garbage collected until the delay is over". Just set it to something long enough that your connections can reasonably die off naturally, but not so long that memory usage from config reloads becomes a concern.
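For reference, a minimal sketch of where this setting lives in the JSON config (the listener, upstream address, and "5m" value here are illustrative; stream_close_delay is a duration field on the reverse_proxy handler):

```json
{
  "apps": {
    "http": {
      "servers": {
        "srv0": {
          "listen": [":443"],
          "routes": [
            {
              "handle": [
                {
                  "handler": "reverse_proxy",
                  "upstreams": [{ "dial": "10.0.0.5:8080" }],
                  "stream_close_delay": "5m"
                }
              ]
            }
          ]
        }
      }
    }
  }
}
```

With this, websocket streams proxied by that handler survive a config reload for up to 5 minutes before being forcibly closed.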
But either way, your apps should be set up such that a disconnect will trigger a reconnect attempt. Not having client-side reconnections is a bug.
Our websocket clients already have a reconnection mechanism implemented. The problem is that the reconnections happen too frequently.
In our case, we use a caddy server as a central access point for several mutually unrelated web servers. We have ~100 servers, each of which triggers the registration of a new reverse proxy rule via the caddy admin API. This means that every reverse proxy rule serves as a gateway to a completely independent web server, used by websocket clients that only communicate with their related web server.
Some of the ~100 servers have an unreliable network, which causes the config to change frequently: each time a machine disconnects, we update the caddy config to remove the reverse proxy rule for that machine. Conversely, when a machine connects to the network, we update the caddy config to add the rule back. We are required to remove the reverse proxy rule when a server disconnects because it lets us show a custom error page informing the affected users that their server is offline (that's the purpose of the "@id": "srv0-error" rule given in the config above).
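As an illustration of that workflow (the ID, hostname, and upstream address below are hypothetical), each per-machine route can carry an "@id" so it can be added and removed individually through the admin API without touching the rest of the config:

```json
{
  "@id": "machine42-route",
  "match": [{ "host": ["machine42.example.com"] }],
  "handle": [
    {
      "handler": "reverse_proxy",
      "upstreams": [{ "dial": "10.0.42.1:8080" }]
    }
  ]
}
```

When the machine drops off the network, the rule can then be removed with a request like `DELETE /id/machine42-route` against the admin endpoint; note, however, that each such mutation still counts as a config reload.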
Each configuration change in caddy leads to websockets closing. So in our case, the servers with unreliable network connections directly impact the other, independent web servers and their clients that use websockets: each time a server disconnects, it triggers a caddy config reload, which cuts the websocket connections of every other user, including those with a stable network.
I think that caddy should only free the resources (including websockets) that are affected by changes made in the configuration. It's normal to close/clean up sockets that are no longer part of the config, but it seems less justified to close sockets or any other resource that remains unchanged in the new configuration during a reload (as opposed to a restart, for which this behavior would be justified).
Then what you should do is avoid reloading the config so often. You can implement a dynamic upstreams module, which would let you change the list of upstreams without changing the config.
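For example (hostname and timings illustrative), caddy's built-in A/AAAA-record dynamic upstreams source resolves the upstream list via DNS at request time, so the set of backends can change without any config reload at all:

```json
{
  "handler": "reverse_proxy",
  "dynamic_upstreams": {
    "source": "a",
    "name": "backends.internal.example.com",
    "port": "8080",
    "refresh": "30s"
  }
}
```

This fits cases where upstreams share one route; for per-machine routing like the setup described above, a custom dynamic upstreams module would be the equivalent approach.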
I think that caddy should only free the resources (including websockets) that are affected by changes made in the configuration.
Way easier said than done. We can't do that. We don't diff configs, the next config has no clue what was in the previous config.
It's normal to close/clean up sockets that are no longer part of the config, but it seems less justified to close sockets or any other resource that remains unchanged in the new configuration during a reload (as opposed to a restart, for which this behavior would be justified)
The connection is still active, and therefore is still literally inside a code path that has a reference to the config. It doesn't "return" from that scope until the connection is closed, so the reference stays, and therefore garbage collection can't happen.
Anyway I don't see a bug here, just sounds like a misunderstanding of the feature.
I'm also willing to work on a more proper websocket solution, but it might not be simple. We'd just need a sponsorship to do it. @daweedm Is this something your company would be willing to sponsor?
Way easier said than done. We can't do that. We don't diff configs, the next config has no clue what was in the previous config.
I'm aware that open source projects like caddy have limited resources and there is a lot of other work you have to focus on. I just wanted to share an issue we've encountered in our enterprise-grade application, where this kind of "detail" is a reason for us to switch to another software solution. I think it's always good for open source projects to be challenged with advanced use cases like ours, since it may help the project evolve, especially when this kind of topology, with many nodes on unreliable networks, is very common and has many applications nowadays (load balancers, fail-overs, etc.).
Anyway I don't see a bug here, just sounds like a misunderstanding of the feature.
It's not a bug, it's a missing feature (: Anyway thanks for your help !
@mholt Thanks for your proposition. TBH it naturally depends on the cost. If it exceeds what we could instead spend to switch to, e.g., the paid version of nginx, which already has this feature, it will be more difficult for us to choose to sponsor that development.
It's not a bug, it's a missing feature (:
I disagree, the feature is there. Just use a longer delay than 60 seconds.
Or like I said, use dynamic upstreams (use one of the existing modules or write your own) which would let you avoid config reloads most of the time, probably.
It's not a bug, it's a missing feature (:
I disagree, the feature is there. Just use a longer delay than 60 seconds.
We can't set something like 1 day/week or more without risking a memory leak and running out of CPU/RAM resources.
As you said, you don't diff between config changes, and that's what I was pointing to as a missing feature. This is a design choice you made, which is indeed easier to implement (as it's a hard reset). But taking the "diff approach" and cleaning up only the resources that are no longer needed is objectively better software design in terms of performance, and moreover that's the way other well-known reverse proxies work. Thanks for your help and have a nice day.
It's not a bug, it's a missing feature (:
I disagree, the feature is there. Just use a longer delay than 60 seconds.
Or like I said, use dynamic upstreams (use one of the existing modules or write your own) which would let you avoid config reloads most of the time, probably.
We'll check the dynamic upstreams, thanks
@daweedm Can you post a sample config for nginx that does what you need?
How many nginx instances do/would you be running?
The cost mainly depends on the implementation. I'll try to figure out an approach and generate an estimate for you.
1. Environment
1a. Operating system and version
Docker version 20.10.21, build baeda1f
1b. Caddy version (run caddy version or paste commit SHA)
2. Description
2a. What happens (briefly explain what is wrong)
Websocket connections are still closed on config reload despite #5471. We added the stream_close_delay flag to our config, but once we add a new handle using the API, we notice that all websocket clients get disconnected for a short time. This is systematic: we made a script that triggers the API call, and we clearly see the disconnect in every websocket client's logs.
2d. Workaround(s)
None
2e. Relevant links
#5471
3. Tutorial (minimal steps to reproduce the bug)
Example of configuration: