caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com
Apache License 2.0

Running two container instances of Caddy with shared config storage can result in a split-brain for configuration changes #5954

Open · angrygreenfrogs opened this issue 11 months ago

angrygreenfrogs commented 11 months ago

According to the docs, you should be able to run X number of Caddy instances that share a storage mount for their configuration files and I believe they're supposed to gracefully handle locking, detecting config changes, etc.

My setup is a Caddy container running inside an Azure Web App container service (2 instances, automatically load balanced), plus an Azure file share for shared storage (internally this is mounted via SMB).

I've noticed this seems to work perfectly fine if I only make config changes locally via SSH to the primary container. The second container seems to pick those changes up correctly, and the two end up with a consistent configuration.

I also wanted to expose access to the API from external sources so that we can build some internal automation around it for various reasons. I created a reverse proxy from https://external.url/api to localhost:2019 with a password on it. That works fine.
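
For anyone curious, a minimal sketch of such a proxy in Caddy's JSON config might look like the following (external.url stands in for the real domain, the password hash is a placeholder from `caddy hash-password`, and depending on your setup you may also need to rewrite the Host header so the admin endpoint accepts the proxied request):

```json
{
  "apps": {
    "http": {
      "servers": {
        "srv0": {
          "listen": [":443"],
          "routes": [
            {
              "match": [{ "host": ["external.url"], "path": ["/api/*"] }],
              "handle": [
                {
                  "handler": "authentication",
                  "providers": {
                    "http_basic": {
                      "accounts": [
                        { "username": "admin", "password": "<bcrypt hash placeholder>" }
                      ]
                    }
                  }
                },
                { "handler": "rewrite", "strip_path_prefix": "/api" },
                {
                  "handler": "reverse_proxy",
                  "upstreams": [{ "dial": "localhost:2019" }],
                  "headers": { "request": { "set": { "Host": ["localhost:2019"] } } }
                }
              ]
            }
          ]
        }
      }
    }
  }
}
```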

However, this also means that API requests may land on either of the running containers, potentially at almost the same time (so there's a possible race condition).

I've found that if I'm quickly sending API requests through the load balancer, I end up with half of the config on instance A and the other half of the config on instance B.

As above, if I make the same changes via SSH to just instance A, then everything works fine and instance B picks up on it.

Here's a full example showing how I've set up the instances; at the end is a bash FOR loop showing how I might quickly create a bunch of sites via API requests and ultimately end up with the split-brain scenario: github-explanation.txt

Please let me know if I can supply any other information.

francislavoie commented 11 months ago

According to the docs, you should be able to run X number of Caddy instances that share a storage mount for their configuration files and I believe they're supposed to gracefully handle locking, detecting config changes, etc.

No, we never claim that.

You may share /data, which Caddy will use to coordinate certificate issuance using files as locks.
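
For context: in the official Docker image, /data is already Caddy's default data directory, so a shared mount there needs no extra configuration. If the shared mount lives elsewhere, a minimal sketch of pointing storage at it (the path is just an example) would be:

```json
{
  "storage": {
    "module": "file_system",
    "root": "/mnt/caddy-data"
  }
}
```

Again, this only coordinates certificate storage and its locks, not the running config.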

But if you want to share config across instances, it's up to you to push config updates to every instance.

You may use https://caddyserver.com/docs/json/admin/remote/ to enable a secure admin endpoint for remote access.
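
A rough sketch of what that looks like in JSON (the listen address, identifier, and public key are placeholders; the remote endpoint uses TLS client certificates, which is why the identity config is needed, and issuers can be added under identity if the defaults don't fit):

```json
{
  "admin": {
    "identity": {
      "identifiers": ["caddy-1.internal"]
    },
    "remote": {
      "listen": ":2021",
      "access_control": [
        {
          "public_keys": ["<base64-encoded client public key>"],
          "permissions": [
            { "paths": ["/config/"], "methods": ["GET", "POST", "PATCH", "DELETE"] }
          ]
        }
      ]
    }
  }
}
```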

You may also use https://caddyserver.com/docs/json/admin/config/load/ to pull an up-to-date config from an HTTP endpoint (or write your own plugin to provide a config).
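
A minimal sketch of the pull approach, assuming a hypothetical internal endpoint that serves the cluster's config:

```json
{
  "admin": {
    "config": {
      "load": {
        "module": "http",
        "url": "http://config-server.internal:8080/caddy.json"
      }
    }
  }
}
```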

angrygreenfrogs commented 11 months ago

Thank you for that clarification!

I might only suggest the team consider adding an explicit statement about that to the docs? It seems like it'd be a very common use case/issue: considering Caddy provides an API for managing its configuration, it feels weird that I can't use that API to easily share config changes automatically across more than one instance.

I feel like it's implied in a number of places, and so I kept feeling like this should work, but I see what you mean that there's no clear statement that config changes are in fact automatically handled across a cluster.

e.g. I was reading things like this: https://caddyserver.com/docs/architecture#managing-configuration and https://caddy.community/t/load-balancing-caddy/10467

mholt commented 11 months ago

Yes, I agree -- we can do better at clarifying this.

For now, I would recommend pushing your config to all your instances; or you can use a config loader module to have each instance pull configs, and load_delay can make it a repetitive loop.
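
For example (same hypothetical endpoint as the loader sketch above; without load_delay the config is only pulled once at startup):

```json
{
  "admin": {
    "config": {
      "load": {
        "module": "http",
        "url": "http://config-server.internal:8080/caddy.json"
      },
      "load_delay": "30s"
    }
  }
}
```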

There's also this module @mohammed90 started working on yesterday, partly inspired by your request I think: https://github.com/mohammed90/caddy-storage-loader

angrygreenfrogs commented 11 months ago

Thanks again, @mholt, all!

I can definitely see the components are there that could allow you to manage multiple instances that share a common config, at least for the use case of a configuration that is modified directly at the central storage location.

The trickier use case, my original goal, was for being able to use the remote management API directly on a cluster of servers. This is where you run into the problems around API requests possibly going to any cluster member, where that member may not be in sync with the others.

Even if you use the config loader/load_delay features, you could still run into any number of obvious race conditions.

Today, the only way I can see to do that out-of-the-box would be if you had a separate single "config" caddy server that existed purely to act as the remote management API endpoint, and as the source of truth for the config loader that the cluster pulls from. That would work, and it's not a bad option really. A single Caddy instance can already handle so much traffic that you'd have to be pretty serious about creating a highly scalable system at that point.
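
As a sketch of that idea (hostnames and paths are made up, and this assumes a trusted internal network; in practice you'd want TLS and auth in front of it), the "config" server could simply serve the cluster config from its local disk, and every cluster member would point the HTTP config loader mentioned above at it:

```json
{
  "apps": {
    "http": {
      "servers": {
        "config": {
          "listen": [":8080"],
          "routes": [
            {
              "match": [{ "path": ["/caddy.json"] }],
              "handle": [
                { "handler": "file_server", "root": "/etc/caddy/cluster" }
              ]
            }
          ]
        }
      }
    }
  }
}
```

Management API requests would then go only to that one instance, and whatever it serves at /caddy.json is the single source of truth the cluster pulls from.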

For a self-contained cluster that's manageable through the API... it'd take some dev work. Thinking about it this morning, I suppose it'd go something like this:

I'm not familiar with the Caddy code as I'm very new to this.. but it sounds like you guys already have some similar work happening around certificate management... that could perhaps be adapted to work for config management as well?

In any case, I realize you all have enough on your plates, so I just hope this thread is useful for anyone else who runs into the same question. And yes, at least some basic notes in the docs about the current state of affairs would, I'm sure, be a big help for others.

mholt commented 11 months ago

Even if you use the config loader/load_delay features, you could still run into any number of obvious race conditions.

I am not banging on all cylinders today... :sweat_smile: What race conditions are there, exactly? The instances will load the configs on the load_delay timer, no matter how fast you update the config in storage.

Today, the only way I can see to do that out-of-the-box would be if you had a separate single "config" caddy server that existed purely to act as the remote management API endpoint, and as the source of truth for the config loader that the cluster pulls from. That would work, and it's not a bad option really.

True -- and you can already do this, and I feel like it's kind of the same as just pulling from storage directly, except instead of sharing the storage you're simply calling out to an HTTP endpoint (a Caddy instance) that returns the config from its local disk.

I'm not familiar with the Caddy code as I'm very new to this.. but it sounds like you guys already have some similar work happening around certificate management... that could perhaps be adapted to work for config management as well?

Yeah. We could look into this too. That's probably enough work that I'd recommend a sponsorship to cover the development. But happy to look into this!