caddyserver / dist

Resources for packaging and distributing Caddy
Apache License 2.0

Auto restart caddy if it crashes under systemd #92

Closed ncw closed 1 year ago

ncw commented 1 year ago

Alas, caddy does crash occasionally, and the current systemd service file does not auto-restart it, which causes an outage.

This adds auto restart to the two caddy service files.

ncw commented 1 year ago

FYI this was the caddy problem

Dec 04 05:54:10 rclone-web caddy[28708]: runtime: failed to create new OS thread (have 513 already; errno=11)
Dec 04 05:54:10 rclone-web caddy[28708]: runtime: may need to increase max user processes (ulimit -u)
Dec 04 05:54:10 rclone-web caddy[28708]: fatal error: newosproc
Dec 04 05:54:10 rclone-web caddy[28708]: runtime stack:

Not sure why caddy should need > 512 OS threads - I suspect a thread leak. Happy for caddy to crash and be restarted by systemd at this point.

Caddy ran for 83 days before crashing, so if there is a leak it is quite slow!
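Under systemd, the runtime's hint about `ulimit -u` corresponds to the `LimitNPROC=` directive. If the 512-thread ceiling comes from such a setting in the unit file (an assumption; nothing in this thread confirms where the limit is set), a drop-in override along these lines could raise it. The value 1024 is arbitrary and purely illustrative.

```ini
[Service]
# Assumption: the limit being hit is the unit's process/thread limit.
# 1024 is a hypothetical value chosen only for illustration.
LimitNPROC=1024
```

If there really is a slow thread leak, a higher limit would only postpone the crash rather than prevent it.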

mholt commented 1 year ago

Hey Nick, thanks for the contribution!

First, I'd like to understand why your process is crashing. I'm not sure I've seen this error before, or at least not recently. Can you share your config? Are you using any third-party plugins? What are your traffic patterns like?

We have several high-volume users who keep Caddy running for months with no problems, so I'm curious what is different or unique about your setup or traffic patterns! :)

francislavoie commented 1 year ago

Caddy returns specific exit codes, and it shouldn't be restarted in some cases, otherwise it could get stuck in infinite retry loops, potentially hitting rate limits for ACME and such. See https://caddyserver.com/docs/command-line#exit-codes

So for that reason, I don't think it's a good idea to add auto-restart in general, but it is reasonable to add it after you've set up Caddy and validated that the config is production ready. It can be added as an override with https://caddyserver.com/docs/running#overrides. I think we should probably document somewhere in that page that it's a decent idea to add auto-restart once set up and running.

mholt commented 1 year ago

I like that idea. Auto-restarts should be ok if you know your command and config are valid and correct.

@IndeedNotJames also pointed out in Slack that systemd has a RestartPreventExitStatus= directive that might be worth looking into.

Basically, Caddy shouldn't be auto-restarted under any circumstance when the exit code is 1. We use that code a lot, but less often in code paths that systemd tends to automate. But still...
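Putting the suggestions above together, a drop-in override might look like the sketch below. It assumes the override is created with the mechanism described on the overrides page linked earlier. `Restart=on-failure` and the 5-second `RestartSec=` are illustrative choices, not values agreed on in this thread.

```ini
[Service]
# Restart caddy when it exits uncleanly (non-zero exit code, signal, timeout).
# on-failure is an assumed choice; the thread does not name a Restart= value.
Restart=on-failure
# Never restart on exit code 1, which Caddy uses for errors such as a bad
# command or config, so a broken setup cannot loop and hit ACME rate limits.
RestartPreventExitStatus=1
# Assumed short delay between restart attempts.
RestartSec=5s
```

A Go runtime fatal error like the one logged above normally exits with status 2, so it would still trigger a restart under this sketch.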

ncw commented 1 year ago

@mholt

> First, I'd like to understand why your process is crashing. I'm not sure I've seen this error before, or at least not recently. Can you share your config? Are you using any third-party plugins? What are your traffic patterns like?

My config is quite straightforward - it's a whole load of static servers, except for slack-invite, which isn't running anyway.

rclone.org Caddyfile

```
# rclone web servers

# rcl.one
rclone.org, test.rclone.org {
    file_server {
        root /var/www/rclone.org
    }
    log {
        output file /var/www/logs/rclone.org.log {
            roll_size 100 # Rotate after 100 MB
        }
    }
    encode gzip
    tls nick@craig-wood.com
}

slack-invite.rclone.org {
    reverse_proxy localhost:3001
    log {
        output file /var/www/logs/slack-invite.rclone.org.log {
            roll_size 100 # Rotate after 100 MB
        }
    }
}

oauth.rclone.org {
    redir http://127.0.0.1:53682{uri}
}

www.rclone.org:80, rclone.com:80, www.rclone.com:80 {
    redir https://rclone.org
    log {
        output file /var/www/logs/www.rclone.org.log {
            roll_size 100 # Rotate after 100 MB
        }
    }
    # tls nick@craig-wood.com
}

beta.rclone.org {
    file_server browse {
        root /mnt/beta.rclone.org
    }
    log {
        output file /var/www/logs/beta.rclone.org.log {
            roll_size 100 # Rotate after 100 MB
        }
    }
    tls nick@craig-wood.com
}

downloads.rclone.org {
    file_server browse {
        root /var/www/downloads.rclone.org
    }
    log {
        output file /var/www/logs/downloads.rclone.org.log {
            roll_size 100 # Rotate after 100 MB
        }
    }
    tls nick@craig-wood.com
}

pub.rclone.org {
    file_server browse {
        root /var/www/pub.rclone.org
    }
    log {
        output file /var/www/logs/pub.rclone.org.log {
            roll_size 100 # Rotate after 100 MB
        }
    }
    tls nick@craig-wood.com
}

tip.rclone.org {
    file_server browse {
        root /var/www/tip.rclone.org
    }
    log {
        output file /var/www/logs/tip.rclone.org.log {
            roll_size 100 # Rotate after 100 MB
        }
    }
    tls nick@craig-wood.com
}

apt.rclone.org {
    file_server browse {
        root /var/www/apt.rclone.org
    }
    log {
        output file /var/www/logs/apt.rclone.org.log {
            roll_size 100 # Rotate after 100 MB
        }
    }
    tls nick@craig-wood.com
}

beta.apt.rclone.org {
    file_server browse {
        root /var/www/beta.apt.rclone.org
    }
    log {
        output file /var/www/logs/beta.apt.rclone.org.log {
            roll_size 100 # Rotate after 100 MB
        }
    }
    tls nick@craig-wood.com
}

gpython.org {
    file_server {
        root /var/www/gpython.org
    }
    log {
        output file /var/www/logs/gpython.org.log {
            roll_size 100 # Rotate after 100 MB
        }
    }
    encode gzip
    tls nick@craig-wood.com
}

www.gpython.org:80, gpython.com:80, www.gpython.com:80 {
    redir https://gpython.org
    log {
        output file /var/www/logs/www.gpython.org.log {
            roll_size 100 # Rotate after 100 MB
        }
    }
}
```

The only unusual thing about this server is that /var/www/beta.apt.rclone.org is an rclone mount of a swift object storage system.

I suspect there is some interaction between caddy and rclone mount which causes the problem. It's probably rclone's fault!

> We have several high-volume users who keep Caddy running for months with no problems, so I'm curious what is different or unique about your setup or traffic patterns! :)

rclone.org serves about 1TB a day so quite busy but not astronomical.

@francislavoie

Would you be happy if I updated the PR to add RestartPreventExitStatus=?

I think

RestartPreventExitStatus=1

Would do exactly the right thing.

I did not know about the https://caddyserver.com/docs/running#overrides page! I guess it creates a separate file which won't get overwritten by upgrades? Where does it store it?

If the consensus is that this is better left to user preference, then I can certainly use that mechanism, though it is one more thing to remember when setting up a new server!

francislavoie commented 1 year ago

I'd like to hear from @carlwgeorge as well since he knows systemd quirks more intimately than most of us, as a COPR maintainer.

Re overrides, yeah, it makes a file in /etc/systemd somewhere (if you run the command, it'll tell you where exactly), whereas the package-installed one is in /lib/systemd/system/caddy.service.

carlwgeorge commented 1 year ago

> So for that reason, I don't think it's a good idea to add auto-restart in general, but it is reasonable to add it after you've set up Caddy and validated that the config is production ready. It can be added as an override with https://caddyserver.com/docs/running#overrides.

This sums up my feelings perfectly.

francislavoie commented 1 year ago

Okay, I'll close this.

@ncw if you'd like to take a crack at it, you could contribute a change to the docs to document how to set up this override? The docs are here: https://github.com/caddyserver/website/blob/master/src/docs/markdown/running.md

Thanks for the proposal!

ncw commented 1 year ago

I've done that here: https://github.com/caddyserver/website/pull/284

(I did it with the web GUI as an experiment as that is what I ask rclone users to do and it seemed to work well!)

mholt commented 1 year ago

Excellent. Thanks for the great discussion. I agree with the conclusion! Appreciate your chiming in, Nick!