Closed ncw closed 1 year ago
FYI this was the caddy problem
Dec 04 05:54:10 rclone-web caddy[28708]: runtime: failed to create new OS thread (have 513 already; errno=11)
Dec 04 05:54:10 rclone-web caddy[28708]: runtime: may need to increase max user processes (ulimit -u)
Dec 04 05:54:10 rclone-web caddy[28708]: fatal error: newosproc
Dec 04 05:54:10 rclone-web caddy[28708]: runtime stack:
Not sure why caddy should need > 512 OS threads - I suspect a thread leak. Happy for caddy to crash and be restarted by systemd at this point.
Caddy ran for 83 days before crashing, so if there is a leak it is quite slow!
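As a stopgap, the `ulimit -u` hint in the log maps onto systemd's `LimitNPROC=` directive. The sketch below assumes the 512-thread ceiling comes from a `LimitNPROC=` setting in the unit, which `systemctl show caddy -p LimitNPROC` should confirm. The value 1024 is purely illustrative, and raising it would only delay a crash if there really is a slow thread leak.

```ini
# Drop-in override sketch; assumes the ceiling is set by LimitNPROC= in the unit.
[Service]
# Hypothetical higher limit; a slow leak would still reach it eventually.
LimitNPROC=1024
```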
Hey Nick, thanks for the contribution!
First, I'd like to understand why your process is crashing. I'm not sure I've seen this error before, or at least not recently. Can you share your config? Are you using any third-party plugins? What are your traffic patterns like?
We have several high-volume users who keep Caddy running for months with no problems, so I'm curious what is different or unique about your setup or traffic patterns! :)
Caddy returns specific exit codes, and it shouldn't be restarted in some cases, otherwise it could get stuck in infinite retry loops, potentially hitting rate limits for ACME and such. See https://caddyserver.com/docs/command-line#exit-codes
So for that reason, I don't think it's a good idea to add auto-restart in general, but it is reasonable to add it after you've set up Caddy and validated that the config is production ready. It can be added as an override with https://caddyserver.com/docs/running#overrides. I think we should probably document somewhere in that page that it's a decent idea to add auto-restart once set up and running.
I like that idea. Auto-restarts should be ok if you know your command and config are valid and correct.
@IndeedNotJames also pointed out in Slack that systemd has a RestartPreventExitStatus= directive that might be worth looking into.
Basically, Caddy shouldn't be auto-restarted under any circumstance when the exit code is 1. We use that code a lot, but less often in code paths that systemd tends to automate. But still...
@mholt
> First, I'd like to understand why your process is crashing. I'm not sure I've seen this error before, or at least not recently. Can you share your config? Are you using any third-party plugins? What are your traffic patterns like?
My config is quite straightforward - it's a whole load of static servers, except for slack-invite, which isn't running anyway.
The only unusual thing about this server is that /var/www/beta.apt.rclone.org is an rclone mount of a swift object storage system.
I suspect there is some interaction between caddy and rclone mount which causes the problem. It's probably rclone's fault!
> We have several high-volume users who keep Caddy running for months with no problems, so I'm curious what is different or unique about your setup or traffic patterns! :)
rclone.org serves about 1TB a day so quite busy but not astronomical.
@francislavoie
Would you be happy if I updated the PR to add RestartPreventExitStatus=? I think RestartPreventExitStatus=1 would do exactly the right thing.
I did not know about the https://caddyserver.com/docs/running#overrides page! I guess it creates a separate file which won't get overwritten by upgrades? Where does it store it?
If the consensus is that this is better left to user preference, then I can certainly use that mechanism, though it is one more thing to remember when setting up a new server!
I'd like to hear from @carlwgeorge as well since he knows systemd quirks more intimately than most of us, as a COPR maintainer.
Re overrides, yeah, it makes a file in /etc/systemd somewhere (if you run the command, it'll tell you where exactly) whereas the package installed one is in /lib/systemd/system/caddy.service.
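For reference, a minimal sketch of what such an override could contain, combining auto-restart with the exit-code-1 exception discussed above. The file path is where `systemctl edit caddy` typically writes a drop-in, so confirm it on your system. The `RestartSec=` value is an illustrative assumption, not something agreed in this thread.

```ini
# Typical drop-in location created by `systemctl edit caddy` (verify locally):
# /etc/systemd/system/caddy.service.d/override.conf
[Service]
# Restart after crashes, fatal signals, and other unclean exits...
Restart=on-failure
# ...but never when Caddy exits with status 1.
RestartPreventExitStatus=1
# Illustrative pause between restart attempts (assumed value).
RestartSec=5s
```

A Go runtime fatal error such as the `newosproc` crash above exits with a non-zero status other than 1, so `Restart=on-failure` should cover it, while exits with status 1 are left alone.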
> So for that reason, I don't think it's a good idea to add auto-restart in general, but it is reasonable to add it after you've set up Caddy and validated that the config is production ready. It can be added as an override with https://caddyserver.com/docs/running#overrides.
This sums up my feelings perfectly.
Okay, I'll close this.
@ncw if you'd like to take a crack at it, you could contribute a change to the docs to document how to set up this override? The docs are here: https://github.com/caddyserver/website/blob/master/src/docs/markdown/running.md
Thanks for the proposal!
I've done that here: https://github.com/caddyserver/website/pull/284
(I did it with the web GUI as an experiment as that is what I ask rclone users to do and it seemed to work well!)
Excellent. Thanks for the great discussion. I agree with the conclusion! Appreciate your chiming in, Nick!
Alas, caddy does crash occasionally, and the current systemd service file does not auto-restart it, which causes an outage.
This adds auto-restart to the two caddy service files.