WGH- opened this issue 3 months ago (status: Open)
Thanks for opening an issue! We'll look into this.
It's not immediately clear to me what is going on, so I'll need your help to understand it better.
Ideally, we need to be able to reproduce the bug in the most minimal way possible using the latest version of Caddy. This allows us to write regression tests to verify the fix is working. If we can't reproduce it, then you'll have to test our changes for us until it's fixed -- and then we can't add test cases, either.
I've attached a template below that will help make this easier and faster! This will require some effort on your part -- please understand that we will be dedicating time to fix the bug you are reporting if you can just help us understand it and reproduce it easily.
This template will ask for some information you've already provided; that's OK, just fill it out the best you can. :+1: I've also included some helpful tips below the template. Feel free to let me know if you have any questions!
Thank you again for your report, we look forward to resolving it!
## 1. Environment
### 1a. Operating system and version
```
paste here
```
### 1b. Caddy version (run `caddy version` or paste commit SHA)
This should be the latest version of Caddy:
```
paste here
```
## 2. Description
### 2a. What happens (briefly explain what is wrong)
### 2b. Why it's a bug (if it's not obvious)
### 2c. Log output
```
paste terminal output or logs here
```
### 2d. Workaround(s)
### 2e. Relevant links
## 3. Tutorial (minimal steps to reproduce the bug)
Environment: Please fill out your OS and Caddy versions, even if you don't think they are relevant. (They are always relevant.) If you built Caddy from source, provide the commit SHA and specify your exact Go version.
Description: Describe at a high level what the bug is. What happens? Why is it a bug? Not all bugs are obvious, so convince readers that it's actually a bug.
Tutorial: What are the minimum required specific steps someone needs to take in order to experience the same bug? Your goal here is to make sure that anyone else can have the same experience with the bug as you do. You are writing a tutorial, so make sure to carry it out yourself before posting it.
Example of a tutorial:

1. Create a config file:
   ```
   { ... }
   ```
2. Open a terminal and run Caddy:
   ```
   $ caddy ...
   ```
3. Make an HTTP request:
   ```
   $ curl ...
   ```
4. Notice that the result is ___ but it should be ___.
I have a theory. The bug is here, in this very line: https://github.com/caddyserver/certmagic/blob/3bad5b6bb595b09c14bd86ff0b365d302faaf5e2/maintain.go#L364
When `stapleOCSP` fails to write the staple file (which is not actually a hard error, since `cert.ocsp` is still updated), `RenewManagedCertificates` skips the rest of the steps and doesn't update the staple (see the `for certKey, update := range updated` loop).
When the server is started, the error is logged, but otherwise ignored: https://github.com/caddyserver/certmagic/blob/3bad5b6bb595b09c14bd86ff0b365d302faaf5e2/certificates.go#L331-L334. That explains why Caddy works fine after restarting, but breaks down much later.
I think the fix is to set the `HOME` environment variable to a writable directory (services having `$HOME` is weird, but I dunno). I'll report back in a week or so and tell whether this helped.
Please fill out the template as asked. How are you running Caddy, etc.? It matters; we can save time that way.
## 1. Environment
### 1a. Operating system and version
```
Debian 11
```
### 1b. Caddy version (run `caddy version` or paste commit SHA)
```
v2.8.4 h1:q3pe0wpBj1OcHFZ3n/1nl4V4bxBrYoSoab7rL9BMYNk=
```
## 2. Description
### 2a. What happens (briefly explain what is wrong)
In unknown circumstances (theory below), after some uptime, Caddy is unable to serve an up-to-date OCSP staple.
### 2b. Why it's a bug (if it's not obvious)
Caddy should always serve an up-to-date OCSP staple if OCSP stapling is enabled.
### 2c. Log output
```
Aug 17 17:00:07 subdomain.example.com systemd[1]: Starting foobar.service...
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"warn","ts":1723914008.7447097,"msg":"unable to determine directory for user configuration; falling back to current directory","error":"neither $XDG_CONFIG_HOME nor $HOME are defined"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7498074,"msg":"using adjacent Caddyfile"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.754352,"msg":"adapted config to JSON","adapter":"caddyfile"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7640152,"logger":"admin","msg":"admin endpoint started","address":"localhost:2019","enforce_origin":false,"origins":["//localhost:2019","//[::1]:2019","//127.0.0.1:2019"]}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7667134,"logger":"tls.cache.maintenance","msg":"started background certificate maintenance","cache":"0xc0000e9a80"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"warn","ts":1723914008.7858684,"logger":"tls","msg":"stapling OCSP","error":"unable to write OCSP staple file for [example.com subdomain.example.com wgh.example.com]: mkdir caddy: not a directory"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7866526,"logger":"http.auto_https","msg":"automatic HTTPS is completely disabled for server","server_name":"srv0"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7873237,"logger":"http.handlers.forward_proxy","msg":"Secret domain used to connect to proxy: subdomain.example.com"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7880044,"logger":"http","msg":"enabling HTTP/3 listener","addr":":8443"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.788521,"msg":"failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 7168 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes for details."}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7890544,"logger":"http.log","msg":"server running","name":"srv0","protocols":["h1","h2","h3"]}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"error","ts":1723914008.7896967,"msg":"unable to create folder for config autosave","dir":"caddy","error":"mkdir caddy: not a directory"}
Aug 17 17:00:08 subdomain.example.com systemd[1]: Started foobar.service.
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"warn","ts":1723914008.7882998,"logger":"tls","msg":"unable to get instance ID; storage clean stamps will be incomplete","error":"open caddy/instance.uuid: not a directory"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"error","ts":1723914008.800019,"logger":"tls","msg":"could not clean default/global storage","error":"unable to acquire storage_clean lock: creating lock file: open caddy/locks/storage_clean.lock: not a directory"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.800352,"logger":"tls","msg":"finished cleaning storage units"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7997715,"msg":"serving initial configuration"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"warn","ts":1723914008.8008697,"msg":"$HOME environment variable is empty - please fix; some assets might be stored in ./caddy"}
Aug 17 17:20:43 subdomain.example.com systemd[1]: Reloading foobar.service.
Aug 17 17:20:43 subdomain.example.com caddy[3488281]: {"level":"warn","ts":1723915243.7769516,"msg":"unable to determine directory for user configuration; falling back to current directory","error":"neither $XDG_CONFIG_HOME nor $HOME are defined"}
Aug 17 17:20:43 subdomain.example.com caddy[3488281]: {"level":"info","ts":1723915243.7871993,"msg":"using adjacent Caddyfile"}
Aug 17 17:20:43 subdomain.example.com caddy[3488281]: {"level":"info","ts":1723915243.7911592,"msg":"adapted config to JSON","adapter":"caddyfile"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.8025868,"logger":"admin.api","msg":"received request","method":"POST","host":"localhost:2019","uri":"/load","remote_ip":"127.0.0.1","remote_port":"50174","headers":{"Accept-Encoding":["gzip"],"Cache-Control":["must-revalidate"],"Content-Length":["818"],"Content-Type":["application/json"],"Origin":["http://localhost:2019"],"User-Agent":["Go-http-client/1.1"]}}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.80679,"logger":"admin","msg":"admin endpoint started","address":"localhost:2019","enforce_origin":false,"origins":["//localhost:2019","//[::1]:2019","//127.0.0.1:2019"]}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"warn","ts":1723915243.869151,"logger":"tls","msg":"stapling OCSP","error":"unable to write OCSP staple file for [example.com subdomain.example.com wgh.example.com]: mkdir caddy: not a directory"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.8700619,"logger":"http.auto_https","msg":"automatic HTTPS is completely disabled for server","server_name":"srv0"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.8708005,"logger":"http.handlers.forward_proxy","msg":"Secret domain used to connect to proxy: subdomain.example.com"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.8716269,"logger":"http","msg":"enabling HTTP/3 listener","addr":":8443"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.8720572,"logger":"http.log","msg":"server running","name":"srv0","protocols":["h1","h2","h3"]}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.872618,"logger":"http","msg":"servers shutting down with eternal grace period"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"error","ts":1723915243.8733037,"msg":"unable to create folder for config autosave","dir":"caddy","error":"mkdir caddy: not a directory"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.8747997,"logger":"admin.api","msg":"load complete"}
Aug 17 17:20:43 subdomain.example.com systemd[1]: Reloaded foobar.service.
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.9524853,"logger":"admin","msg":"stopped previous server","address":"localhost:2019"}
Aug 18 17:20:43 subdomain.example.com caddy[3487890]: {"level":"warn","ts":1724001643.881339,"logger":"tls","msg":"unable to get instance ID; storage clean stamps will be incomplete","error":"open caddy/instance.uuid: not a directory"}
Aug 18 17:20:43 subdomain.example.com caddy[3487890]: {"level":"error","ts":1724001643.895825,"logger":"tls","msg":"could not clean default/global storage","error":"unable to acquire storage_clean lock: creating lock file: open caddy/locks/storage_clean.lock: not a directory"}
Aug 18 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1724001643.8962796,"logger":"tls","msg":"finished cleaning storage units"}
Aug 18 19:00:08 subdomain.example.com caddy[3487890]: {"level":"error","ts":1724007608.9948008,"logger":"tls.cache.maintenance","msg":"stapling OCSP","identifiers":["example.com","subdomain.example.com","wgh.example.com"],"error":"unable to write OCSP staple file for [example.com subdomain.example.com wgh.example.com]: mkdir caddy: not a directory"}
Aug 18 19:38:30 subdomain.example.com caddy[3487890]: {"level":"info","ts":1724009910.9073286,"msg":"shutting down apps, then terminating","signal":"SIGTERM"}
```
### 2d. Workaround(s)
Ensure Caddy has a writable data directory (probably).
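If the service must keep `DynamicUser=true`, one way to provide that writable directory (my suggestion, not something from this thread) is systemd's `StateDirectory=`, which provisions a writable directory under `/var/lib` even for dynamic users, combined with pointing Caddy's XDG variables at it:

```
# Hypothetical drop-in, e.g. /etc/systemd/system/foobar.service.d/state.conf
# StateDirectory= creates a writable /var/lib/caddy even with DynamicUser=true;
# the XDG variables make Caddy store its data and autosaved config there.
[Service]
StateDirectory=caddy
Environment=XDG_DATA_HOME=/var/lib/caddy/data
Environment=XDG_CONFIG_HOME=/var/lib/caddy/config
```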
I have a theory. The bug is here, in this very line: https://github.com/caddyserver/certmagic/blob/3bad5b6bb595b09c14bd86ff0b365d302faaf5e2/maintain.go#L364
When `stapleOCSP` fails to write the staple file (which is not actually a hard error, since `cert.ocsp` is still updated), `RenewManagedCertificates` skips the rest of the steps and doesn't update the staple (see the `for certKey, update := range updated` loop).
When the server is started, the error is logged, but otherwise ignored: https://github.com/caddyserver/certmagic/blob/3bad5b6bb595b09c14bd86ff0b365d302faaf5e2/certificates.go#L331-L334. That explains why Caddy works fine after restarting, but breaks down much later.
## 3. Tutorial (minimal steps to reproduce the bug)
This tutorial might not be minimal.
`Caddyfile`:
```
{
	auto_https off
}

https://:8443 {
	tls /var/lib/acme/live/example.com/fullchain /var/lib/acme/live/example.com/privkey {
	}
	log {
		output stderr
	}
	route {
		forward_proxy {
			probe_resistance subdomain.example.com
			basic_auth <redacted>
			hide_ip
			hide_via
		}
		file_server
	}
}
```
Having an externally managed certificate might not be a required step, but that's the configuration I have. It's managed by `acmetool`, and obtained via Let's Encrypt.
Both the `Caddyfile` and the `caddy` executable are in `/opt/foobar`. All of them, including the directory, are owned by root.
systemd unit:
```
# /etc/systemd/system/foobar.service
[Unit]
After=network.target

[Service]
DynamicUser=true
SupplementaryGroups=ssl-cert
ExecStart=/opt/foobar/caddy run
ExecReload=/opt/foobar/caddy reload --force
WorkingDirectory=/opt/foobar/
Type=notify
PrivateTmp=true
ProtectSystem=full

[Install]
WantedBy=multi-user.target
```
With `DynamicUser=true`, this service will have no `$HOME` set. You could probably create a system user with a non-writable home (like `/` or `/nonexistent`) to the same effect. Note that, as mentioned above, the current directory is also non-writable, and it already contains a file named `caddy` (Caddy tries to use `./caddy/` as a fallback data directory).
Now, start Caddy. It will work fine.
Then check when the OCSP staple will expire: `openssl s_client -connect subdomain.example.com:8443 -status`.
This will likely be about a week from now, so you have to wait.
After that, try to open the site in Firefox, which is known to verify OCSP staples. It will return an error.
Why are you using `DynamicUser=true`? Our recommendation has always been to run with a `caddy` user. See https://caddyserver.com/docs/install#debian-ubuntu-raspbian and https://caddyserver.com/docs/running#linux-service
> Why are you using `DynamicUser=true`? Our recommendation has always been to run with a `caddy` user. See https://caddyserver.com/docs/install#debian-ubuntu-raspbian and https://caddyserver.com/docs/running#linux-service
I know. I just don't want to create a user upfront. (In any case, that's likely not the problem; the lack of a writable home directory is.)
I'm inclined to close this. It's not really a Caddy issue; it's more an invalid system configuration.
Perhaps a non-writable data dir should be a hard error on start? We get confusing issues otherwise, especially ones like "this appears to work, but will fail after a week".
We already do a check on the storage system:
Aha, I see it's called by `obtainCert` and `renewCert`, but in my configuration, Caddy doesn't obtain certificates.
Oh... that's a good point. And an unusual use case.
I think in order to properly fix this we should implement the TODO:
```
// TODO: this is not necessary every time; should only perform check once every so often for each storage, which may require some global state...
```
Because OCSP ops are a lot more common than cert renewals.
Basically, we should cache/remember the storage checks for... the lifetime of the process, I guess? Or at least a day or something like that.
Then this problem would be solved.
I'm running Caddy with TLS, reusing a certificate obtained externally with acmetool:
After a while, Firefox stops opening the site, citing a stale OCSP response. Restarting Caddy fixes the problem; reloading doesn't help.
I noticed the following error in the log file, which might be related to the problem: