caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com
Apache License 2.0

Stale OCSP response and "unable to write OCSP staple file" error #6524

Open WGH- opened 3 months ago

WGH- commented 3 months ago

I'm running Caddy with TLS, reusing a certificate obtained externally with acmetool:

```
{
        auto_https off
}
https://:8443 {
        tls /var/lib/acme/live/example.com/fullchain /var/lib/acme/live/example.com/privkey {
        }
        # ...
}
```

After a while, Firefox stops opening the site, citing a stale OCSP response. Restarting Caddy fixes the problem; reloading doesn't.

I noticed the following error in the log file, which might be related to the problem:

{"level":"error","ts":1723840048.373416,"logger":"tls.cache.maintenance","msg":"stapling OCSP","identifiers":["example.com"],"error":"unable to write OCSP staple file for [example.com]: mkdir caddy: not a directory"}
mholt commented 3 months ago

Thanks for opening an issue! We'll look into this.

It's not immediately clear to me what is going on, so I'll need your help to understand it better.

Ideally, we need to be able to reproduce the bug in the most minimal way possible using the latest version of Caddy. This allows us to write regression tests to verify the fix is working. If we can't reproduce it, then you'll have to test our changes for us until it's fixed -- and then we can't add test cases, either.

I've attached a template below that will help make this easier and faster! This will require some effort on your part -- please understand that we will be dedicating time to fix the bug you are reporting if you can just help us understand it and reproduce it easily.

This template will ask for some information you've already provided; that's OK, just fill it out the best you can. :+1: I've also included some helpful tips below the template. Feel free to let me know if you have any questions!

Thank you again for your report, we look forward to resolving it!

Template

## 1. Environment

### 1a. Operating system and version

```
paste here
```

### 1b. Caddy version (run `caddy version` or paste commit SHA)

This should be the latest version of Caddy:

```
paste here
```

## 2. Description

### 2a. What happens (briefly explain what is wrong)

### 2b. Why it's a bug (if it's not obvious)

### 2c. Log output

```
paste terminal output or logs here
```

### 2d. Workaround(s)

### 2e. Relevant links

## 3. Tutorial (minimal steps to reproduce the bug)

Instructions -- please heed them, otherwise we cannot help you (help us help you!)

  1. Environment: Please fill out your OS and Caddy versions, even if you don't think they are relevant. (They are always relevant.) If you built Caddy from source, provide the commit SHA and specify your exact Go version.

  2. Description: Describe at a high level what the bug is. What happens? Why is it a bug? Not all bugs are obvious, so convince readers that it's actually a bug.

    • 2c) Log output: Paste terminal output and/or complete logs in a code block. DO NOT REDACT INFORMATION except for credentials. Please enable debug and access logs.
    • 2d) Workaround: What are you doing to work around the problem in the meantime? This can help others who encounter the same problem, until we implement a fix.
    • 2e) Relevant links: Please link to any related issues, pull requests, docs, and/or discussion. This can add crucial context to your report.
  3. Tutorial: What are the minimum required specific steps someone needs to take in order to experience the same bug? Your goal here is to make sure that anyone else can have the same experience with the bug as you do. You are writing a tutorial, so make sure to carry it out yourself before posting it. Please:

    • Start with an empty config. Add only the lines/parameters that are absolutely required to reproduce the bug.
    • Do not run Caddy inside containers.
    • Run Caddy manually in your terminal; do not use systemd or other init systems.
    • If making HTTP requests, avoid web browsers. Use a simpler HTTP client instead, like curl.
    • Do not redact any information from your config (except credentials). Domain names are public knowledge and often necessary for quick resolution of an issue!
    • Note that ignoring this advice may result in delays, or even in your issue being closed. 😞 Only actionable issues are kept open, and if there is not enough information or clarity to reproduce the bug, then the report is not actionable.

Example of a tutorial:

Create a config file:

```
{ ... }
```

Open terminal and run Caddy:

```
$ caddy ...
```

Make an HTTP request:

```
$ curl ...
```

Notice that the result is ___ but it should be ___.
WGH- commented 3 months ago

I have a theory. The bug is here, on this very line: https://github.com/caddyserver/certmagic/blob/3bad5b6bb595b09c14bd86ff0b365d302faaf5e2/maintain.go#L364

When `stapleOCSP` fails to write the staple file -- which is not actually a hard error, since `cert.ocsp` is still updated -- `RenewManagedCertificates` skips the rest of the steps and doesn't update the staple (see the `for certKey, update := range updated` loop).

When the server is started, the same error is logged but otherwise ignored: https://github.com/caddyserver/certmagic/blob/3bad5b6bb595b09c14bd86ff0b365d302faaf5e2/certificates.go#L331-L334. That explains why Caddy works fine right after a restart but breaks down much later.
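To illustrate the control flow I mean, here's a tiny self-contained Go program (my own simplification with made-up types, not the actual certmagic code) showing how a failed staple write can strand a fresh in-memory staple:

```
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for certmagic's internals; this only
// demonstrates the control-flow problem, not the real code.
type cert struct{ staple string }

var errWrite = errors.New("unable to write OCSP staple file: mkdir caddy: not a directory")

// stapleOCSP refreshes the in-memory staple, then tries to persist it.
// The in-memory update succeeds even when the disk write fails.
func stapleOCSP(c *cert) error {
	c.staple = "fresh staple" // in-memory update happens first
	return errWrite           // ...but writing the staple file fails
}

func main() {
	cache := map[string]cert{"example.com": {staple: "stale staple"}}
	updated := map[string]cert{}

	for name, c := range cache {
		if err := stapleOCSP(&c); err != nil {
			fmt.Println("stapling OCSP:", err)
			continue // skips the commit below, stranding the fresh staple
		}
		updated[name] = c
	}
	for name, c := range updated { // never runs in the failure case
		cache[name] = c
	}
	fmt.Println("serving staple:", cache["example.com"].staple) // still stale
}
```

Running it prints the stale staple, because the commit loop over `updated` never runs for the entry whose write failed.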

WGH- commented 3 months ago

I think the fix is to set the HOME environment variable to a writable directory (services having $HOME is weird, but I dunno). I'll report back in a week or so with whether this helped.
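Something like this systemd drop-in should do it (untested on my side yet; the state directory name is just a placeholder):

```
# /etc/systemd/system/foobar.service.d/override.conf (hypothetical drop-in)
[Service]
# StateDirectory= makes systemd create /var/lib/foobar owned by the dynamic user
StateDirectory=foobar
# Point HOME (and thus Caddy's data directory fallback) at the writable directory
Environment=HOME=/var/lib/foobar
```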

francislavoie commented 3 months ago

Please fill out the template as asked. How are you running Caddy, etc.? It matters; we can save time that way.

WGH- commented 3 months ago

1. Environment

1a. Operating system and version

Debian 11

1b. Caddy version (run caddy version or paste commit SHA)

This should be the latest version of Caddy:

```
v2.8.4 h1:q3pe0wpBj1OcHFZ3n/1nl4V4bxBrYoSoab7rL9BMYNk=
```

2. Description

2a. What happens (briefly explain what is wrong)

Under circumstances that are not fully understood (theory below), after some uptime Caddy becomes unable to serve an up-to-date OCSP staple.

2b. Why it's a bug (if it's not obvious)

Caddy should always serve an up-to-date OCSP staple if OCSP stapling is enabled.

2c. Log output

```
Aug 17 17:00:07 subdomain.example.com systemd[1]: Starting foobar.service...
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"warn","ts":1723914008.7447097,"msg":"unable to determine directory for user configuration; falling back to current directory","error":"neither $XDG_CONFIG_HOME nor $HOME are defined"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7498074,"msg":"using adjacent Caddyfile"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.754352,"msg":"adapted config to JSON","adapter":"caddyfile"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7640152,"logger":"admin","msg":"admin endpoint started","address":"localhost:2019","enforce_origin":false,"origins":["//localhost:2019","//[::1]:2019","//127.0.0.1:2019"]}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7667134,"logger":"tls.cache.maintenance","msg":"started background certificate maintenance","cache":"0xc0000e9a80"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"warn","ts":1723914008.7858684,"logger":"tls","msg":"stapling OCSP","error":"unable to write OCSP staple file for [example.com subdomain.example.com wgh.example.com]: mkdir caddy: not a directory"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7866526,"logger":"http.auto_https","msg":"automatic HTTPS is completely disabled for server","server_name":"srv0"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7873237,"logger":"http.handlers.forward_proxy","msg":"Secret domain used to connect to proxy: subdomain.example.com"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7880044,"logger":"http","msg":"enabling HTTP/3 listener","addr":":8443"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.788521,"msg":"failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 7168 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes for details."}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7890544,"logger":"http.log","msg":"server running","name":"srv0","protocols":["h1","h2","h3"]}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"error","ts":1723914008.7896967,"msg":"unable to create folder for config autosave","dir":"caddy","error":"mkdir caddy: not a directory"}
Aug 17 17:00:08 subdomain.example.com systemd[1]: Started foobar.service.
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"warn","ts":1723914008.7882998,"logger":"tls","msg":"unable to get instance ID; storage clean stamps will be incomplete","error":"open caddy/instance.uuid: not a directory"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"error","ts":1723914008.800019,"logger":"tls","msg":"could not clean default/global storage","error":"unable to acquire storage_clean lock: creating lock file: open caddy/locks/storage_clean.lock: not a directory"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.800352,"logger":"tls","msg":"finished cleaning storage units"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7997715,"msg":"serving initial configuration"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"warn","ts":1723914008.8008697,"msg":"$HOME environment variable is empty - please fix; some assets might be stored in ./caddy"}
Aug 17 17:20:43 subdomain.example.com systemd[1]: Reloading foobar.service.
Aug 17 17:20:43 subdomain.example.com caddy[3488281]: {"level":"warn","ts":1723915243.7769516,"msg":"unable to determine directory for user configuration; falling back to current directory","error":"neither $XDG_CONFIG_HOME nor $HOME are defined"}
Aug 17 17:20:43 subdomain.example.com caddy[3488281]: {"level":"info","ts":1723915243.7871993,"msg":"using adjacent Caddyfile"}
Aug 17 17:20:43 subdomain.example.com caddy[3488281]: {"level":"info","ts":1723915243.7911592,"msg":"adapted config to JSON","adapter":"caddyfile"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.8025868,"logger":"admin.api","msg":"received request","method":"POST","host":"localhost:2019","uri":"/load","remote_ip":"127.0.0.1","remote_port":"50174","headers":{"Accept-Encoding":["gzip"],"Cache-Control":["must-revalidate"],"Content-Length":["818"],"Content-Type":["application/json"],"Origin":["http://localhost:2019"],"User-Agent":["Go-http-client/1.1"]}}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.80679,"logger":"admin","msg":"admin endpoint started","address":"localhost:2019","enforce_origin":false,"origins":["//localhost:2019","//[::1]:2019","//127.0.0.1:2019"]}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"warn","ts":1723915243.869151,"logger":"tls","msg":"stapling OCSP","error":"unable to write OCSP staple file for [example.com subdomain.example.com wgh.example.com]: mkdir caddy: not a directory"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.8700619,"logger":"http.auto_https","msg":"automatic HTTPS is completely disabled for server","server_name":"srv0"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.8708005,"logger":"http.handlers.forward_proxy","msg":"Secret domain used to connect to proxy: subdomain.example.com"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.8716269,"logger":"http","msg":"enabling HTTP/3 listener","addr":":8443"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.8720572,"logger":"http.log","msg":"server running","name":"srv0","protocols":["h1","h2","h3"]}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.872618,"logger":"http","msg":"servers shutting down with eternal grace period"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"error","ts":1723915243.8733037,"msg":"unable to create folder for config autosave","dir":"caddy","error":"mkdir caddy: not a directory"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.8747997,"logger":"admin.api","msg":"load complete"}
Aug 17 17:20:43 subdomain.example.com systemd[1]: Reloaded foobar.service.
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.9524853,"logger":"admin","msg":"stopped previous server","address":"localhost:2019"}
Aug 18 17:20:43 subdomain.example.com caddy[3487890]: {"level":"warn","ts":1724001643.881339,"logger":"tls","msg":"unable to get instance ID; storage clean stamps will be incomplete","error":"open caddy/instance.uuid: not a directory"}
Aug 18 17:20:43 subdomain.example.com caddy[3487890]: {"level":"error","ts":1724001643.895825,"logger":"tls","msg":"could not clean default/global storage","error":"unable to acquire storage_clean lock: creating lock file: open caddy/locks/storage_clean.lock: not a directory"}
Aug 18 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1724001643.8962796,"logger":"tls","msg":"finished cleaning storage units"}
Aug 18 19:00:08 subdomain.example.com caddy[3487890]: {"level":"error","ts":1724007608.9948008,"logger":"tls.cache.maintenance","msg":"stapling OCSP","identifiers":["example.com","subdomain.example.com","wgh.example.com"],"error":"unable to write OCSP staple file for [example.com subdomain.example.com wgh.example.com]: mkdir caddy: not a directory"}
Aug 18 19:38:30 subdomain.example.com caddy[3487890]: {"level":"info","ts":1724009910.9073286,"msg":"shutting down apps, then terminating","signal":"SIGTERM"}
```

2d. Workaround(s)

Ensure Caddy has a writable data directory (probably).

2e. Relevant links

See my theory in the comments above, pointing at https://github.com/caddyserver/certmagic/blob/3bad5b6bb595b09c14bd86ff0b365d302faaf5e2/maintain.go#L364 and https://github.com/caddyserver/certmagic/blob/3bad5b6bb595b09c14bd86ff0b365d302faaf5e2/certificates.go#L331-L334.

3. Tutorial (minimal steps to reproduce the bug)

This tutorial might not be minimal.

Caddyfile:

```
{
    auto_https off
}

https://:8443 {
    tls /var/lib/acme/live/example.com/fullchain /var/lib/acme/live/example.com/privkey {
    }
    log {
        output stderr
    }
    route {
        forward_proxy {
            probe_resistance subdomain.example.com
            basic_auth <redacted>
            hide_ip
            hide_via
        }
        file_server
    }
}
```

Having an externally managed certificate might not be a required step, but that's the configuration I have. The certificate is managed by acmetool and obtained from Let's Encrypt.

Both the Caddyfile and the caddy executable are in /opt/foobar. Both of them, as well as the directory itself, are owned by root.

systemd unit:

```
# /etc/systemd/system/foobar.service
[Unit]
After=network.target

[Service]
DynamicUser=true
SupplementaryGroups=ssl-cert
ExecStart=/opt/foobar/caddy run
ExecReload=/opt/foobar/caddy reload --force
WorkingDirectory=/opt/foobar/
Type=notify
PrivateTmp=true
ProtectSystem=full

[Install]
WantedBy=multi-user.target
```

With DynamicUser=true, this service has no $HOME set. You could probably create a system user with a non-writable home (like / or /nonexistent) to the same effect. Note that, as mentioned above, the current directory is also non-writable, and it already contains a file named caddy (Caddy tries to use ./caddy/ as a fallback data directory).

Now, start Caddy. It will work fine.

Then check when the OCSP staple will expire: `openssl s_client -connect subdomain.example.com:8443 -status`. Expiry will likely be about a week out, so you have to wait.
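For example, one way to pull out just the staple's status and validity window (the output labels can vary across OpenSSL versions):

```
$ echo | openssl s_client -connect subdomain.example.com:8443 -status 2>/dev/null \
    | grep -E 'OCSP Response Status|This Update|Next Update'
```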

After that, try to open the site in Firefox, which is known to verify OCSP staples. It will show an error.

francislavoie commented 3 months ago

Why are you using DynamicUser=true? Our recommendation has always been to run with a caddy user. See https://caddyserver.com/docs/install#debian-ubuntu-raspbian and https://caddyserver.com/docs/running#linux-service

WGH- commented 3 months ago

> Why are you using DynamicUser=true? Our recommendation has always been to run with a caddy user. See https://caddyserver.com/docs/install#debian-ubuntu-raspbian and https://caddyserver.com/docs/running#linux-service

I know; I just don't want to create a user upfront. (In any case, that's likely not the problem -- the lack of a writable home directory is.)

francislavoie commented 3 months ago

I'm inclined to close this. It's not really a Caddy issue; it's more an invalid system configuration.

WGH- commented 3 months ago

Perhaps a non-writable data directory should be a hard error on start? Otherwise we get confusing failure modes, especially ones like "this appears to work, but will fail after a week".

mholt commented 3 months ago

We already do a check on the storage system:

https://github.com/caddyserver/certmagic/blob/3bad5b6bb595b09c14bd86ff0b365d302faaf5e2/config.go#L1162-L1200

WGH- commented 3 months ago

Aha, I see it's called by `obtainCert` and `renewCert`, but in my configuration Caddy doesn't obtain certificates.

mholt commented 3 months ago

Oh... that's a good point. And an unusual use case.

I think in order to properly fix this we should implement the TODO:

```
// TODO: this is not necessary every time; should only perform check once every so often for each storage, which may require some global state...
```

That's because OCSP operations are a lot more common than cert renewals.

Basically, we should cache/remember the storage checks for... the lifetime of the process, I guess? Or at least a day or something like that.

Then this problem would be solved.
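Something like this sketch, perhaps (my rough illustration of the idea, not certmagic's actual API; `storageChecker`, `checkOnce`, and the string key are made up):

```
package main

import (
	"fmt"
	"sync"
	"time"
)

// Rough illustration of caching storage checks per storage instance.
// Not certmagic's actual code; names here are hypothetical.
type storageChecker struct {
	mu      sync.Mutex
	checked map[string]time.Time // keyed by storage identity
	ttl     time.Duration
}

// checkOnce runs check() at most once per ttl for a given storage key.
func (sc *storageChecker) checkOnce(key string, check func() error) error {
	sc.mu.Lock()
	last, ok := sc.checked[key]
	sc.mu.Unlock()
	if ok && time.Since(last) < sc.ttl {
		return nil // checked recently; skip the expensive test write
	}
	if err := check(); err != nil {
		return err // don't cache failures; retry on the next call
	}
	sc.mu.Lock()
	sc.checked[key] = time.Now()
	sc.mu.Unlock()
	return nil
}

func main() {
	sc := &storageChecker{checked: map[string]time.Time{}, ttl: 24 * time.Hour}
	check := func() error { fmt.Println("performing real storage check"); return nil }

	_ = sc.checkOnce("file_system:/var/lib/caddy", check) // runs the check
	_ = sc.checkOnce("file_system:/var/lib/caddy", check) // cached; skipped
}
```

With a day-long TTL (or process lifetime), OCSP maintenance would hit the real check rarely, while misconfigured storage would still surface on the first attempt.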