caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com
Apache License 2.0

Caddy is generating gigabytes of logs with: [INFO][FileStorage:*]Lock for * is stale #4448

Closed · mildred closed this issue 2 years ago

mildred commented 2 years ago

I have a Caddy server that is stuck in a loop, generating thousands of messages, seemingly as fast as it can, filling the hard drive with gigabytes of logs until the disk is full.

The log line is:

Nov 26 22:03:13 mail caddy[1072]: 2021/11/26 22:02:07 [INFO][FileStorage:/var/lib/caddy] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /var/lib/caddy/locks/issue_cert_autoconfig.test.webmsg.me.lock

All log lines are exactly the same, and I suspect hundreds of lines per second.

Running caddy v2.3.0 h1:fnrqJLa3G5vfxcxmOH/+kJOcunPLhSBnjgIvjXV/QTA= obtained with xcaddy build --with github.com/mholt/caddy-l4@master --output /usr/local/bin/caddy


edit: It also occurs in more recent versions of Caddy, and other log lines besides this one fill up the logs. In the configuration that causes this error, Caddy is misconfigured to obtain TLS certificates for domains that do not resolve to this machine, resulting in a perpetual failure to obtain the certificates.

1. Environment

1a. Operating system and version

Debian GNU/Linux 11 (bullseye), running a KVM guest (libvirt) on a bare-metal Fedora host hosted on consumer hardware

1b. Caddy version (run caddy version or paste commit SHA)

v2.4.6 h1:HGkGICFGvyrodcqOOclHKfvJC0qTU7vny/7FhYp9hNw=

1c. Go version (if building Caddy from source; run go version)

downloaded from https://caddyserver.com/api/download?os=linux&arch=amd64&p=github.com%2Fmholt%2Fcaddy-l4

2. Description

2a. What happens (briefly explain what is wrong)

Caddy generated hundreds of log lines at INFO level, filling up the machine's disk: more than 10 GB in a night.

2c. Log output

Too much to paste in full; see: https://github.com/caddyserver/caddy/issues/4448#issuecomment-985884595

2d. Workaround(s)

~None yet; tweak log rotation to at least be able to purge old logs more easily while keeping some of them.~

Cleaning up the locks directory, especially removing the *.lock.unlock files that have matching JSON *.lock files
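The cleanup above can be sketched as a small script. This is a hedged, unofficial workaround, not a fix from the Caddy team; it is demonstrated here on a throwaway directory, and you would point `LOCKDIR` at your real storage root (e.g. /var/lib/caddy/locks) to use it for real.

```shell
# Hedged cleanup sketch: remove each stale *.lock.unlock marker together with
# its matching *.lock file so the lock can be re-created cleanly.
# Demonstrated on a throwaway directory; the filenames are examples.
LOCKDIR=$(mktemp -d)
touch "$LOCKDIR/issue_cert_a.example.lock" \
      "$LOCKDIR/issue_cert_a.example.lock.unlock" \
      "$LOCKDIR/issue_cert_b.example.lock"   # b has no .unlock marker: keep it

for unlock in "$LOCKDIR"/*.lock.unlock; do
    [ -e "$unlock" ] || continue      # the glob matched nothing
    lock="${unlock%.unlock}"          # foo.lock.unlock -> foo.lock
    if [ -e "$lock" ]; then
        echo "removing stale pair: ${lock##*/}"
        rm -f "$lock" "$unlock"
    fi
done
ls "$LOCKDIR"   # only issue_cert_b.example.lock should remain
```

Only pairs are removed; a lone `*.lock` without an `.unlock` marker is left alone, since it may belong to a live acquisition.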

2e. Relevant links

N/A

3. Tutorial (minimal steps to reproduce the bug)

Run this shell script (as root to be able to bind port 443):

#!/bin/sh

set -ex

CADDY=$PWD/caddy
CONFIG=$PWD/caddy-bug.json
CADDYHOME=$PWD/caddy-bug
DOMAIN=nxdomain.test.webmsg.me

mkdir -p $CADDYHOME/locks
touch $CADDYHOME/locks/issue_cert_$DOMAIN.lock.unlock
echo '{"created":"2021-11-25T10:29:08.415498241+01:00","updated":"2021-11-25T11:02:59.818772441+01:00"}' >$CADDYHOME/locks/issue_cert_$DOMAIN.lock

if ! [ -e $CADDY ]; then
    curl "https://caddyserver.com/api/download?os=linux&arch=amd64" -o $CADDY
    chmod +x $CADDY
fi

cat <<CONFIG >$CONFIG
{
  "apps": {
    "http": {
      "servers": {
        "master": {
          "listen": [
            "127.0.0.1:443"
          ],
          "routes": [
            {
              "handle": [
                {
                  "body": "TEST",
                  "handler": "static_response",
                  "headers": {
                    "Content-Type": [
                      "text/plain; charset=utf-8"
                    ]
                  }
                }
              ],
              "match": [
                {
                  "host": [
                    "$DOMAIN"
                  ]
                }
              ]
            }
          ]
        }
      }
    }
  },
  "logging": {
    "logs": {
      "default": {
        "level": "INFO",
        "writer": {
          "output": "stdout"
        }
      }
    }
  },
  "storage": {
    "module": "file_system",
    "root": "$CADDYHOME"
  }
}
CONFIG

$CADDY run --watch --config $CONFIG
mholt commented 2 years ago

Thanks for opening an issue! We'll look into this.

It's not immediately clear to me what is going on, so I'll need your help to understand it better.

Ideally, we need to be able to reproduce the bug in the most minimal way possible. This allows us to write regression tests to verify the fix is working. If we can't reproduce it, then you'll have to test our changes for us until it's fixed -- and then we can't add test cases, either.

I've attached a template below that will help make this easier and faster! This will require some effort on your part -- please understand that we will be dedicating time to fix the bug you are reporting if you can just help us understand it and reproduce it easily.

This template will ask for some information you've already provided; that's OK, just fill it out the best you can. :+1: I've also included some helpful tips below the template. Feel free to let me know if you have any questions!

Thank you again for your report, we look forward to resolving it!

Template

## 1. Environment

### 1a. Operating system and version

```
paste here
```

### 1b. Caddy version (run `caddy version` or paste commit SHA)

```
paste here
```

### 1c. Go version (if building Caddy from source; run `go version`)

```
paste here
```

## 2. Description

### 2a. What happens (briefly explain what is wrong)

### 2b. Why it's a bug (if it's not obvious)

### 2c. Log output

```
paste terminal output or logs here
```

### 2d. Workaround(s)

### 2e. Relevant links

## 3. Tutorial (minimal steps to reproduce the bug)

Instructions -- please heed otherwise we cannot help you (help us help you!)

  1. Environment: Please fill out your OS and Caddy versions, even if you don't think they are relevant. (They are always relevant.) If you built Caddy from source, provide the commit SHA and specify your exact Go version.

  2. Description: Describe at a high level what the bug is. What happens? Why is it a bug? Not all bugs are obvious, so convince readers that it's actually a bug.

    • 2c) Log output: Paste terminal output and/or complete logs in a code block. DO NOT REDACT INFORMATION except for credentials.
    • 2d) Workaround: What are you doing to work around the problem in the meantime? This can help others who encounter the same problem, until we implement a fix.
    • 2e) Relevant links: Please link to any related issues, pull requests, docs, and/or discussion. This can add crucial context to your report.
  3. Tutorial: What are the minimum required specific steps someone needs to take in order to experience the same bug? Your goal here is to make sure that anyone else can have the same experience with the bug as you do. You are writing a tutorial, so make sure to carry it out yourself before posting it. Please:

    • Start with an empty config. Add only the lines/parameters that are absolutely required to reproduce the bug.
    • Do not run Caddy inside containers.
    • Run Caddy manually in your terminal; do not use systemd or other init systems.
    • If making HTTP requests, avoid web browsers. Use a simpler HTTP client instead, like curl.
    • Do not redact any information from your config (except credentials). Domain names are public knowledge and often necessary for quick resolution of an issue!
    • Note that ignoring this advice may result in delays, or even in your issue being closed. 😞 Only actionable issues are kept open, and if there is not enough information or clarity to reproduce the bug, then the report is not actionable.

Example of a tutorial:

Create a config file:

```
{ ... }
```

Open terminal and run Caddy:

```
$ caddy ...
```

Make an HTTP request:

```
$ curl ...
```

Notice that the result is ___ but it should be ___.
mildred commented 2 years ago

Well, this is an old version of Caddy, and the nature of the issue makes it very difficult to reproduce. I mostly filed a bug in case someone else stumbles upon it, so it can gather information in the future, or if I experience it again. I'm closing it for now until we can find a reproducer.

emilylange commented 2 years ago

I've seen this error before too. Did you (@mildred) run caddy in a cluster (multiple parallel instances using the same storage) or perhaps with a network filesystem?

mildred commented 2 years ago

No, it's running a classical mounted filesystem on a physical machine. The same setup on a VPS does not have the issue

edit: a VM (virt-manager) on a physical machine

mildred commented 2 years ago

By the way, I still have the issue with latest caddy v2.4.6 h1:HGkGICFGvyrodcqOOclHKfvJC0qTU7vny/7FhYp9hNw=

edit: I configured log rotation better, so I now have access to logs on a functioning host... I'll attach debug info next

mildred commented 2 years ago

The problem seems tied to a certificate maintenance task that cannot succeed...

The sequence of events:

emilylange commented 2 years ago

No, it's running a classical mounted filesystem on a physical machine. The same setup on a VPS does not have the issue

That's interesting and good to know. I've only ever seen this when running on top of a network filesystem.

By the way, I still have the issue with latest caddy v2.4.6 h1:HGkGICFGvyrodcqOOclHKfvJC0qTU7vny/7FhYp9hNw=

Wait. Are you still encountering hundreds of log entries per second right now? If so, try deleting the mentioned (un)lock file(s) manually. That should fix it, at least temporarily. In your case that would be:

rm /var/lib/caddy/locks/*

And posting the output of ls -l /var/lib/caddy/locks/ before executing the rm would be helpful too. (But only if you are getting spammed right now.)

mildred commented 2 years ago

Some locks seem to contain JSON, some seem to be empty, and some seem to contain zeros:

root@mail:/var/lib/caddy/locks# ls -la
total 60
drwx------  2 caddy caddy  4096 Dec  3 23:01 .
drwxr-xr-x+ 8 caddy caddy 12288 Dec  3 23:01 ..
-rw-r--r--  1 caddy caddy     0 Nov 10 08:52 issue_cert_autoconfig.mildred.fr.lock
-rw-r--r--  1 caddy caddy    98 Nov 25 11:02 issue_cert_autoconfig.test.webmsg.me.lock
-rw-r--r--  1 caddy caddy     0 Nov 25 11:03 issue_cert_autoconfig.test.webmsg.me.lock.unlock
-rw-r--r--  1 caddy caddy    98 Oct  1 18:45 issue_cert_autoconfig.webmsg.me.lock
-rw-r--r--  1 caddy caddy     0 Nov 25 11:02 issue_cert_dav.mildred.fr.lock
-rw-r--r--  1 caddy caddy     0 Nov 25 11:03 issue_cert_dav.mildred.fr.lock.unlock
-rw-r--r--  1 caddy caddy     0 Nov  9 16:46 issue_cert_jmap.mildred.fr.lock
-rw-r--r--  1 caddy caddy    98 Dec  3 23:27 issue_cert_jmap.test.webmsg.me.lock
-rw-r--r--  1 caddy caddy    98 Aug 12 09:23 issue_cert_jmap.webmsg.me.lock
-rw-r--r--  1 caddy caddy    98 Nov  9 16:37 issue_cert_mildred.fr.lock
-rw-r--r--  1 caddy caddy    98 Oct 18 19:47 issue_cert_mildred.webmsg.me.lock
-rw-r--r--  1 caddy caddy    98 Sep 30 19:49 issue_cert_roundcube.mildred.fr.lock
-rw-r--r--  1 caddy caddy    98 Dec  3 23:27 issue_cert_roundcube.test.webmsg.me.lock
-rw-r--r--  1 caddy caddy     0 Nov 25 11:03 issue_cert_test.webmsg.me.lock
-rw-r--r--  1 caddy caddy     0 Nov 25 11:03 issue_cert_test.webmsg.me.lock.unlock
-rw-r--r--  1 caddy caddy    98 Dec  3 23:27 issue_cert_webmail.mildred.fr.lock
-rw-r--r--  1 caddy caddy     0 May 28  2021 issue_cert_webmail.mildred.webmsg.me.lock
-rw-r--r--  1 caddy caddy     0 Nov 25 11:03 issue_cert_webmail.test.webmsg.me.lock
-rw-r--r--  1 caddy caddy     0 Nov 25 11:03 issue_cert_webmail.test.webmsg.me.lock.unlock
-rw-r--r--  1 caddy caddy    98 Oct  1 20:27 issue_cert_webmsg.me.lock
-rw-r--r--  1 caddy caddy    98 Oct 18 19:47 issue_cert_www.mildred.fr.lock
-rw-r--r--  1 caddy caddy     0 Nov 25 11:03 issue_cert_www.test.webmsg.me.lock
-rw-r--r--  1 caddy caddy     0 Nov 25 11:03 issue_cert_www.test.webmsg.me.lock.unlock
-rw-r--r--  1 caddy caddy     0 Sep  6 09:04 issue_cert_www.webmsg.me.lock
root@mail:/var/lib/caddy/locks# hexdump -C issue_cert_autoconfig.test.webmsg.me.lock
00000000  7b 22 63 72 65 61 74 65  64 22 3a 22 32 30 32 31  |{"created":"2021|
00000010  2d 31 31 2d 32 35 54 31  30 3a 32 39 3a 30 38 2e  |-11-25T10:29:08.|
00000020  34 31 35 34 39 38 32 34  31 2b 30 31 3a 30 30 22  |415498241+01:00"|
00000030  2c 22 75 70 64 61 74 65  64 22 3a 22 32 30 32 31  |,"updated":"2021|
00000040  2d 31 31 2d 32 35 54 31  31 3a 30 32 3a 35 39 2e  |-11-25T11:02:59.|
00000050  38 31 38 37 37 32 34 34  31 2b 30 31 3a 30 30 22  |818772441+01:00"|
00000060  7d 0a                                             |}.|
00000062
root@mail:/var/lib/caddy/locks# file *
issue_cert_autoconfig.mildred.fr.lock:            empty
issue_cert_autoconfig.test.webmsg.me.lock:        JSON data
issue_cert_autoconfig.test.webmsg.me.lock.unlock: empty
issue_cert_autoconfig.webmsg.me.lock:             data
issue_cert_dav.mildred.fr.lock:                   empty
issue_cert_dav.mildred.fr.lock.unlock:            empty
issue_cert_jmap.mildred.fr.lock:                  empty
issue_cert_jmap.test.webmsg.me.lock:              JSON data
issue_cert_jmap.webmsg.me.lock:                   data
issue_cert_mildred.fr.lock:                       data
issue_cert_mildred.webmsg.me.lock:                data
issue_cert_roundcube.mildred.fr.lock:             data
issue_cert_roundcube.test.webmsg.me.lock:         JSON data
issue_cert_test.webmsg.me.lock:                   empty
issue_cert_test.webmsg.me.lock.unlock:            empty
issue_cert_webmail.mildred.fr.lock:               JSON data
issue_cert_webmail.mildred.webmsg.me.lock:        empty
issue_cert_webmail.test.webmsg.me.lock:           empty
issue_cert_webmail.test.webmsg.me.lock.unlock:    empty
issue_cert_webmsg.me.lock:                        data
issue_cert_www.mildred.fr.lock:                   data
issue_cert_www.test.webmsg.me.lock:               empty
issue_cert_www.test.webmsg.me.lock.unlock:        empty
issue_cert_www.webmsg.me.lock:                    empty
root@mail:/var/lib/caddy/locks# hexdump -C issue_cert_autoconfig.webmsg.me.lock
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00                                             |..|
00000062
root@mail:/var/lib/caddy/locks# hexdump -C issue_cert_jmap.webmsg.me.lock
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00                                             |..|
00000062
root@mail:/var/lib/caddy/locks# hexdump -C issue_cert_www.mildred.fr.lock
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00                                             |..|
00000062
mildred commented 2 years ago

It could be that, because the logs filled up the disk, Caddy could not save the lock files properly, resulting in corrupted files
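A small, hedged illustration of that hypothesis (Linux-only; `/dev/full` is a stock device that always fails writes with ENOSPC): when a write fails for lack of space, the file has typically already been created, so an empty or truncated file is left behind, which would be consistent with the zero-byte and zero-filled locks listed above.

```shell
# Hedged illustration, not the actual certmagic code path: a write that hits
# ENOSPC fails after the destination file already exists, leaving it empty.
tmp=$(mktemp -d)
if ! echo '{"created":"2021-11-25T10:29:08+01:00"}' > /dev/full 2>/dev/null; then
    # The same failure on a real disk would leave the lock file like this:
    : > "$tmp/issue_cert_example.lock"
fi
wc -c < "$tmp/issue_cert_example.lock"   # 0 bytes, like the empty locks above
```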

mholt commented 2 years ago

Reopening because this discussion/investigation is interesting. Worth a second look.

mildred commented 2 years ago

Let's look at the time between when the system starts up and when Caddy starts generating those log lines (so I could perhaps make a small reproducer)...

The server has rebooted since, and with the disk having been full, it may not exhibit the same behaviour. I'll remove a log and restart. By the way, /var/log/syslog and /var/log/messages both contain the log lines above, but syslog is 11 GB while messages is only 6.5 GB

mildred commented 2 years ago

The bug clearly does not happen every time. Since this morning Caddy is quiet.

Trying to run Caddy outside of its systemd service, I cannot reproduce the problem. However, if I give Caddy the autosaved JSON config, I get the error. The difference between those two files is exclusively in the domain list.

Edit: the difference in the domain list (which I suspect is only relevant to apps.http.tls.certificates.automate) is:

@@ -506,12 +578,18 @@
       "certificates": {
         "@id": "tls.certificates",
         "automate": [
-          "webmsg.me",
-          "www.webmsg.me",
           "bourderie-1.webmsg.me",
           "www.bourderie-1.webmsg.me",
+          "dav.bourderie-1.webmsg.me",
           "jmap.bourderie-1.webmsg.me",
-          "dav.bourderie-1.webmsg.me"
+          "mildred.fr",
+          "www.mildred.fr",
+          "dav.mildred.fr",
+          "jmap.mildred.fr",
+          "test.webmsg.me",
+          "www.test.webmsg.me",
+          "dav.test.webmsg.me",
+          "jmap.test.webmsg.me"
         ]
       }
     }

Running Caddy for some time and extracting the list of domains triggering the offending log line yields the same domain every time:

root@mail:/home/postmaster# /usr/local/bin/caddy run --watch --config autosave.json 2>&1 | grep 'is stale' | cut -d: -f9- | uniq
02:59.818772441 +0100 CET); removing then retrying: /var/lib/caddy/locks/issue_cert_autoconfig.test.webmsg.me.lock
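A hedged variant of the pipeline above: instead of cutting timestamp fields, extract the lock path from each "is stale" line and count occurrences, to confirm that a single domain dominates the spam. Shown here against a small sample log (the filename and its contents are constructed from the log lines quoted in this issue).

```shell
# Build a sample log with two stale-lock lines and one unrelated line.
cat > /tmp/sample-caddy.log <<'EOF'
2021/12/04 23:16:09 [INFO][FileStorage:/var/lib/caddy] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: ...); removing then retrying: /var/lib/caddy/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:16:09 [INFO][FileStorage:/var/lib/caddy] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: ...); removing then retrying: /var/lib/caddy/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:16:10 INFO tls.obtain acquiring lock {"identifier": "autoconfig.test.webmsg.me"}
EOF

# Count how often each stale lock path appears, most frequent first.
grep 'is stale' /tmp/sample-caddy.log \
  | grep -o '/locks/[^ ]*\.lock' \
  | sort | uniq -c | sort -rn
# -> a single counted path: 2 /locks/issue_cert_autoconfig.test.webmsg.me.lock
```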

Looking at the lock, it's a legitimate JSON file:

root@mail:/home/postmaster# hexdump -C /var/lib/caddy/locks/issue_cert_autoconfig.test.webmsg.me.lock
00000000  7b 22 63 72 65 61 74 65  64 22 3a 22 32 30 32 31  |{"created":"2021|
00000010  2d 31 31 2d 32 35 54 31  30 3a 32 39 3a 30 38 2e  |-11-25T10:29:08.|
00000020  34 31 35 34 39 38 32 34  31 2b 30 31 3a 30 30 22  |415498241+01:00"|
00000030  2c 22 75 70 64 61 74 65  64 22 3a 22 32 30 32 31  |,"updated":"2021|
00000040  2d 31 31 2d 32 35 54 31  31 3a 30 32 3a 35 39 2e  |-11-25T11:02:59.|
00000050  38 31 38 37 37 32 34 34  31 2b 30 31 3a 30 30 22  |818772441+01:00"|
00000060  7d 0a                                             |}.|
00000062
mildred commented 2 years ago

Trying to come up with a minimal config file by removing parts one at a time, I arrive at this minimal configuration:

{
  "apps": {
    "http": {
      "servers": {
        "master": {
          "listen": [
            "127.0.0.1:443"
          ],
          "routes": [
            {
              "handle": [
                {
                  "body": "TEST",
                  "handler": "static_response",
                  "headers": {
                    "Content-Type": [
                      "text/plain; charset=utf-8"
                    ]
                  }
                }
              ],
              "match": [
                {
                  "host": [
                    "autoconfig.test.webmsg.me"
                  ]
                }
              ]
            }
          ]
        }
      }
    }
  },
  "logging": {
    "logs": {
      "default": {
        "level": "INFO",
        "writer": {
          "output": "stdout"
        }
      }
    }
  },
  "storage": {
    "module": "file_system",
    "root": "/var/lib/caddy"
  }
}

When I run Caddy with:

/usr/local/bin/caddy run --watch --config autosave-2.json

I get too many of:

2021/12/04 23:16:09 [INFO][FileStorage:/var/lib/caddy] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /var/lib/caddy/locks/issue_cert_autoconfig.test.webmsg.me.lock
mildred commented 2 years ago

Got a minimal reproducer:

mkdir -p /tmp/caddy-bug/locks
touch /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock.unlock
echo '{"created":"2021-11-25T10:29:08.415498241+01:00","updated":"2021-11-25T11:02:59.818772441+01:00"}' >/tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock

config:

{
  "apps": {
    "http": {
      "servers": {
        "master": {
          "listen": [
            "127.0.0.1:443"
          ],
          "routes": [
            {
              "handle": [
                {
                  "body": "TEST",
                  "handler": "static_response",
                  "headers": {
                    "Content-Type": [
                      "text/plain; charset=utf-8"
                    ]
                  }
                }
              ],
              "match": [
                {
                  "host": [
                    "autoconfig.test.webmsg.me"
                  ]
                }
              ]
            }
          ]
        }
      }
    }
  },
  "logging": {
    "logs": {
      "default": {
        "level": "INFO",
        "writer": {
          "output": "stdout"
        }
      }
    }
  },
  "storage": {
    "module": "file_system",
    "root": "/tmp/caddy-bug"
  }
}

Start Caddy and trigger bug:

/usr/local/bin/caddy run --watch --config config.json
mildred commented 2 years ago

edit: there is a problem with my reproducer; disregard the below

There is really something around this domain that causes the bug. It's an NXDOMAIN, but not all NXDOMAINs cause the bug

postmaster@mail:~/bug$ sudo ./trigger-bug.sh 
+ CADDY=/home/postmaster/bug/caddy
+ CONFIG=/home/postmaster/bug/caddy-bug.json
+ CADDYHOME=/home/postmaster/bug/caddy-bug
+ DOMAIN=autoconfig.test.webmsg.me
+ mkdir -p /home/postmaster/bug/caddy-bug/locks
+ touch /home/postmaster/bug/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock.unlock
+ echo {"created":"2021-11-25T10:29:08.415498241+01:00","updated":"2021-11-25T11:02:59.818772441+01:00"}
+ [ -e /home/postmaster/bug/caddy ]
+ cat
+ /home/postmaster/bug/caddy run --watch --config /home/postmaster/bug/caddy-bug.json
2021/12/04 22:42:23.356 INFO    using provided configuration    {"config_file": "/home/postmaster/bug/caddy-bug.json", "config_adapter": ""}
2021/12/04 22:42:23.358 INFO    redirected default logger   {"from": "stderr", "to": "stdout"}
2021/12/04 22:42:23.361 INFO    admin   admin endpoint started  {"address": "tcp/localhost:2019", "enforce_origin": false, "origins": ["localhost:2019", "[::1]:2019", "127.0.0.1:2019"]}
2021/12/04 22:42:23.363 INFO    http    server is listening only on the HTTPS port but has no TLS connection policies; adding one to enable TLS {"server_name": "master", "https_port": 443}
2021/12/04 22:42:23.363 INFO    http    enabling automatic HTTP->HTTPS redirects    {"server_name": "master"}
2021/12/04 22:42:23.363 INFO    tls.cache.maintenance   started background certificate maintenance  {"cache": "0xc00030fb90"}
2021/12/04 22:42:23.364 INFO    http    enabling automatic TLS certificate management   {"domains": ["autoconfig.test.webmsg.me"]}
2021/12/04 22:42:23.366 INFO    autosaved config (load with --resume flag)  {"file": "/root/.config/caddy/autosave.json"}
2021/12/04 22:42:23.366 INFO    serving initial configuration
2021/12/04 22:42:23.366 INFO    watcher watching config file for changes    {"config_file": "/home/postmaster/bug/caddy-bug.json"}
2021/12/04 22:42:23.368 INFO    tls cleaning storage unit   {"description": "FileStorage:/tmp/caddy-bug"}
2021/12/04 22:42:23.368 INFO    tls finished cleaning storage units
2021/12/04 22:42:23.369 INFO    tls.obtain  acquiring lock  {"identifier": "autoconfig.test.webmsg.me"}
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock

With another NXDOMAIN:

postmaster@mail:~/bug$ sudo ./trigger-bug.sh 
+ CADDY=/home/postmaster/bug/caddy
+ CONFIG=/home/postmaster/bug/caddy-bug.json
+ CADDYHOME=/home/postmaster/bug/caddy-bug
+ DOMAIN=nxdomain.test.webmsg.me
+ mkdir -p /home/postmaster/bug/caddy-bug/locks
+ touch /home/postmaster/bug/caddy-bug/locks/issue_cert_nxdomain.test.webmsg.me.lock.unlock
+ echo {"created":"2021-11-25T10:29:08.415498241+01:00","updated":"2021-11-25T11:02:59.818772441+01:00"}
+ [ -e /home/postmaster/bug/caddy ]
+ cat
+ /home/postmaster/bug/caddy run --watch --config /home/postmaster/bug/caddy-bug.json
2021/12/04 22:42:50.845 INFO    using provided configuration    {"config_file": "/home/postmaster/bug/caddy-bug.json", "config_adapter": ""}
2021/12/04 22:42:50.847 INFO    redirected default logger   {"from": "stderr", "to": "stdout"}
2021/12/04 22:42:50.869 INFO    admin   admin endpoint started  {"address": "tcp/localhost:2019", "enforce_origin": false, "origins": ["[::1]:2019", "127.0.0.1:2019", "localhost:2019"]}
2021/12/04 22:42:50.870 INFO    tls.cache.maintenance   started background certificate maintenance  {"cache": "0xc0002b8000"}
2021/12/04 22:42:50.871 INFO    http    server is listening only on the HTTPS port but has no TLS connection policies; adding one to enable TLS {"server_name": "master", "https_port": 443}
2021/12/04 22:42:50.871 INFO    http    enabling automatic HTTP->HTTPS redirects    {"server_name": "master"}
2021/12/04 22:42:50.872 INFO    tls cleaning storage unit   {"description": "FileStorage:/tmp/caddy-bug"}
2021/12/04 22:42:50.872 INFO    tls finished cleaning storage units
2021/12/04 22:42:50.873 INFO    http    enabling automatic TLS certificate management   {"domains": ["nxdomain.test.webmsg.me"]}
2021/12/04 22:42:50.875 INFO    tls.obtain  acquiring lock  {"identifier": "nxdomain.test.webmsg.me"}
2021/12/04 22:42:50.876 INFO    autosaved config (load with --resume flag)  {"file": "/root/.config/caddy/autosave.json"}
2021/12/04 22:42:50.876 INFO    serving initial configuration
2021/12/04 22:42:50.876 INFO    watcher watching config file for changes    {"config_file": "/home/postmaster/bug/caddy-bug.json"}
2021/12/04 22:42:51.418 INFO    tls.obtain  lock acquired   {"identifier": "nxdomain.test.webmsg.me"}
2021/12/04 22:42:51.426 INFO    tls.issuance.acme   waiting on internal rate limiter    {"identifiers": ["nxdomain.test.webmsg.me"], "ca": "https://acme-v02.api.letsencrypt.org/directory", "account": ""}
2021/12/04 22:42:51.427 INFO    tls.issuance.acme   done waiting on internal rate limiter   {"identifiers": ["nxdomain.test.webmsg.me"], "ca": "https://acme-v02.api.letsencrypt.org/directory", "account": ""}
2021/12/04 22:42:54.259 INFO    tls.issuance.acme.acme_client   trying to solve challenge   {"identifier": "nxdomain.test.webmsg.me", "challenge_type": "http-01", "ca": "https://acme-v02.api.letsencrypt.org/directory"}
2021/12/04 22:42:55.346 ERROR   tls.issuance.acme.acme_client   challenge failed    {"identifier": "nxdomain.test.webmsg.me", "challenge_type": "http-01", "problem": {"type": "urn:ietf:params:acme:error:dns", "title": "", "detail": "DNS problem: NXDOMAIN looking up A for nxdomain.test.webmsg.me - check that a DNS record exists for this domain", "instance": "", "subproblems": []}}
2021/12/04 22:42:55.348 ERROR   tls.issuance.acme.acme_client   validating authorization    {"identifier": "nxdomain.test.webmsg.me", "problem": {"type": "urn:ietf:params:acme:error:dns", "title": "", "detail": "DNS problem: NXDOMAIN looking up A for nxdomain.test.webmsg.me - check that a DNS record exists for this domain", "instance": "", "subproblems": []}, "order": "https://acme-v02.api.letsencrypt.org/acme/order/307839760/44777106950", "attempt": 1, "max_attempts": 3}
2021/12/04 22:42:57.050 INFO    tls.issuance.acme.acme_client   trying to solve challenge   {"identifier": "nxdomain.test.webmsg.me", "challenge_type": "tls-alpn-01", "ca": "https://acme-v02.api.letsencrypt.org/directory"}
2021/12/04 22:42:57.761 ERROR   tls.issuance.acme.acme_client   challenge failed    {"identifier": "nxdomain.test.webmsg.me", "challenge_type": "tls-alpn-01", "problem": {"type": "urn:ietf:params:acme:error:dns", "title": "", "detail": "DNS problem: NXDOMAIN looking up A for nxdomain.test.webmsg.me - check that a DNS record exists for this domain", "instance": "", "subproblems": []}}
2021/12/04 22:42:57.765 ERROR   tls.issuance.acme.acme_client   validating authorization    {"identifier": "nxdomain.test.webmsg.me", "problem": {"type": "urn:ietf:params:acme:error:dns", "title": "", "detail": "DNS problem: NXDOMAIN looking up A for nxdomain.test.webmsg.me - check that a DNS record exists for this domain", "instance": "", "subproblems": []}, "order": "https://acme-v02.api.letsencrypt.org/acme/order/307839760/44777112450", "attempt": 2, "max_attempts": 3}
2021/12/04 22:43:00.356 ERROR   tls.obtain  could not get certificate from issuer   {"identifier": "nxdomain.test.webmsg.me", "issuer": "acme-v02.api.letsencrypt.org-directory", "error": "[nxdomain.test.webmsg.me] solving challenges: nxdomain.test.webmsg.me: no solvers available for remaining challenges (configured=[http-01 tls-alpn-01] offered=[http-01 dns-01 tls-alpn-01] remaining=[dns-01]) (order=https://acme-v02.api.letsencrypt.org/acme/order/307839760/44777117970) (ca=https://acme-v02.api.letsencrypt.org/directory)"}
2021/12/04 22:43:00.360 WARN    tls.issuance.zerossl    missing email address for ZeroSSL; it is strongly recommended to set one for next time
^C2021/12/04 22:43:00.411   INFO    shutting down   {"signal": "SIGINT"}
2021/12/04 22:43:00.412 WARN    exiting; byeee!! 👋  {"signal": "SIGINT"}
2021/12/04 22:43:00.417 INFO    tls.cache.maintenance   stopped background certificate maintenance  {"cache": "0xc0002b8000"}
2021/12/04 22:43:00.418 ERROR   tls.obtain  could not get certificate from issuer   {"identifier": "nxdomain.test.webmsg.me", "issuer": "acme.zerossl.com-v2-DV90", "error": "account pre-registration callback: performing EAB credentials request: Post \"https://api.zerossl.com/acme/eab-credentials-email\": context canceled"}
2021/12/04 22:43:00.420 ERROR   tls.obtain  will retry  {"error": "[nxdomain.test.webmsg.me] Obtain: account pre-registration callback: performing EAB credentials request: Post \"https://api.zerossl.com/acme/eab-credentials-email\": context canceled", "attempt": 1, "retrying_in": 60, "elapsed": 8.999581352, "max_duration": 2592000}
2021/12/04 22:43:00.420 INFO    tls.obtain  releasing lock  {"identifier": "nxdomain.test.webmsg.me"}
2021/12/04 22:43:00.421 INFO    admin   stopped previous server {"address": "tcp/localhost:2019"}
2021/12/04 22:43:00.421 INFO    shutdown complete   {"signal": "SIGINT", "exit_code": 0}
mildred commented 2 years ago

Got an easy reproducer, see first comment.

I believe I ran out of disk space at some point, which caused the bad lock state. I have since increased the VM disk size, but the bad state was kept in the lock files.

edit: The reason it did not happen every time is that the offending domain name is added later via the admin API by an external script that needs to run, or when Caddy restores a config containing the offending domain.

francislavoie commented 2 years ago

Okay so it seems to me like there's an infinite loop that needs fixing.

I think it's in certmagic, not Caddy, actually. If you go searching for that error message in the code, you should be able to find it. (I'm away for the weekend, or I'd take a closer look myself right now.)

mildred commented 2 years ago

I didn't find the message in Caddy's code; it's probably in a dependency, yes.

francislavoie commented 2 years ago

It's in https://github.com/caddyserver/certmagic

mholt commented 2 years ago

Thanks for the added details, I will circle back around to this when I get a chance! (or someone else can beat me to it, of course)

mholt commented 2 years ago

We aren't checking the error from removeLockFile() here after that log message is emitted:

https://github.com/caddyserver/certmagic/blob/9b2c6ac88468351d4cd3aa9b3f378b6da7be06e4/filestorage.go#L163-L168

So it's possible that the removal is failing and causing an infinite loop.

However, the removeLockFile() function creates those .unlock files you reported earlier. If one of those is left hanging around, we have another problem: its presence signals that something else is currently deleting the lock, so we just return right away and iterate the loop. That is probably what causes the infinite loop in this case.

It looks like in the original implementation of this atomic unlocking, the point of it being atomic was to clean up empty dirs after removing a lock, but now the caller does that in a separate function, CleanStorage(). So I wonder if we even need atomic unlocking anymore. We might be able to simply do os.Remove() and be ok 99.99~% of the time.

I do not know of a way to do an atomic remove that doesn't involve creating more state that can be interrupted, as was the case here. It's rare, yes; but a lingering .unlock file is the same as a lingering .lock file...

Basically, I'm thinking it might be best to just change the "atomic unlock" to a regular os.Remove() call. Yes, this does mean that if two instances see the stale lock at the exact same time, and one is able to delete it AND create the new one before the other one is able to delete it, then the new lock will be deleted. I view this as probably even rarer than the original bug, and while it would result in two instances doing something at the same time, that's less detrimental for certificates than an infinite loop is.

Thoughts?

mholt commented 2 years ago

I went ahead and committed a fix that I hope will work. @mildred can you give it a try? The mutual exclusion guarantees are weaker now in the presence of stale locks, but I suspect races on those will be rarer than the infinite looping you were experiencing (and less problematic too).