*Closed — @mildred closed this issue 2 years ago.*
Thanks for opening an issue! We'll look into this.
It's not immediately clear to me what is going on, so I'll need your help to understand it better.
Ideally, we need to be able to reproduce the bug in the most minimal way possible. This allows us to write regression tests to verify the fix is working. If we can't reproduce it, then you'll have to test our changes for us until it's fixed -- and then we can't add test cases, either.
I've attached a template below that will help make this easier and faster! This will require some effort on your part -- please understand that we will be dedicating time to fix the bug you are reporting if you can just help us understand it and reproduce it easily.
This template will ask for some information you've already provided; that's OK, just fill it out the best you can. :+1: I've also included some helpful tips below the template. Feel free to let me know if you have any questions!
Thank you again for your report, we look forward to resolving it!
## 1. Environment
### 1a. Operating system and version
```
paste here
```
### 1b. Caddy version (run `caddy version` or paste commit SHA)
```
paste here
```
### 1c. Go version (if building Caddy from source; run `go version`)
```
paste here
```
## 2. Description
### 2a. What happens (briefly explain what is wrong)
### 2b. Why it's a bug (if it's not obvious)
### 2c. Log output
```
paste terminal output or logs here
```
### 2d. Workaround(s)
### 2e. Relevant links
## 3. Tutorial (minimal steps to reproduce the bug)
Environment: Please fill out your OS and Caddy versions, even if you don't think they are relevant. (They are always relevant.) If you built Caddy from source, provide the commit SHA and specify your exact Go version.
Description: Describe at a high level what the bug is. What happens? Why is it a bug? Not all bugs are obvious, so convince readers that it's actually a bug.
Tutorial: What are the minimum required specific steps someone needs to take in order to experience the same bug? Your goal here is to make sure that anyone else can have the same experience with the bug as you do. You are writing a tutorial, so make sure to carry it out yourself before posting it. Please prefer `curl` for demonstrating HTTP requests where possible.

Example of a tutorial:

1. Create a config file:
   ```
   { ... }
   ```
2. Open a terminal and run Caddy:
   ```
   $ caddy ...
   ```
3. Make an HTTP request:
   ```
   $ curl ...
   ```
4. Notice that the result is ___ but it should be ___.
Well, this is an old version of caddy, and the nature of the issue makes it very difficult to reproduce. I mostly filed a bug in case someone else stumbled upon it, so it can gather information in the future, or if I experience it again. I'm closing it for now until we can find a reproducer.
I've seen this error before too. Did you (@mildred) run caddy in a cluster (multiple parallel instances using the same storage) or perhaps with a network filesystem?
No, it's running a classical mounted filesystem on a physical machine. The same setup on a VPS does not have the issue
edit: a VM (virt-manager) on a physical machine
By the way, I still have the issue with latest caddy v2.4.6 h1:HGkGICFGvyrodcqOOclHKfvJC0qTU7vny/7FhYp9hNw=
edit: I configured log rotation better, so I get access to logs on a functioning host... I'll attach debug info next.
The problem seems tied to a certificate maintenance task that cannot succeed...
The sequence of events: an external script updates `apps.http.tls.certificates.automate` via the admin API and adds some domains to the list (generated from a database). In the list there is a test domain for which getting the certificate will never succeed:

```
Nov 28 05:28:32 mail caddy[1072]: {"level":"info","ts":1638073712.8930051,"logger":"tls.cache.maintenance","msg":"started background certificate maintenance","cache":"0xc00031bdc0"}
Nov 28 05:28:32 mail caddy[1072]: {"level":"info","ts":1638073712.908584,"logger":"tls.obtain","msg":"acquiring lock","identifier":"mildred.fr"}
Nov 28 05:28:32 mail caddy[1072]: {"level":"error","ts":1638073712.9096868,"logger":"tls","msg":"job failed","error":"mildred.fr: obtaining certificate: unable to acquire lock 'issue_cert_mildred.fr': decoding lockfile contents: invalid character '\\x00' looking for beginning of value"}
Nov 28 05:28:32 mail caddy[1072]: {"level":"info","ts":1638073712.915032,"logger":"tls.obtain","msg":"acquiring lock","identifier":"dav.mildred.fr"}
Nov 28 05:28:32 mail caddy[1072]: {"level":"error","ts":1638073712.915886,"logger":"tls","msg":"job failed","error":"dav.mildred.fr: obtaining certificate: unable to acquire lock 'issue_cert_dav.mildred.fr': decoding lockfile contents: EOF"}
Nov 28 05:28:32 mail caddy[1072]: {"level":"info","ts":1638073712.945054,"logger":"tls.cache.maintenance","msg":"stopped background certificate maintenance","cache":"0xc000312000"}
Nov 28 05:28:32 mail caddy[1072]: {"level":"info","ts":1638073712.9469566,"msg":"autosaved config (load with --resume flag)","file":"/var/lib/caddy/.config/caddy/autosave.json"}
Nov 28 05:28:33 mail caddy[1072]: 2021/11/28 05:28:33 [INFO][FileStorage:/var/lib/caddy] Lock for 'issue_cert_dav.test.webmsg.me' is stale (created: 2021-11-27 15:32:02.49914457 +0100 CET, last update: 2021-11-27 15:36:32.026606202 +0100 CET); removing then retrying: /var/lib/caddy/locks/issue_cert_dav.test.webmsg.me.lock
```
> No, it's running a classical mounted filesystem on a physical machine. The same setup on a VPS does not have the issue

That's interesting and good to know. I've only ever seen this when running on top of a network filesystem.
> By the way, I still have the issue with latest caddy
> v2.4.6 h1:HGkGICFGvyrodcqOOclHKfvJC0qTU7vny/7FhYp9hNw=

Wait. Are you still encountering hundreds of log entries per second right now? If so, try deleting the mentioned (un)lock file(s) manually. That should fix it, at least temporarily. In your case that would be:

```
rm /var/lib/caddy/locks/*
```

Posting the output of `ls -l /var/lib/caddy/locks/` before executing the `rm` would be helpful too. (But only if you are getting spammed right now.)
Some locks seem to contain JSON, some seem empty, and some seem to contain zeros:
```
root@mail:/var/lib/caddy/locks# ls -la
total 60
drwx------ 2 caddy caddy 4096 Dec 3 23:01 .
drwxr-xr-x+ 8 caddy caddy 12288 Dec 3 23:01 ..
-rw-r--r-- 1 caddy caddy 0 Nov 10 08:52 issue_cert_autoconfig.mildred.fr.lock
-rw-r--r-- 1 caddy caddy 98 Nov 25 11:02 issue_cert_autoconfig.test.webmsg.me.lock
-rw-r--r-- 1 caddy caddy 0 Nov 25 11:03 issue_cert_autoconfig.test.webmsg.me.lock.unlock
-rw-r--r-- 1 caddy caddy 98 Oct 1 18:45 issue_cert_autoconfig.webmsg.me.lock
-rw-r--r-- 1 caddy caddy 0 Nov 25 11:02 issue_cert_dav.mildred.fr.lock
-rw-r--r-- 1 caddy caddy 0 Nov 25 11:03 issue_cert_dav.mildred.fr.lock.unlock
-rw-r--r-- 1 caddy caddy 0 Nov 9 16:46 issue_cert_jmap.mildred.fr.lock
-rw-r--r-- 1 caddy caddy 98 Dec 3 23:27 issue_cert_jmap.test.webmsg.me.lock
-rw-r--r-- 1 caddy caddy 98 Aug 12 09:23 issue_cert_jmap.webmsg.me.lock
-rw-r--r-- 1 caddy caddy 98 Nov 9 16:37 issue_cert_mildred.fr.lock
-rw-r--r-- 1 caddy caddy 98 Oct 18 19:47 issue_cert_mildred.webmsg.me.lock
-rw-r--r-- 1 caddy caddy 98 Sep 30 19:49 issue_cert_roundcube.mildred.fr.lock
-rw-r--r-- 1 caddy caddy 98 Dec 3 23:27 issue_cert_roundcube.test.webmsg.me.lock
-rw-r--r-- 1 caddy caddy 0 Nov 25 11:03 issue_cert_test.webmsg.me.lock
-rw-r--r-- 1 caddy caddy 0 Nov 25 11:03 issue_cert_test.webmsg.me.lock.unlock
-rw-r--r-- 1 caddy caddy 98 Dec 3 23:27 issue_cert_webmail.mildred.fr.lock
-rw-r--r-- 1 caddy caddy 0 May 28 2021 issue_cert_webmail.mildred.webmsg.me.lock
-rw-r--r-- 1 caddy caddy 0 Nov 25 11:03 issue_cert_webmail.test.webmsg.me.lock
-rw-r--r-- 1 caddy caddy 0 Nov 25 11:03 issue_cert_webmail.test.webmsg.me.lock.unlock
-rw-r--r-- 1 caddy caddy 98 Oct 1 20:27 issue_cert_webmsg.me.lock
-rw-r--r-- 1 caddy caddy 98 Oct 18 19:47 issue_cert_www.mildred.fr.lock
-rw-r--r-- 1 caddy caddy 0 Nov 25 11:03 issue_cert_www.test.webmsg.me.lock
-rw-r--r-- 1 caddy caddy 0 Nov 25 11:03 issue_cert_www.test.webmsg.me.lock.unlock
-rw-r--r-- 1 caddy caddy 0 Sep 6 09:04 issue_cert_www.webmsg.me.lock
root@mail:/var/lib/caddy/locks# hexdump -C issue_cert_autoconfig.test.webmsg.me.lock
00000000  7b 22 63 72 65 61 74 65 64 22 3a 22 32 30 32 31  |{"created":"2021|
00000010  2d 31 31 2d 32 35 54 31 30 3a 32 39 3a 30 38 2e  |-11-25T10:29:08.|
00000020  34 31 35 34 39 38 32 34 31 2b 30 31 3a 30 30 22  |415498241+01:00"|
00000030  2c 22 75 70 64 61 74 65 64 22 3a 22 32 30 32 31  |,"updated":"2021|
00000040  2d 31 31 2d 32 35 54 31 31 3a 30 32 3a 35 39 2e  |-11-25T11:02:59.|
00000050  38 31 38 37 37 32 34 34 31 2b 30 31 3a 30 30 22  |818772441+01:00"|
00000060  7d 0a                                            |}.|
00000062
root@mail:/var/lib/caddy/locks# file *
issue_cert_autoconfig.mildred.fr.lock: empty
issue_cert_autoconfig.test.webmsg.me.lock: JSON data
issue_cert_autoconfig.test.webmsg.me.lock.unlock: empty
issue_cert_autoconfig.webmsg.me.lock: data
issue_cert_dav.mildred.fr.lock: empty
issue_cert_dav.mildred.fr.lock.unlock: empty
issue_cert_jmap.mildred.fr.lock: empty
issue_cert_jmap.test.webmsg.me.lock: JSON data
issue_cert_jmap.webmsg.me.lock: data
issue_cert_mildred.fr.lock: data
issue_cert_mildred.webmsg.me.lock: data
issue_cert_roundcube.mildred.fr.lock: data
issue_cert_roundcube.test.webmsg.me.lock: JSON data
issue_cert_test.webmsg.me.lock: empty
issue_cert_test.webmsg.me.lock.unlock: empty
issue_cert_webmail.mildred.fr.lock: JSON data
issue_cert_webmail.mildred.webmsg.me.lock: empty
issue_cert_webmail.test.webmsg.me.lock: empty
issue_cert_webmail.test.webmsg.me.lock.unlock: empty
issue_cert_webmsg.me.lock: data
issue_cert_www.mildred.fr.lock: data
issue_cert_www.test.webmsg.me.lock: empty
issue_cert_www.test.webmsg.me.lock.unlock: empty
issue_cert_www.webmsg.me.lock: empty
root@mail:/var/lib/caddy/locks# hexdump -C issue_cert_autoconfig.webmsg.me.lock
00000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
*
00000060  00 00                                            |..|
00000062
root@mail:/var/lib/caddy/locks# hexdump -C issue_cert_jmap.webmsg.me.lock
00000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
*
00000060  00 00                                            |..|
00000062
root@mail:/var/lib/caddy/locks# hexdump -C issue_cert_www.mildred.fr.lock
00000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
*
00000060  00 00                                            |..|
00000062
```
It could be that, because the logs filled the disk, Caddy could not save the lock files properly, resulting in corrupted files.
Reopening because this discussion/investigation is interesting. Worth a second look.
Let's look at the time between when the system starts up and when Caddy starts generating those log lines (so I can perhaps make a small reproducer):

```
Dec 3 23:00:53 mail kernel: [ 0.000000] Linux version 5.10.0-9-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.70-1 (2021-09-30)
```

Then (around 57% of the file):

```
Dec 4 00:06:12 mail caddy[913]: 2021/12/04 00:00:51 [INFO][FileStorage:/var/lib/caddy] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /var/lib/caddy/locks/issue_cert_autoconfig.test.webmsg.me.lock
```

The server has rebooted since, and with the disk being full it may not exhibit the same behaviour. I'll remove a log and restart. By the way, /var/log/syslog and /var/log/messages both contain the log lines above, but syslog is 11G while messages is only 6.5G.
The bug clearly does not happen every time. Since this morning, Caddy has been quiet.
Trying to run Caddy outside of its systemd service, I cannot reproduce the problem. However, if I give Caddy the autosaved JSON config, I get the error. The difference between those two config files is exclusively in the domain list (edit: I suspect only the `apps.http.tls.certificates.automate` list is relevant):
```diff
@@ -506,12 +578,18 @@
       "certificates": {
         "@id": "tls.certificates",
         "automate": [
-          "webmsg.me",
-          "www.webmsg.me",
           "bourderie-1.webmsg.me",
           "www.bourderie-1.webmsg.me",
+          "dav.bourderie-1.webmsg.me",
           "jmap.bourderie-1.webmsg.me",
-          "dav.bourderie-1.webmsg.me"
+          "mildred.fr",
+          "www.mildred.fr",
+          "dav.mildred.fr",
+          "jmap.mildred.fr",
+          "test.webmsg.me",
+          "www.test.webmsg.me",
+          "dav.test.webmsg.me",
+          "jmap.test.webmsg.me"
         ]
       }
     }
```
Running Caddy for some time and extracting the list of domains triggering the offending log line yields the same domain every time:

```
root@mail:/home/postmaster# /usr/local/bin/caddy run --watch --config autosave.json 2>&1 | grep 'is stale' | cut -d: -f9- | uniq
02:59.818772441 +0100 CET); removing then retrying: /var/lib/caddy/locks/issue_cert_autoconfig.test.webmsg.me.lock
```
Looking at the lock, it's a legitimate JSON file:
```
root@mail:/home/postmaster# hexdump -C /var/lib/caddy/locks/issue_cert_autoconfig.test.webmsg.me.lock
00000000  7b 22 63 72 65 61 74 65 64 22 3a 22 32 30 32 31  |{"created":"2021|
00000010  2d 31 31 2d 32 35 54 31 30 3a 32 39 3a 30 38 2e  |-11-25T10:29:08.|
00000020  34 31 35 34 39 38 32 34 31 2b 30 31 3a 30 30 22  |415498241+01:00"|
00000030  2c 22 75 70 64 61 74 65 64 22 3a 22 32 30 32 31  |,"updated":"2021|
00000040  2d 31 31 2d 32 35 54 31 31 3a 30 32 3a 35 39 2e  |-11-25T11:02:59.|
00000050  38 31 38 37 37 32 34 34 31 2b 30 31 3a 30 30 22  |818772441+01:00"|
00000060  7d 0a                                            |}.|
00000062
```
Trying to come up with a minimal config file, I removed `apps.http.tls.certificates.automate` (which does not even contain the offending host) and arrived at this minimal configuration:
```json
{
  "apps": {
    "http": {
      "servers": {
        "master": {
          "listen": [
            "127.0.0.1:443"
          ],
          "routes": [
            {
              "handle": [
                {
                  "body": "TEST",
                  "handler": "static_response",
                  "headers": {
                    "Content-Type": [
                      "text/plain; charset=utf-8"
                    ]
                  }
                }
              ],
              "match": [
                {
                  "host": [
                    "autoconfig.test.webmsg.me"
                  ]
                }
              ]
            }
          ]
        }
      }
    }
  },
  "logging": {
    "logs": {
      "default": {
        "level": "INFO",
        "writer": {
          "output": "stdout"
        }
      }
    }
  },
  "storage": {
    "module": "file_system",
    "root": "/var/lib/caddy"
  }
}
```
When I run Caddy with:

```
/usr/local/bin/caddy run --watch --config autosave-2.json
```

I get an endless stream of:

```
2021/12/04 23:16:09 [INFO][FileStorage:/var/lib/caddy] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /var/lib/caddy/locks/issue_cert_autoconfig.test.webmsg.me.lock
```
Got a minimal reproducer:
```
mkdir -p /tmp/caddy-bug/locks
touch /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock.unlock
echo '{"created":"2021-11-25T10:29:08.415498241+01:00","updated":"2021-11-25T11:02:59.818772441+01:00"}' >/tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
```
Config:

```json
{
  "apps": {
    "http": {
      "servers": {
        "master": {
          "listen": [
            "127.0.0.1:443"
          ],
          "routes": [
            {
              "handle": [
                {
                  "body": "TEST",
                  "handler": "static_response",
                  "headers": {
                    "Content-Type": [
                      "text/plain; charset=utf-8"
                    ]
                  }
                }
              ],
              "match": [
                {
                  "host": [
                    "autoconfig.test.webmsg.me"
                  ]
                }
              ]
            }
          ]
        }
      }
    }
  },
  "logging": {
    "logs": {
      "default": {
        "level": "INFO",
        "writer": {
          "output": "stdout"
        }
      }
    }
  },
  "storage": {
    "module": "file_system",
    "root": "/tmp/caddy-bug"
  }
}
```
Start Caddy and trigger the bug:

```
/usr/local/bin/caddy run --watch --config config.json
```
edit: there is a problem with my reproducer; disregard the output below.

There is really something around this domain that causes the bug. It's an NXDOMAIN, but not all NXDOMAINs cause the bug:
```
postmaster@mail:~/bug$ sudo ./trigger-bug.sh
+ CADDY=/home/postmaster/bug/caddy
+ CONFIG=/home/postmaster/bug/caddy-bug.json
+ CADDYHOME=/home/postmaster/bug/caddy-bug
+ DOMAIN=autoconfig.test.webmsg.me
+ mkdir -p /home/postmaster/bug/caddy-bug/locks
+ touch /home/postmaster/bug/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock.unlock
+ echo {"created":"2021-11-25T10:29:08.415498241+01:00","updated":"2021-11-25T11:02:59.818772441+01:00"}
+ [ -e /home/postmaster/bug/caddy ]
+ cat
+ /home/postmaster/bug/caddy run --watch --config /home/postmaster/bug/caddy-bug.json
2021/12/04 22:42:23.356 INFO using provided configuration {"config_file": "/home/postmaster/bug/caddy-bug.json", "config_adapter": ""}
2021/12/04 22:42:23.358 INFO redirected default logger {"from": "stderr", "to": "stdout"}
2021/12/04 22:42:23.361 INFO admin admin endpoint started {"address": "tcp/localhost:2019", "enforce_origin": false, "origins": ["localhost:2019", "[::1]:2019", "127.0.0.1:2019"]}
2021/12/04 22:42:23.363 INFO http server is listening only on the HTTPS port but has no TLS connection policies; adding one to enable TLS {"server_name": "master", "https_port": 443}
2021/12/04 22:42:23.363 INFO http enabling automatic HTTP->HTTPS redirects {"server_name": "master"}
2021/12/04 22:42:23.363 INFO tls.cache.maintenance started background certificate maintenance {"cache": "0xc00030fb90"}
2021/12/04 22:42:23.364 INFO http enabling automatic TLS certificate management {"domains": ["autoconfig.test.webmsg.me"]}
2021/12/04 22:42:23.366 INFO autosaved config (load with --resume flag) {"file": "/root/.config/caddy/autosave.json"}
2021/12/04 22:42:23.366 INFO serving initial configuration
2021/12/04 22:42:23.366 INFO watcher watching config file for changes {"config_file": "/home/postmaster/bug/caddy-bug.json"}
2021/12/04 22:42:23.368 INFO tls cleaning storage unit {"description": "FileStorage:/tmp/caddy-bug"}
2021/12/04 22:42:23.368 INFO tls finished cleaning storage units
2021/12/04 22:42:23.369 INFO tls.obtain acquiring lock {"identifier": "autoconfig.test.webmsg.me"}
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
(the same line repeats indefinitely, as fast as it can be written)
```
With another NXDOMAIN:

```
postmaster@mail:~/bug$ sudo ./trigger-bug.sh
+ CADDY=/home/postmaster/bug/caddy
+ CONFIG=/home/postmaster/bug/caddy-bug.json
+ CADDYHOME=/home/postmaster/bug/caddy-bug
+ DOMAIN=nxdomain.test.webmsg.me
+ mkdir -p /home/postmaster/bug/caddy-bug/locks
+ touch /home/postmaster/bug/caddy-bug/locks/issue_cert_nxdomain.test.webmsg.me.lock.unlock
+ echo {"created":"2021-11-25T10:29:08.415498241+01:00","updated":"2021-11-25T11:02:59.818772441+01:00"}
+ [ -e /home/postmaster/bug/caddy ]
+ cat
+ /home/postmaster/bug/caddy run --watch --config /home/postmaster/bug/caddy-bug.json
2021/12/04 22:42:50.845 INFO using provided configuration {"config_file": "/home/postmaster/bug/caddy-bug.json", "config_adapter": ""}
2021/12/04 22:42:50.847 INFO redirected default logger {"from": "stderr", "to": "stdout"}
2021/12/04 22:42:50.869 INFO admin admin endpoint started {"address": "tcp/localhost:2019", "enforce_origin": false, "origins": ["[::1]:2019", "127.0.0.1:2019", "localhost:2019"]}
2021/12/04 22:42:50.870 INFO tls.cache.maintenance started background certificate maintenance {"cache": "0xc0002b8000"}
2021/12/04 22:42:50.871 INFO http server is listening only on the HTTPS port but has no TLS connection policies; adding one to enable TLS {"server_name": "master", "https_port": 443}
2021/12/04 22:42:50.871 INFO http enabling automatic HTTP->HTTPS redirects {"server_name": "master"}
2021/12/04 22:42:50.872 INFO tls cleaning storage unit {"description": "FileStorage:/tmp/caddy-bug"}
2021/12/04 22:42:50.872 INFO tls finished cleaning storage units
2021/12/04 22:42:50.873 INFO http enabling automatic TLS certificate management {"domains": ["nxdomain.test.webmsg.me"]}
2021/12/04 22:42:50.875 INFO tls.obtain acquiring lock {"identifier": "nxdomain.test.webmsg.me"}
2021/12/04 22:42:50.876 INFO autosaved config (load with --resume flag) {"file": "/root/.config/caddy/autosave.json"}
2021/12/04 22:42:50.876 INFO serving initial configuration
2021/12/04 22:42:50.876 INFO watcher watching config file for changes {"config_file": "/home/postmaster/bug/caddy-bug.json"}
2021/12/04 22:42:51.418 INFO tls.obtain lock acquired {"identifier": "nxdomain.test.webmsg.me"}
2021/12/04 22:42:51.426 INFO tls.issuance.acme waiting on internal rate limiter {"identifiers": ["nxdomain.test.webmsg.me"], "ca": "https://acme-v02.api.letsencrypt.org/directory", "account": ""}
2021/12/04 22:42:51.427 INFO tls.issuance.acme done waiting on internal rate limiter {"identifiers": ["nxdomain.test.webmsg.me"], "ca": "https://acme-v02.api.letsencrypt.org/directory", "account": ""}
2021/12/04 22:42:54.259 INFO tls.issuance.acme.acme_client trying to solve challenge {"identifier": "nxdomain.test.webmsg.me", "challenge_type": "http-01", "ca": "https://acme-v02.api.letsencrypt.org/directory"}
2021/12/04 22:42:55.346 ERROR tls.issuance.acme.acme_client challenge failed {"identifier": "nxdomain.test.webmsg.me", "challenge_type": "http-01", "problem": {"type": "urn:ietf:params:acme:error:dns", "title": "", "detail": "DNS problem: NXDOMAIN looking up A for nxdomain.test.webmsg.me - check that a DNS record exists for this domain", "instance": "", "subproblems": []}}
2021/12/04 22:42:55.348 ERROR tls.issuance.acme.acme_client validating authorization {"identifier": "nxdomain.test.webmsg.me", "problem": {"type": "urn:ietf:params:acme:error:dns", "title": "", "detail": "DNS problem: NXDOMAIN looking up A for nxdomain.test.webmsg.me - check that a DNS record exists for this domain", "instance": "", "subproblems": []}, "order": "https://acme-v02.api.letsencrypt.org/acme/order/307839760/44777106950", "attempt": 1, "max_attempts": 3}
2021/12/04 22:42:57.050 INFO tls.issuance.acme.acme_client trying to solve challenge {"identifier": "nxdomain.test.webmsg.me", "challenge_type": "tls-alpn-01", "ca": "https://acme-v02.api.letsencrypt.org/directory"}
2021/12/04 22:42:57.761 ERROR tls.issuance.acme.acme_client challenge failed {"identifier": "nxdomain.test.webmsg.me", "challenge_type": "tls-alpn-01", "problem": {"type": "urn:ietf:params:acme:error:dns", "title": "", "detail": "DNS problem: NXDOMAIN looking up A for nxdomain.test.webmsg.me - check that a DNS record exists for this domain", "instance": "", "subproblems": []}}
2021/12/04 22:42:57.765 ERROR tls.issuance.acme.acme_client validating authorization {"identifier": "nxdomain.test.webmsg.me", "problem": {"type": "urn:ietf:params:acme:error:dns", "title": "", "detail": "DNS problem: NXDOMAIN looking up A for nxdomain.test.webmsg.me - check that a DNS record exists for this domain", "instance": "", "subproblems": []}, "order": "https://acme-v02.api.letsencrypt.org/acme/order/307839760/44777112450", "attempt": 2, "max_attempts": 3}
2021/12/04 22:43:00.356 ERROR tls.obtain could not get certificate from issuer {"identifier": "nxdomain.test.webmsg.me", "issuer": "acme-v02.api.letsencrypt.org-directory", "error": "[nxdomain.test.webmsg.me] solving challenges: nxdomain.test.webmsg.me: no solvers available for remaining challenges (configured=[http-01 tls-alpn-01] offered=[http-01 dns-01 tls-alpn-01] remaining=[dns-01]) (order=https://acme-v02.api.letsencrypt.org/acme/order/307839760/44777117970) (ca=https://acme-v02.api.letsencrypt.org/directory)"}
2021/12/04 22:43:00.360 WARN tls.issuance.zerossl missing email address for ZeroSSL; it is strongly recommended to set one for next time
^C2021/12/04 22:43:00.411 INFO shutting down {"signal": "SIGINT"}
2021/12/04 22:43:00.412 WARN exiting; byeee!! 👋 {"signal": "SIGINT"}
2021/12/04 22:43:00.417 INFO tls.cache.maintenance stopped background certificate maintenance {"cache": "0xc0002b8000"}
2021/12/04 22:43:00.418 ERROR tls.obtain could not get certificate from issuer {"identifier": "nxdomain.test.webmsg.me", "issuer": "acme.zerossl.com-v2-DV90", "error": "account pre-registration callback: performing EAB credentials request: Post \"https://api.zerossl.com/acme/eab-credentials-email\": context canceled"}
2021/12/04 22:43:00.420 ERROR tls.obtain will retry {"error": "[nxdomain.test.webmsg.me] Obtain: account pre-registration callback: performing EAB credentials request: Post \"https://api.zerossl.com/acme/eab-credentials-email\": context canceled", "attempt": 1, "retrying_in": 60, "elapsed": 8.999581352, "max_duration": 2592000}
2021/12/04 22:43:00.420 INFO tls.obtain releasing lock {"identifier": "nxdomain.test.webmsg.me"}
2021/12/04 22:43:00.421 INFO admin stopped previous server {"address": "tcp/localhost:2019"}
2021/12/04 22:43:00.421 INFO shutdown complete {"signal": "SIGINT", "exit_code": 0}
```
Got an easy reproducer, see first comment.
I believe I ran out of disk space at some point, which caused the bad lock state. I have since increased the VM disk size, but the bad state was kept in the lock files.
edit: The reason it did not happen every time is that the offending domain name is added later on, via the admin API, by an external script that needs to run first. Or Caddy needs to restore a config containing the offending domain.
Okay so it seems to me like there's an infinite loop that needs fixing.
I think it's in certmagic, not Caddy, actually. If you go searching for that error message in the code, you should be able to find it. (I'm away for the weekend, or I'd take a closer look myself right now.)
I didn't find the message in Caddy's code; it's probably in a dependency, yes.
Thanks for the added details, I will circle back around to this when I get a chance! (or someone else can beat me to it, of course)
We aren't checking the error from `removeLockFile()` here, after that log message is emitted, so it's possible that the removal is failing and causing an infinite loop.
However, the `removeLockFile()` function creates those `.unlock` files you reported earlier, and if those are left hanging around, we have another problem: a lingering `.unlock` file signals that something else is currently deleting the lock, so we just return right away and iterate the loop. That is probably what causes the infinite loop in this case.
It looks like, in the original implementation of this atomic unlocking, the point of being atomic was to clean up empty dirs after removing a lock, but now the caller does that in a separate function, `CleanStorage()`. So I wonder if we even need atomic unlocking anymore. We might be able to simply call `os.Remove()` and be OK 99.99~% of the time.
I do not know of a way to do an atomic remove that doesn't involve creating more state that can be interrupted, as happened here. It's rare, yes; but a lingering `.unlock` file is just as bad as a lingering `.lock` file...
Basically, I'm thinking it might be best to just change the "atomic unlock" to a regular `os.Remove()` call. Yes, this does mean that if two instances see the stale lock at the exact same time, and one is able to delete it and create the new one before the other deletes it, then the new lock will be deleted. I view this as probably even rarer than the original bug, and while it would result in two instances doing the same work at the same time, that's less detrimental for certificates than an infinite loop is.
Thoughts?
I went ahead and committed a fix that I hope will work. @mildred can you give it a try? The mutual exclusion guarantees are weaker now in the presence of stale locks, but I suspect races on those will be rarer than the infinite looping you were experiencing (and less problematic too).
I have a caddy server that is blocked in a loop, generating thousands of messages, seemingly as fast as it is able to, filling the hard drive with gigabytes of logs until the disk is full.
The log line is:
All log lines are exactly the same, at what I suspect is hundreds of lines per second.
Running caddy `v2.3.0 h1:fnrqJLa3G5vfxcxmOH/+kJOcunPLhSBnjgIvjXV/QTA=`, obtained with:

```
xcaddy build --with github.com/mholt/caddy-l4@master --output /usr/local/bin/caddy
```

edit: It seems to occur in more recent versions of Caddy too, and there are other log lines than this one filling up the logs. In the configuration that causes this error, Caddy is misconfigured to obtain TLS certificates for domains that do not point to this machine, resulting in a forever-failing attempt to obtain the certificates.
## 1. Environment

### 1a. Operating system and version

```
Debian GNU/Linux 11 (bullseye), running as a KVM guest (libvirt) on a bare-metal Fedora host on consumer hardware
```

### 1b. Caddy version (run `caddy version` or paste commit SHA)

```
v2.4.6 h1:HGkGICFGvyrodcqOOclHKfvJC0qTU7vny/7FhYp9hNw=
```

### 1c. Go version (if building Caddy from source; run `go version`)

Downloaded from https://caddyserver.com/api/download?os=linux&arch=amd64&p=github.com%2Fmholt%2Fcaddy-l4
## 2. Description

### 2a. What happens (briefly explain what is wrong)

Caddy generated hundreds of log lines at INFO level, filling up the machine's disk: more than 10G in a night.

### 2c. Log output

Too much to paste in full, see: https://github.com/caddyserver/caddy/issues/4448#issuecomment-985884595
### 2d. Workaround(s)

~~None yet; tweak log rotation to at least be able to purge old logs more easily while keeping some of them.~~

Cleaning up the locks directory, and especially removing the `*.lock.unlock` files where there are matching JSON `*.lock` files.

### 2e. Relevant links

N/A
## 3. Tutorial (minimal steps to reproduce the bug)

Run this shell script (as root, to be able to bind port 443):