caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com
Apache License 2.0

No matching certificate to load: decoding certificate metadata: invalid character '}' #6481

Closed: solracsf closed this issue 2 months ago

solracsf commented 3 months ago

Caddy v2.8.4 started printing an error for one domain:

Jul 27 12:06:59 caddy-proxy-sbg7 caddy[1208]: {"level":"info","ts":1722074819.7958932,"logger":"tls.on_demand","msg":"obtaining new certificate","remote_ip":"77.141.97.142","remote_port":"49984","server_name":"server.example.com"}
Jul 27 12:06:59 caddy-proxy-sbg7 caddy[1208]: {"level":"error","ts":1722074819.8085268,"logger":"tls.on_demand","msg":"loading newly-obtained certificate from storage","remote_ip":"77.141.97.142","remote_port":"49984","server_name":"server.example.com","error":"no matching certificate to load for server.example.com: decoding certificate metadata: invalid character '}' after top-level value"}
For reference, here is the content of the certificate metadata file on disk (note the extra closing brace at the end):

{
        "sans": [
                "server.example.com"
        ],
        "issuer_data": {
                "url": "https://acme-v02.api.letsencrypt.org/acme/cert/042fb47e52c51ab43af43257f1ba5e8",
                "ca": "https://acme-v02.api.letsencrypt.org/directory",
                "renewal_info": {
                        "suggestedWindow": {
                                "start": "2024-09-23T06:58:45.333333334Z",
                                "end": "2024-09-25T06:58:45.333333334Z"
                        },
                        "_uniqueIdentifier": "kydGmAOpUWiOmNbEI.BC-0flLFGrQ69DL_idV_G6Xo",
                        "_retryAfter": "2024-07-27T09:40:01.93861755+02:00",
                        "_selectedTime": "2024-09-24T03:22:43Z"
                }
        }
}}

Removing the extra } at the bottom of the file /caddy/certificates/acme-v02.api.letsencrypt.org-directory/server.example.com/server.example.com.json fixes the error.
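
As a quick aside for reproducibility: the error text comes from Go's encoding/json, which rejects any trailing data after the top-level value. A minimal sketch with made-up file contents (not Caddy's actual loading code):

package main

import (
        "encoding/json"
        "fmt"
)

func main() {
        // Hypothetical metadata contents: a valid object followed by a stray '}'.
        damaged := []byte(`{"sans":["server.example.com"]}}`)

        var meta map[string]any
        if err := json.Unmarshal(damaged, &meta); err != nil {
                // Prints: invalid character '}' after top-level value
                fmt.Println(err)
        }
}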

mholt commented 3 months ago

Hey Skifree, thanks for reporting this. :wink:

That's weird! @elee1766 Is this possibly due to a race condition you were referring to in Slack? I don't think I've seen a malformed JSON file before, only an empty one.

elee1766 commented 3 months ago

this looks like one file is written on top of another without clearing the first one.

if a read is called after fsync, which is called after truncate and after write, but before flush/close, this could happen I think? the new cert would have to be one character smaller than the old cert. (is this possible? are any fields variable-length?).

but it's reading the old file size, so maybe something else happened at write time. regardless, something happened during file write, maybe also during a read.
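
A rough sketch of that failure mode (made-up contents and path, not certmagic's actual write path): if the new metadata is written over the old file without truncating it first, and the new content is one byte shorter, the old file's final '}' survives:

package main

import (
        "fmt"
        "os"
)

func main() {
        path := "server.example.com.json" // hypothetical metadata file

        // First write stores the old metadata (32-char serial); errors ignored for brevity.
        os.WriteFile(path, []byte(`{"serial":"042fb47e52c51ab43af43257f1ba5e88"}`), 0o600)

        // Second write opens without O_TRUNC and writes content one byte shorter.
        f, _ := os.OpenFile(path, os.O_WRONLY, 0o600)
        f.Write([]byte(`{"serial":"042fb47e52c51ab43af43257f1ba5e8"}`))
        f.Close()

        // The stale last byte of the old content remains, so the file now ends in "}}".
        b, _ := os.ReadFile(path)
        fmt.Println(string(b))
}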

but yes, my guess is the temp file approach fixes this @mholt
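
For reference, the general temp-file pattern looks roughly like this (a sketch under the assumption that rename is atomic on the target filesystem; writeFileAtomic is a hypothetical helper, not the actual code in the certmagic patch):

package main

import (
        "os"
        "path/filepath"
)

// writeFileAtomic writes data to a temporary file in the target's directory
// and then renames it into place, so a concurrent reader sees either the whole
// old file or the whole new one, never a mixture of the two.
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
        tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
        if err != nil {
                return err
        }
        defer os.Remove(tmp.Name()) // best-effort cleanup; harmless after a successful rename

        if _, err := tmp.Write(data); err != nil {
                tmp.Close()
                return err
        }
        if err := tmp.Chmod(perm); err != nil {
                tmp.Close()
                return err
        }
        if err := tmp.Sync(); err != nil { // flush contents before renaming
                tmp.Close()
                return err
        }
        if err := tmp.Close(); err != nil {
                return err
        }
        return os.Rename(tmp.Name(), path)
}

func main() {
        // Hypothetical usage on a certificate metadata file.
        _ = writeFileAtomic("server.example.com.json",
                []byte(`{"sans":["server.example.com"]}`), 0o600)
}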

I doubt go's json encoder would do this.

mholt commented 3 months ago

@elee1766 Thanks for your input! Yeah, I think the _uniqueIdentifier may possibly vary by 1-2 chars. But I'm not 100% sure on that. It's related to the ASN.1 encoding of the cert's serial number.

But anyway, it sounds like we have a fix in the pipeline for this already. :)

mholt commented 3 months ago

@solracsf I don't suppose you happen to have the previous JSON file for this cert?

elee1766 commented 3 months ago

also, did anything weird happen to the server? like a reboot, process restart, or config reload, when it started happening? also, are you running only one instance of caddy?

there's also a third variable: maybe two certs were being written at once somehow. iirc the lock in filestorage is not perfect. @mholt isn't this possible? (for the imperfect file-based lock to cause parallel writes?)

solracsf commented 3 months ago

I have 2 caddy instances, 2 distinct servers. Caddy storage is a JuiceFS mount. JuiceFS supports both BSD locks (flock) and POSIX record locks (fcntl).

Problem started when servers rebooted (manually, one after another, to apply kernel updates). This has been done before many times without any issues.

# GLOBAL options
{
        email email@netc.eu

        storage file_system {
                root /mnt/caddyvol/caddy
        }

        servers {
                strict_sni_host on
                protocols h1 h2
                trusted_proxies_strict
        }

        on_demand_tls {
                ask http://ask.localhost/check
        }
}
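
For context, Caddy calls the ask endpoint with the candidate hostname in the domain query parameter before issuing an on-demand certificate, and only proceeds on a 2xx response. A minimal sketch of such an endpoint (the allow-list here is hypothetical):

package main

import (
        "log"
        "net/http"
)

func main() {
        // Hypothetical allow-list; in practice this would be a database or API lookup.
        allowed := map[string]bool{"server.example.com": true}

        http.HandleFunc("/check", func(w http.ResponseWriter, r *http.Request) {
                domain := r.URL.Query().Get("domain")
                if allowed[domain] {
                        w.WriteHeader(http.StatusOK) // 2xx: Caddy may obtain a certificate
                        return
                }
                w.WriteHeader(http.StatusForbidden) // non-2xx: refuse issuance
        })

        // Would need to be reachable at the ask URL configured above (http://ask.localhost/check).
        log.Fatal(http.ListenAndServe("127.0.0.1:80", nil))
}
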
elee1766 commented 3 months ago

@mholt yes, at this point i'm convinced the tempfile patch would make this not happen. it looks like two caddy instances were writing to the same file at the same time.

@solracsf here is the relevant PR, if you are curious https://github.com/caddyserver/certmagic/pull/300

from what i see - juicefs rename is atomic? so this PR should work for you.

btw, certmagic storage doesn't use advisory locks yet. i opened an issue about this: https://github.com/caddyserver/certmagic/issues/295. I personally don't have a use case for this, since i do not use the filesystem to share certs across multiple instances, so i didn't end up finishing this.

if you have the time it would be nice to add this as a feature. matt said he is happy to accept a PR to correctly use filesystem-level locks instead of the fake-lock that filesystem storage currently uses. I would be happy to review as well. basically just need to check on boot if the filesystem supports a locking method, and use it, otherwise fall back to the existing strategy
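
A rough sketch of what that could look like (hypothetical tryFlock helper using flock on Unix; not existing certmagic code, and the lock path is made up):

package main

import (
        "os"
        "syscall"
)

// tryFlock takes a non-blocking exclusive flock on the given lock file and
// returns the open file, which holds the lock until it is closed. If the call
// fails (lock held elsewhere, or the filesystem doesn't support flock), the
// caller would fall back to the existing lock-file strategy. Unix-only.
func tryFlock(path string) (*os.File, error) {
        f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
        if err != nil {
                return nil, err
        }
        if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
                f.Close()
                return nil, err
        }
        return f, nil
}

func main() {
        // Hypothetical lock path inside the storage root from the Caddyfile above.
        if lock, err := tryFlock("/mnt/caddyvol/caddy/locks/issue_cert_server.example.com.flock"); err == nil {
                defer lock.Close() // closing the file releases the advisory lock
                // ... perform the certificate operation ...
        }
}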

that said, i would seriously recommend just using redis with https://github.com/pberkel/caddy-storage-redis or https://github.com/ss098/certmagic-s3 with the juicefs s3 gateway, or writing your own certmagic plugin for s3, for any production setting.

solracsf commented 3 months ago

that said, i would seriously recommend just using redis with https://github.com/pberkel/caddy-storage-redis or https://github.com/ss098/certmagic-s3 with the juicefs s3 gateway, or writing your own certmagic plugin for s3, for any production setting.

I understand, but we prefer not to add plugins to Caddy. In our env, JuiceFS has been handling terabytes of data without any issue for 2 years now; we prefer to keep it as our filesystem abstraction layer and rely on Caddy's certmagic filesystem storage.

We'll keep an eye on https://github.com/caddyserver/certmagic/pull/300 and provide feedback. 👍

mholt commented 3 months ago

@solracsf You can try it now, if you build Caddy with xcaddy build --with github.com/caddyserver/certmagic=github.com/caddyserver/certmagic@master.

solracsf commented 3 months ago

It will be hard to say soon whether it fixes anything, because Caddy had been installed and running for 2 years with this setup without any issues 🤔 Nevertheless, I've built our 2 servers with the latest certmagic@master and will report back if this happens again. Feel free to close this issue for now.

lqs commented 2 months ago

If you are running multiple instances of Caddy with shared storage, you may encounter a locking issue introduced in certmagic v0.21.

See https://github.com/caddyserver/certmagic/issues/303

mholt commented 2 months ago

Yeah, sorry; I'm behind on releases. Trying to catch up on things!

aa-shalin commented 3 weeks ago

hey @mholt, I'm facing the exact same issue. I recently saw some JSON files containing an additional } at the end of the file, which led to this error being thrown by Caddy:

{"error":"no matching certificate to load for insights.example.se: decoding certificate metadata: invalid character '}' after top-level value", "level":"debug", "logger":"tls.handshake", "msg":"did not load cert from storage", "remote_ip":"10.142.0.147", "remote_port":"11452", "server_name":"insights.example.se", "ts":1.7288959881120608E9}

Do we have any fix for this issue? I'm already using Caddy v2.8.4 and running multiple instances of Caddy with a shared NFS volume.

Would I be able to solve this issue by moving to Google Cloud Storage/S3?

francislavoie commented 3 weeks ago

The workaround is to wipe out Caddy's storage and restart Caddy. It'll reissue all your certificates, but that should be fine assuming you have only a handful. Or, you can dive into Caddy's storage and look at all the .json files, fix their syntax manually if possible (or delete the broken ones to get Caddy to regenerate them).
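
If wiping everything isn't practical, a small sketch like this (a hypothetical standalone script; adjust the storage path) can walk the storage directory and report which .json metadata files no longer parse, so only those need fixing or deleting:

package main

import (
        "encoding/json"
        "fmt"
        "io/fs"
        "os"
        "path/filepath"
        "strings"
)

func main() {
        root := "/path/to/caddy/storage/certificates" // adjust to your storage root

        filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
                if err != nil || d.IsDir() || !strings.HasSuffix(path, ".json") {
                        return err
                }
                data, err := os.ReadFile(path)
                if err != nil {
                        return err
                }
                if !json.Valid(data) {
                        // Candidate for manual repair or deletion; Caddy will regenerate it.
                        fmt.Println("invalid JSON:", path)
                }
                return nil
        })
}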

In v2.9.0 it should be fixed by https://github.com/caddyserver/certmagic/pull/300

aa-shalin commented 3 weeks ago

The workaround is to wipe out Caddy's storage and restart Caddy. It'll reissue all your certificates, but that should be fine assuming you have only a handful.

In v2.9.0 it should be fixed by caddyserver/certmagic#300

Thank you @francislavoie! I'll upgrade to v2.9.0 and check all the JSON files as i have > 1k certs.

aa-shalin commented 3 weeks ago

@francislavoie I see that v2.9.0 is still in beta, could I build v2.8.4 with certmagic@master instead? I'm already building the Caddy binary using xcaddy

francislavoie commented 3 weeks ago

Upgrading won't fix the already-broken storage; it'll just prevent further breakage of the storage. You could use the beta; it's pretty stable, just missing a few things we wanted to fit into the final release.