caddyserver / certmagic

Automatic HTTPS for any Go program: fully-managed TLS certificate issuance and renewal
https://pkg.go.dev/github.com/caddyserver/certmagic?tab=doc
Apache License 2.0
5k stars 289 forks source link

Sometimes generating 33 ARI requests in a single second #297

Closed aarongable closed 3 months ago

aarongable commented 3 months ago

What version of the package are you using?

User Agent: "CertMagic acmez (linux; 386)"

What are you trying to do?

Look at outliers in Let's Encrypt's ARI request data.

What steps did you take?

Queried LE's observability database for how many times ARI is queried for each certificate.

What did you expect to happen, and what actually happened instead?

I expect CertMagic to query ARI on a regular basis (e.g. every 6 hours, if respecting the Retry-After header), perhaps with some jitter to prevent clustering.

Instead, I see that CertMagic clients are requesting ARI data for a single serial 60+ times in a single day, including bursts of up to 33 requests in a single second.

How do you think this should be fixed?

I suspect this may be related to querying ARI in the "on-demand TLS" code path, and the misbehaving servers may be getting crawled and generating many requests due to that.

The ARI suggestedWindow should be cached for the duration provided by the Retry-After header in the ARI response.

Please link to any related issues, pull requests, and/or discussion

https://github.com/caddyserver/certmagic/pull/286

francislavoie commented 3 months ago

Oh yeah that makes sense. If you get like 33 requests within a second, then all 33 of them might be triggering ARI via on-demand maintenance. I think certmagic needs to use https://pkg.go.dev/sync#WaitGroup to make sure it only gets fired off a single time per window.

mohammed90 commented 3 months ago

I think this is a thundering herd issue. We probably need https://pkg.go.dev/golang.org/x/sync/singleflight with the domain as key.

every 6 hours, if respecting the Retry-After header

Is this zoned per queried hostname or per client?

aarongable commented 3 months ago

Is this zoned per queried hostname or per client?

Neither, it's per certificate. A single certificate may have multiple hostnames, but also a single client may manage multiple certificates.

mholt commented 3 months ago

I'll take a look into this when I'm back at my desk

mholt commented 3 months ago

The ARI suggestedWindow should be cached for the duration provided by the Retry-After header in the ARI response.

CertMagic does honor the Retry-After header, if present, by calling acme.RenewalInfo.NeedsRefresh().

I do agree this is likely a thundering herd, where many calls to update ARI come in before the first one finishes, since it lacks synchronization.

We can synchronize ARI fetching using the configured storage plugin. This will prevent any more than 1 instance in a cluster from fetching ARI at the same time, and after the first one does, the others will load and use its result.

Depending on the storage plugin, it's possible that this locking will be more expensive than actually fetching ARI, but it only happens once in a while, so maybe it's OK.

mholt commented 3 months ago

Thanks for the report! This should fix it but without an offending client to test with I can only guess, but it makes sense to me.

I've synchronized the ARI fetching by the ARI UniqueIdentifier.

aarongable commented 3 months ago

Thank you for the investigation and fix! We plan to keep an intermittent eye on ARI traffic patterns for a while, so I'll let you know if I see anything else jump out at me.

Zenexer commented 1 week ago

As far as I can tell, this isn't fixed by 16e2e0b3443037882be32c731d1e85a90cb69014. I'm able to repro the extra } error reliably--multiple times per hour--regardless of the lengths of the original and new files, so it's not just a simple truncation issue. It's always one extra }, even though file sizes often differ by more than one character.

Removing the extraneous closing braces restores sanity, but only briefly. They keep reappearing in new files.

francislavoie commented 1 week ago

@Zenexer how did you build Caddy?

Zenexer commented 1 week ago

I don't have my computer handy, but it was a simple Dockerfile with the usual xcaddy --with syntax, certmagic@master.

Zenexer commented 1 week ago
% docker compose exec caddy caddy build-info | grep -F certmagic
dep     github.com/caddyserver/certmagic        v0.21.4-0.20240905165329-aad674cda5ba   h1:p4QJmH7xVg2OhcbHBP8dGTdl6raoVQMA8IHjDka6R0Y=
FROM caddy:builder AS builder

RUN xcaddy build \
    --with github.com/caddyserver/certmagic@master

FROM caddy:latest

COPY --from=builder /usr/bin/caddy /usr/bin/caddy

EXPOSE 80
EXPOSE 443
EXPOSE 443/udp
mholt commented 1 week ago

@Zenexer Did you mean to comment on a different issue (#303)? This one is about superfluous ARI requests, not malformed files.

Zenexer commented 1 week ago

No, I was going under the assumption that the author was correct in linking them--which could be incorrect.

mholt commented 1 week ago

I see. You might actually want this commit: https://github.com/caddyserver/certmagic/pull/300

Zenexer commented 1 week ago

If that's on master, I would've included it just to be safe.