Closed by aarongable 3 months ago
Oh yeah that makes sense. If you get like 33 requests within a second, then all 33 of them might be triggering ARI via on-demand maintenance. I think certmagic needs to use https://pkg.go.dev/sync#WaitGroup to make sure it only gets fired off a single time per window.
I think this is a thundering herd issue. We probably need https://pkg.go.dev/golang.org/x/sync/singleflight with the domain as key.
> every 6 hours, if respecting the Retry-After header
Is this zoned per queried hostname or per client?
> Is this zoned per queried hostname or per client?
Neither, it's per certificate. A single certificate may have multiple hostnames, but also a single client may manage multiple certificates.
I'll look into this when I'm back at my desk.
> The ARI suggestedWindow should be cached for the duration provided by the Retry-After header in the ARI response.
CertMagic does honor the Retry-After header, if present, by calling acme.RenewalInfo.NeedsRefresh().
I do agree this is likely a thundering herd, where many calls to update ARI come in before the first one finishes, since it lacks synchronization.
We can synchronize ARI fetching using the configured storage plugin. This will prevent any more than 1 instance in a cluster from fetching ARI at the same time, and after the first one does, the others will load and use its result.
Depending on the storage plugin, it's possible that this locking will be more expensive than actually fetching ARI, but it only happens once in a while, so maybe it's OK.
Thanks for the report! This should fix it; without an offending client to test with I can only guess, but it makes sense to me.
I've synchronized the ARI fetching by the ARI UniqueIdentifier.
Thank you for the investigation and fix! We plan to keep an intermittent eye on ARI traffic patterns for a while, so I'll let you know if I see anything else jump out at me.
As far as I can tell, this isn't fixed by 16e2e0b3443037882be32c731d1e85a90cb69014. I'm able to repro the extra `}` error reliably (multiple times per hour) regardless of the lengths of the original and new files, so it's not just a simple truncation issue. It's always one extra `}`, even though file sizes often differ by more than one character.
Removing the extraneous closing braces restores sanity, but only briefly. They keep reappearing in new files.
@Zenexer how did you build Caddy?
I don't have my computer handy, but it was a simple Dockerfile with the usual xcaddy --with syntax, certmagic@master.
```console
% docker compose exec caddy caddy build-info | grep -F certmagic
dep github.com/caddyserver/certmagic v0.21.4-0.20240905165329-aad674cda5ba h1:p4QJmH7xVg2OhcbHBP8dGTdl6raoVQMA8IHjDka6R0Y=
```
```Dockerfile
FROM caddy:builder AS builder
RUN xcaddy build \
    --with github.com/caddyserver/certmagic@master

FROM caddy:latest
COPY --from=builder /usr/bin/caddy /usr/bin/caddy
EXPOSE 80
EXPOSE 443
EXPOSE 443/udp
```
@Zenexer Did you mean to comment on a different issue (#303)? This one is about superfluous ARI requests, not malformed files.
No, I was operating under the assumption that the author was correct in linking them, which could be incorrect.
I see. You might actually want this commit: https://github.com/caddyserver/certmagic/pull/300
If that's on master, I would've included it just to be safe.
What version of the package are you using?
User Agent: "CertMagic acmez (linux; 386)"
What are you trying to do?
Look at outliers in Let's Encrypt's ARI request data.
What steps did you take?
Queried LE's observability database for how many times ARI is queried for each certificate.
What did you expect to happen, and what actually happened instead?
I expect CertMagic to query ARI on a regular basis (e.g. every 6 hours, if respecting the Retry-After header), perhaps with some jitter to prevent clustering.
Instead, I see that CertMagic clients are requesting ARI data for a single serial 60+ times in a single day, including bursts of up to 33 requests in a single second.
How do you think this should be fixed?
I suspect this may be related to querying ARI in the "on-demand TLS" code path, and the misbehaving servers may be getting crawled and generating many requests due to that.
The ARI suggestedWindow should be cached for the duration provided by the Retry-After header in the ARI response.
Please link to any related issues, pull requests, and/or discussion
https://github.com/caddyserver/certmagic/pull/286