Corrupted metadata JSON files caused by bug #297

lqs commented 4 months ago

Bug #297 frequently leads to corrupted JSON files when multiple instances mount a shared NFS directory as storage. We’ve encountered numerous cases in production where this causes certificate failures with the error message "decoding certificate metadata: invalid character '}' after top-level value", rendering the affected sites completely unusable.

The number of corrupted files is increasing, and I can’t restart the service.

I see the bug fix is in the master branch. Please release a new version that includes this fix. Additionally, how can I identify and repair the already corrupted files among thousands without deleting all files? Alternatively, is there a way to ignore corrupted files during loading?

francislavoie commented 4 months ago

If you're in a hurry, you can make a build yourself with this:

xcaddy build --with github.com/caddyserver/certmagic@16e2e0b

mholt commented 1 month ago

Curious what evidence you have that this and #297 are related -- how do you know that superfluous ARI requests are corrupting files?

lqs commented 1 month ago

I’m using certmagic as a library for a web server that serves front-end files and supports custom domains. When the error occurred, I checked the logs and file modification times and found that an extra ‘}’ appeared after multiple ARI updates. After reviewing the source code, I suspect this issue is due to concurrent writes to NFS.

Prior to the error, I was already preparing to migrate from NFS to S3 and implemented a custom storage to access S3. After the error, I expedited the migration process. Since S3 writes are atomic, it prevented the issue, even though redundant ARI requests still remain.

mholt commented 1 month ago

NFS has known bugs related to synchronization; that might be the actual problem.

S3 does not provide atomic operations for us to be able to safely offer synchronization, even if writes are synced. I recommend using a database like MySQL/Postgres/Redis for high concurrency distributed storage.

mholt commented 1 month ago

Moving another discussion with @Zenexer into here:

As far as I can tell, this isn't fixed by 16e2e0b3443037882be32c731d1e85a90cb69014. I'm able to repro the extra } error reliably--multiple times per hour--regardless of the lengths of the original and new files, so it's not just a simple truncation issue. It's always one extra }, even though file sizes often differ by more than one character.

Removing the extraneous closing braces restores sanity, but only briefly. They keep reappearing in new files.

Originally posted by @Zenexer in #297

Anyway, @lqs, from what you're saying:

Using NFS, files have extra }
Using S3, files don't have extra }, but redundant ARI requests still happen

This actually checks out with known issues with both of those storage backends (as noted just above). NFS has sync/flush issues when it comes to concurrent users over a network; and S3 doesn't provide atomic operations, so proper locking/syncing of an operation like an ARI request is impossible.

@Zenexer, are you also using NFS perchance?

Zenexer commented 1 month ago

I spent about a dozen hours debugging this yesterday, and I believe my initial comment was incorrect: rather than the bug persisting, I believe I just hadn't sufficiently cleaned all of the existing corrupt files. There were situations in which there were two trailing bytes at the end of a file (\n}), and my cleanup script didn't account for that.
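For anyone facing the same cleanup, here is a minimal sketch of a repair helper that covers both shapes described above: a bare trailing } and a trailing \n}. It is not part of certmagic; the function name repairTrailingBrace and the sample payloads are made up for illustration, and it assumes the only damage is stray closing braces after an otherwise valid top-level JSON value.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// errNotRepairable is returned when trimming trailing braces does not help.
var errNotRepairable = errors.New("not repairable by trimming trailing braces")

// repairTrailingBrace (hypothetical helper) returns data unchanged if it is
// already valid JSON. Otherwise it strips trailing '}' bytes (and any
// whitespace before them) one at a time until the remainder is valid JSON.
func repairTrailingBrace(data []byte) (out []byte, changed bool, err error) {
	if json.Valid(data) {
		return data, false, nil
	}
	trimmed := bytes.TrimRight(data, " \t\r\n")
	for len(trimmed) > 0 && trimmed[len(trimmed)-1] == '}' {
		trimmed = bytes.TrimRight(trimmed[:len(trimmed)-1], " \t\r\n")
		if json.Valid(trimmed) {
			return trimmed, true, nil
		}
	}
	return nil, false, errNotRepairable
}

func main() {
	// Illustrative payloads only; real metadata files have other fields.
	samples := []string{
		"{\"a\":1}",    // already valid
		"{\"a\":1}}",   // one extra brace
		"{\"a\":1}\n}", // newline followed by an extra brace
		"{\"a\":",      // truncated; not repairable this way
	}
	for _, s := range samples {
		out, changed, err := repairTrailingBrace([]byte(s))
		fmt.Printf("%q -> %q changed=%v err=%v\n", s, out, changed, err)
	}
}
```

In practice one would probably walk a copy of the storage directory (for example with filepath.WalkDir), run each .json file through the helper, and only write back files where changed is true, keeping a backup of the originals.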

I am using NFS, but it does appear to support locking correctly with my current mount options--or, at least, in a way that is compatible with this patch. I'm not a huge fan of NFS and generally don't trust it, but it should work with this lock/write pattern. I doubt it would ever make sense to officially support NFS given how fickle it is, but the locking code in certmagic is straightforward enough that I should be able to debug and patch it if there are further issues.

The one thing that still has me a little worried is a disconcerting number of requests to the on-demand ask endpoint. The docs make it sound as though that's to be expected, but it's accompanied by a large number of log entries related to ARI. I don't think it's a bug--it's probably just a coincidence--but I happened to notice it while troubleshooting.

francislavoie commented 1 month ago

That's good to hear! Yours was the only feedback so far that it didn't fix the issue, so it's reassuring that it was an oversight.

mholt commented 1 month ago

That's a relief, thanks for the follow-up.

The one thing that still has me a little worried is a disconcerting number of requests to the on-demand ask endpoint. The docs make it sound as though that's to be expected, but it's accompanied by a large number of log entries related to ARI. I don't think it's a bug--it's probably just a coincidence--but I happened to notice it while troubleshooting.

The ask endpoint can be busy... we could potentially ease this with a bloom filter or something, that we just reset every 5 or 10 minutes (or something like that). But ideally I'd rather the ask endpoint itself do the caching since it knows better logic than we can guess.

I'd be curious if the ARI log entries are redundant (same hostname) or not. I really want that to be fixed (AFAIK it should be already).

Zenexer commented 1 month ago

The ask endpoint can be busy... we could potentially ease this with a bloom filter or something, that we just reset every 5 or 10 minutes (or something like that). But ideally I'd rather the ask endpoint itself do the caching since it knows better logic than we can guess.

I would assume that any Caddy user with that sort of traffic probably has caching on their ask endpoint anyway and can keep it fast. From my perspective, though, I'd like to be able to log exactly when I've told Caddy it was authorized to go out and request a certificate: having that logging on my application helps me with troubleshooting, since I can use that to determine where in the stack a problem is arising that might be leading to excessive certificate requests. As it currently stands, I don't know whether an ask is for a certificate request/renewal, an ARI request, or some other maintenance task being performed by Caddy--it's a black box.

I'd be curious if the ARI log entries are redundant (same hostname) or not. I really want that to be fixed (AFAIK it should be already).

I'll try to figure that out, but Caddy is spitting out gigabytes of log data with on-demand TLS enabled, so I'm still trying to sort out what's important and what's not.

Zenexer commented 1 month ago

I think most of the log entries are the result of various hosting providers and registrars trying to request or renew certificates for domains that no longer point to them, with no checking on their end prior to starting the ACME challenge process. That makes it really difficult to tell the difference between legitimate ACME-related error messages and errors that I can safely ignore. (Ugh, I really wish CAs wouldn't waive rate limiting on challenge failures for large integrators--it hurts everyone.) I don't think that should affect ARI log messages, so I'll let docker compose logs -fn0 | grep -F '"updated ACME renewal information"' run and see what turns up.

Zenexer commented 1 month ago

I'm not seeing any overlap between ARI requests so far. Each one is unique.

mholt commented 1 month ago

I'd like to be able to log exactly when I've told Caddy it was authorized to go out and request a certificate; having that logging on my application helps me with troubleshooting, since I can use that to determine where in the stack a problem is arising that might be leading to excessive certificate requests. As it currently stands, I don't know whether an ask is for a certificate request/renewal, an ARI request, or some other maintenance task being performed by Caddy--it's a black box.

To make sure I understand, you want a way for the 'ask' request to distinguish whether a certificate is being obtained or something else?

The only times the 'ask' endpoint is invoked are currently when a certificate needs to be obtained or renewed. It does not guard ARI requests or other maintenance, per-se, though in theory it should be guarding them implicitly, because if you cannot obtain or renew a cert, you cannot maintain it either.

Technically, 'ask' is invoked before even trying to load a certificate from storage (as that can be expensive depending on the storage backend).

So I guess, to your request, I would say that it shouldn't matter, but I'm open to discussing this further if desired.

I think most of the log entries are the result of various hosting providers and registrars trying to request or renew certificates for domains that no longer point to them, with no checking on their end prior to starting the ACME challenge process. That makes it really difficult to tell the difference between legitimate ACME-related error messages and errors that I can safely ignore. (Ugh, I really wish CAs wouldn't waive rate limiting on challenge failures for large integrators--it hurts everyone.) I don't think that should affect ARI log messages, so I'll let docker compose logs -fn0 | grep -F '"updated ACME renewal information"' run and see what turns up.

I was one who advocated for exemptions to the rate limits when conforming to ARI, out of concerns that certificate renewals would be rejected -- sometimes past their expiration -- on account of rate limits, even though it was the CA who specified the renewal window. So to ensure certificates can be renewed even if they have to be squished into a narrow window, Let's Encrypt (rightly) exempts clients from rate limits in that situation. Why do you think it hurts everyone?

Is Caddy attempting to renew lots of certificates for you and failing?

I'm not seeing any overlap between ARI requests so far. Each one is unique.

That's good, so it sounds like the synchronization is working. :+1: Thanks for checking on that.

Zenexer commented 1 month ago

To make sure I understand, you want a way for the 'ask' request to distinguish whether a certificate is being obtained or something else?

Yes, mostly for debugging purposes. If I see that multiple Caddy instances are all trying to get certs at the same time, that's a sign something is amiss. It would likely help when troubleshooting future concurrency issues, but a lightweight plugin could probably serve the same purpose.

I'm still not 100% confident the remaining errors I'm seeing are benign, but I'll have more data over the next few days.

Zenexer commented 1 month ago

I was one who advocated for exemptions to the rate limits when conforming to ARI, out of concerns that certificate renewals would be rejected -- sometimes past their expiration -- on account of rate limits, even though it was the CA who specified the renewal window. So to ensure certificates can be renewed even if they have to be squished into a narrow window, Let's Encrypt (rightly) exempts clients from rate limits in that situation. Why do you think it hurts everyone?

Sorry, I realized I forgot to answer this question. I don't have an opinion on the scenario you mentioned. What appears to be happening is twofold:

1. foo.example used to point to <very large hosting provider>, but now points to me. Said hosting provider doesn't even bother to check whether the domain points to them before trying to renew their certificates. It might not have pointed to them for months, and they just don't have any reason to bother cleaning up.
2. Vulnerability scanners are hitting /.well-known/acme-challenge/*

Access logs have since shown that the second issue accounts for the majority of the "no challenge data found" warnings I was seeing. Disabling HTTP-01 mostly resolved that.

The first issue is far more problematic: I can't stop other hosting providers from requesting certs, and they're wasting PKI resources. It also wastes my time because they cause concerning log entries. These large hosting providers don't really have any incentive to check whether hosts point to them before requesting a cert; they just offload that to the CA.

Meanwhile, I have to check that a domain points to me--and only to me--when Caddy hits the ask endpoint. Caddy's docs explicitly say I shouldn't do this, but I can't go around requesting certs for domains just because I think they might point to me. I have to actually verify that, then cache the results for a while. If I don't do that, I'll get rate limited pretty quickly.

mholt commented 1 month ago

@Zenexer It sounds like your ask endpoint needs to check to make sure you are expecting to be getting a certificate for those domain names.

I have to check that a domain points to me--and only to me--when Caddy hits the ask endpoint.

Not exactly; why not check your database (or whatever is relevant to your application/service) to see if you should be expecting to maintain a certificate for a hostname? That's the purpose of the ask endpoint and it should resolve the rate limit problems, yeah?

Zenexer commented 1 month ago

It sounds like your ask endpoint needs to check to make sure you are expecting to be getting a certificate for those domain names.

It already does that. I am expecting to get certificates for those domains.

Scenario 1: I control example.com. Large hosting provider used to control example.com, and they used to have certificates for it. They keep trying to renew those certificates. My ask endpoint needs to return 200, because I need certificates.

Scenario 2: I control example.com. Someone runs Acunetix against example.com. For whatever reason, it starts brute forcing random paths with a prefix of /.well-known/acme-challenge/--for example, /.well-known/acme-challenge/xmlrpc.php. I want certificates for example.com, so the ask endpoint returns 200.

Not exactly; why not check your database (or whatever is relevant to your application/service) to see if you should be expecting to maintain a certificate for a hostname? That's the purpose of the ask endpoint and it should resolve the rate limit problems, yeah?

I do. However, I'm running a free service to which non-technical users can point their domains. They might use my service for a while, decide they don't like it, and point their domain elsewhere. I can't trust that the users are going to maintain a list of domains that point to me, so I do have to validate the A/AAAA records before requesting a certificate. The result gets cached in Redis for a while, so subsequent calls to the ask endpoint are very fast.
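The check described here (verify A/AAAA records, then cache the verdict for a while) could be sketched roughly as follows. This is only an illustration under stated assumptions: dnsChecker, staticLookup and defaultTTL are hypothetical names, an in-process map stands in for Redis, and the resolver is injected so the demo runs without network access (net.LookupIP has the same signature and could be passed instead).

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"sync"
	"time"
)

// defaultTTL is an arbitrary cache lifetime chosen for this sketch.
const defaultTTL = 5 * time.Minute

// lookupFunc resolves a hostname to its A/AAAA addresses.
type lookupFunc func(host string) ([]net.IP, error)

type cachedResult struct {
	ok      bool
	expires time.Time
}

// dnsChecker (hypothetical) allows a domain only if every resolved address
// belongs to us, and remembers the verdict until it expires.
type dnsChecker struct {
	mu     sync.Mutex
	ours   map[string]bool
	lookup lookupFunc
	ttl    time.Duration
	cache  map[string]cachedResult
}

func newDNSChecker(ourIPs []string, lookup lookupFunc, ttl time.Duration) *dnsChecker {
	ours := make(map[string]bool)
	for _, s := range ourIPs {
		if ip := net.ParseIP(s); ip != nil {
			ours[ip.String()] = true
		}
	}
	return &dnsChecker{ours: ours, lookup: lookup, ttl: ttl, cache: make(map[string]cachedResult)}
}

// Allowed reports whether domain currently points to us and only to us.
// The lock is held during the lookup for simplicity; a real service would
// likely avoid that.
func (c *dnsChecker) Allowed(domain string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if r, hit := c.cache[domain]; hit && time.Now().Before(r.expires) {
		return r.ok
	}
	ips, err := c.lookup(domain)
	ok := err == nil && len(ips) > 0
	for _, ip := range ips {
		if !c.ours[ip.String()] {
			ok = false
		}
	}
	c.cache[domain] = cachedResult{ok: ok, expires: time.Now().Add(c.ttl)}
	return ok
}

// staticLookup builds a fake resolver from a fixed table and counts calls.
func staticLookup(table map[string][]string, calls *int) lookupFunc {
	return func(host string) ([]net.IP, error) {
		*calls++
		addrs, found := table[host]
		if !found {
			return nil, errors.New("no such host")
		}
		ips := make([]net.IP, 0, len(addrs))
		for _, a := range addrs {
			ips = append(ips, net.ParseIP(a))
		}
		return ips, nil
	}
}

func main() {
	calls := 0
	lookup := staticLookup(map[string][]string{
		"example.com":   {"192.0.2.10"},
		"moved.example": {"198.51.100.7"},
	}, &calls)
	c := newDNSChecker([]string{"192.0.2.10"}, lookup, defaultTTL)
	fmt.Println("example.com allowed:", c.Allowed("example.com"))
	fmt.Println("moved.example allowed:", c.Allowed("moved.example"))
	fmt.Println("example.com allowed (cached):", c.Allowed("example.com"))
	fmt.Println("resolver calls:", calls)
}
```

Caching negative results as well as positive ones keeps repeated probes for a moved domain from triggering a fresh lookup each time, at the cost of a delay before a newly pointed domain is accepted.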

aaomidi commented 1 month ago

I think to simplify this, what @Zenexer is stating is that a third party actor can initiate a request to /.well-known/acme-challenge/$token - and caddy will hit the ask endpoint even if caddy knows that the challenge is bogus, or the challenge is real but caddy can't solve it (e.g, it was a token that caddy does not know about).

I believe for security reasons caddy already needs to keep track of the list of tokens it's generated through applying for ACME challenges. Checking for a string in a list of strings is probably significantly faster to do first, before sending a request to the ask endpoint (even if it's not recommended for the ask endpoint to take long, it's still a request to a third-party thing).

Would it make sense to flip the order of operations here? Caddy does the initial sanity check first (e.g. "I don't even know what this token is, trash it?")

I guess if you have a very slow storage driver, then maybe that operation will be slower... but maybe my naive view is that the storage driver is likely to be faster than the ask endpoint the vast majority of the time, and if the ask request returns a 200 then you're still going to need to hit the storage driver anyway.

Beyond that there's also the problem that ask is operating with far, far less information than caddy is. I suppose the ask request can hook into the same storage and check for the token being there or not. But for that to work, the token would also need to be part of the ask payload.

francislavoie commented 1 month ago

I believe for security reasons caddy already needs to keep track of the list of tokens it's generated through applying for ACME challenges.

It does, but that's kept in the storage. What Matt is saying is that in some setups, the storage lookup is more expensive than the ask lookup.

Caddy can be run in a cluster, so it must use the storage to see whether another Caddy instance initiated issuance. It can't rely on an in-memory cache.

The ask endpoint should only do a DB lookup, it should not have side effects. If it has side effects, it's incorrectly implemented.

Would it make sense to flip the order of operations here? Caddy does the initial sanity check first (e.g. "I don't even know what this token is, trash it?")

That's how it worked originally: until about a year ago, storage was checked first, but that was bad for some users.

Zenexer commented 1 month ago

The ask endpoint should only do a DB lookup, it should not have side effects. If it has side effects, it's incorrectly implemented.

That's impossible. It has to do DNS lookups at any sort of scale, despite the documentation. There's just no use case for on-demand TLS that doesn't necessitate frequently double-checking DNS. Calling it incorrect isn't helpful.

francislavoie commented 1 month ago

You should not be doing DNS checks. That doesn't make sense. It's not your ask endpoint's responsibility to do that, it's Caddy's. You should have an allow list of domains in your database that you compare against.

aaomidi commented 1 month ago

That's impossible. It has to do DNS lookups at any sort of scale

Not really, this fully depends on the use case defined here imo. In your use case, you're going to need to do a DNS lookup, but that's not necessarily universally applicable.

That's how it worked originally: until about a year ago, storage was checked first, but that was bad for some users.

Yeah that makes sense tbh. Computers suck.

Maybe the ability to choose at which stage ask is performed can be useful?

Zenexer commented 1 month ago

Not really, this fully depends on the use case defined here imo.

I'm having a hard time envisioning common use cases for on-demand TLS in which the operator of Caddy has exclusive ownership and control over all of the domain names pointed to it.

The obvious use case seems to be a hosting provider or integrator. They need to regularly verify that any domains provided by end users actually, truly point to them before requesting certificates. Caching that information for too long is risky, especially since DNS is prone to misconfiguration.

aaomidi commented 1 month ago

I'm having a hard time envisioning common use cases for on-demand TLS in which the operator of Caddy has exclusive ownership and control over all of the domain names pointed to it.

Company setting where they have one global load balancer, but a secondary system checks that a domain is owned by the company before being put into a given database.

This is different from your use case where you let arbitrary users point their domains to your infrastructure with no registration/signup process (e.g. you don't know what's even linked to you until you get a request!)

Both of these use cases should work IMO

francislavoie commented 1 month ago

@Zenexer What you should do is have your customers register their domain via your app's settings, and you add it to your DB allow list. Then all the ask endpoint does is compare against that list. That's how it's meant to work.
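As a rough illustration of the allow-list approach described here, a minimal ask endpoint might look like the sketch below. It relies on the documented behavior that Caddy passes the hostname in a domain query parameter and treats a 200 response as approval; everything else (the allowList type, the askHandler and askStatus names, the /ask path, the 403 for denials) is a hypothetical choice for this example, and the map stands in for whatever database the application uses.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
)

// allowList is a hypothetical in-memory stand-in for the application's
// database of registered customer domains.
type allowList map[string]bool

// askHandler answers 200 when the "domain" query parameter is registered
// and 403 otherwise. It has no side effects: it only performs a lookup.
func askHandler(allowed allowList) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		domain := strings.ToLower(strings.TrimSuffix(r.URL.Query().Get("domain"), "."))
		if domain != "" && allowed[domain] {
			w.WriteHeader(http.StatusOK)
			return
		}
		http.Error(w, "domain not allowed", http.StatusForbidden)
	})
}

// askStatus exercises the handler in-process and returns the status code.
func askStatus(h http.Handler, domain string) int {
	req := httptest.NewRequest(http.MethodGet, "/ask?domain="+url.QueryEscape(domain), nil)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := askHandler(allowList{"example.com": true})
	fmt.Println("example.com:", askStatus(h, "example.com"))
	fmt.Println("unknown.example:", askStatus(h, "unknown.example"))
}
```

To serve it for real, the handler could be mounted with http.Handle("/ask", askHandler(...)) and the resulting URL configured as the on-demand TLS ask endpoint.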

aaomidi commented 1 month ago

@Zenexer’s use case allows people to point their domains to their infrastructure without any registration, etc.

From what I’m understanding, there is no registration or direct user involvement. The only involvement is going to DNS and changing name servers or A records.

Zenexer commented 1 month ago

@Zenexer What you should do is have your customers register their domain via your app's settings, and you add it to your DB allow list. Then all the ask endpoint does is compare against that list. That's how it's meant to work.

That's the erroneous assumption that leads to issue 1. Hosting providers assume that just because a domain is registered with them, they will pass DV.

I don't make that assumption. Hosting providers often do, which is why we're here. Caddy sees ACME challenges that were started by other hosting providers who don't bother to check DNS before requesting a cert.

francislavoie commented 1 month ago

I don't understand. Your ask endpoint would reject it so Caddy wouldn't issue a cert. I don't see how what "hosting providers" do matters here.

Zenexer commented 1 month ago

I don't understand. Your ask endpoint would reject it so Caddy wouldn't issue a cert. I don't see how what "hosting providers" do matters here.

Hosting Provider A runs Caddy with on-demand TLS.
Hosting Provider B also runs Caddy with on-demand TLS.

1. Alice owns example.com.
2. Alice points example.com to Hosting Provider A.
3. Hosting Provider A tries to get a cert for example.com.
4. Hosting Provider A's ask endpoint returns 200.
5. Let's Encrypt issues a cert to Hosting Provider A.
6. Alice changes example.com's DNS to point to Hosting Provider B.
7. Alice doesn't tell Hosting Provider A that example.com has been moved.
8. Caddy at Hosting Provider A tries to renew its cert.
9. Hosting Provider A's ask endpoint sees example.com in its database and returns 200.
10. Caddy at Hosting Provider A goes to Let's Encrypt and starts the challenge process.
11. Let's Encrypt attempts to complete the HTTP-01 challenge by making a request to http://example.com/.well-known/acme-challenge/something.
12. Caddy at Hosting Provider B receives this challenge request.
13. Caddy at Hosting Provider B receives a 200 response from its ask endpoint.
14. Caddy at Hosting Provider B checks for challenge data in its storage, but finds nothing. It logs a warning.

This happens dozens of times per second. I'm Hosting Provider B in this scenario.

If I rely on my database, I become Hosting Provider A.

aaomidi commented 1 month ago

I don't understand. Your ask endpoint would reject it so Caddy wouldn't issue a cert. I don't see how what "hosting providers" do matters here.

There is no external database to check in this circumstance. There is no UI. There is no app.

There is a page with instructions: if you want to park your domain with ${whatever}, please point your A and AAAA records to ${whatever}.

The database is DNS.

The answer here may be that this is not a supported use case of caddy, but imo this can very easily be a supported use case with slight modifications.

mholt commented 1 month ago

Just catching up after feeding the baby and running some kids around... sorry!

I've been drafting this reply while several new replies have come in -- I wish GitHub would show that someone was replying or at least show the new replies. I feel like the conversation went off-track and got confused by some things, but maybe I did instead. In any case, here's my attempt to bring it back:

@Zenexer

Thanks for clarifying above. It seems to me you have an extraordinary situation that is not common from what I know of existing large-scale CertMagic deployments. That's not bad, just something that is worthy of discussion/understanding.

Scenario 1: I control example.com. Large hosting provider used to control example.com, and they used to have certificates for it. They keep trying to renew those certificates.

So, in this case, the hosting provider would fail their own ACME challenge, but your server would likely get pinged with a TLS handshake or HTTP request in an attempt to solve the challenge.

The HTTP-01 challenge does not use TLS, so those would not issue a certificate. You'd see junk in your logs, but what's new (it's the Internet).

The TLS-ALPN-01 challenge does use TLS, but with a special ALPN value. When it sees a handshake of this sort, it only follows a special code path that serves the challenge solution certificate (if it doesn't find one it just returns an error and aborts the handshake).

Neither case will initiate an ACME challenge that you end up failing and getting rate limited for. (If they do, that's a bug I'd like more details on, likely in a separate issue.)

Scenario 2: I control example.com. Someone runs Acunetix against example.com. For whatever reason, it starts brute forcing random paths with a prefix of /.well-known/acme-challenge/--for example, /.well-known/acme-challenge/xmlrpc.php. I want certificates for example.com, so the ask endpoint returns 200.

@aaomidi

I think to simplify this, what @Zenexer is stating is that a third party actor can initiate a request to /.well-known/acme-challenge/$token - and caddy will hit the ask endpoint even if caddy knows that the challenge is bogus, or the challenge is real but caddy can't solve it (e.g, it was a token that caddy does not know about).

The ask endpoint is only invoked if the client tries to establish a TLS handshake using a domain name it does not have a certificate for (and isn't itself a challenge handshake). But this endpoint is for the HTTP-01 challenge, which is HTTP-only. Are the servers accessing this plaintext endpoint over HTTPS? That would be the only way this is possible, but is extremely broken / in violation of spec.

Zenexer commented 1 month ago

The ask endpoint is only invoked if the client tries to establish a TLS handshake using a domain name it does not have a certificate for (and isn't itself a challenge handshake).

That seems to run contrary to what was said elsewhere in this thread: it seems that the ask endpoint is being hit prior to checking storage for the existence of a certificate. If that's not the case, there might still be a concurrency issue somewhere that needs debugging.

mholt commented 1 month ago

Catching up on the flood of new replies while I was typing... and also revisiting some of the earlier replies...

@Zenexer

The first issue is far more problematic: I can't stop other hosting providers from requesting certs, and they're wasting PKI resources.

This should only be negatively affecting them (and I guess the CAs, which is why they rate limit them). If they initiate an ACME challenge for a domain pointed to you, you will see junk log entries at most. Or what wasted PKI resources are you referring to?

I do. However, I'm running a free service to which non-technical users can point their domains. They might use my service for a while, decide they don't like it, and point their domain elsewhere. I can't trust that the users are going to maintain a list of domains that point to me, so I do have to validate the A/AAAA records before requesting a certificate.

I should have mentioned that this is what I was referring to that is a bit unconventional. Typically, users sign up with their domain name(s) they are going to use, then your service knows what the domains are. If setting DNS records is itself the act of "signing up" with your service, then it makes sense that you have to check DNS records. This is the only time I have heard of this being the case... and for a few reasons (complexity, reliability, etc) I don't recommend this... unless, perhaps, you do like what our Caddy homepage demo does, where a user can point a specific subdomain ("caddydemo" in our case) to your IP, and then no signup is required. This still prevents abuse because it only works for one specific identifier per registered domain, hence there's a relatively significant cost barrier to abuse it.

I don't make that assumption. Hosting providers often do, which is why we're here. Caddy sees ACME challenges that were started by other hosting providers who don't bother to check DNS before requesting a cert.

To clarify once more, if someone else initiates an ACME challenge for a hostname that fails repeatedly, that doesn't rate limit you. It only rate limits their ACME account, not yours.

@aaomidi (and of course @francislavoie) Thanks for chiming into this discussion. Your feedback and perspectives are much appreciated!!

@aaomidi

I believe for security reasons caddy already needs to keep track of the list of tokens it's generated through applying for ACME challenges. Checking for a string in a list of strings is probably significantly faster to do first, before sending a request to the ask endpoint (even if it's not recommended for the ask endpoint to take long, it's still a request to a third-party thing).

Would it make sense to flip the order of operations here? Caddy does the initial sanity check first (e.g. "I don't even know what this token is, trash it?")

When CertMagic initiates an ACME challenge, it puts the challenge info in storage so that any other instances in the cluster can solve the challenge (as opposed to just in process memory). So when a challenge request comes in, we don't know whether it's junk until we access storage.

Beyond that there's also the problem that ask is operating with far, far less information than caddy is. I suppose the ask request can hook into the same storage and check for the token being there or not. But for that to work, the token would also need to be part of the ask payload.

This is true, and we could add more info here; however, I've tried to keep the abstraction as pure as possible with the understanding that it shouldn't matter: is this identifier known to / allowed by the server or not? In theory, I don't think other information should matter. But I'm open to exploring this more.

@Zenexer

The ask endpoint is only invoked if the client tries to establish a TLS handshake using a domain name it does not have a certificate for (and isn't itself a challenge handshake).

That seems to run contrary to what was said elsewhere in this thread: it seems that the ask endpoint is being hit prior to checking storage for the existence of a certificate. If that's not the case, there might still be a concurrency issue somewhere that needs debugging.

Ah, I think I see where the confusion is. What I said here is true, and doesn't preclude what I said earlier. 'Ask' is consulted before checking to see if storage has a certificate to satisfy the TLS handshake (I'm talking about normal handshakes here, not TLS-ALPN-01 challenge handshakes which use only a very specific, minimal code path). If 'ask' returns 200, storage is checked. If storage returns no cert, then an ACME challenge is initiated to obtain one. Does that make sense?
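The order described in this reply (ask first, then storage, then ACME) can be sketched as a small decision function. This is a simplified model for illustration only, not CertMagic's actual code: askFunc, storageFunc, onDemandDecision and the outcome values are all hypothetical names, and it covers only a normal TLS handshake, not TLS-ALPN-01 challenge handshakes or plaintext HTTP requests.

```go
package main

import "fmt"

// Hypothetical stand-ins for the two checks; neither is a CertMagic API.
type (
	askFunc     func(name string) bool // true when the ask endpoint returns 200
	storageFunc func(name string) bool // true when storage holds a certificate
)

type outcome string

const (
	rejected outcome = "reject handshake"
	loaded   outcome = "serve certificate from storage"
	obtain   outcome = "start ACME challenge to obtain certificate"
)

// onDemandDecision models the order described in the thread: 'ask' is
// consulted first, storage is checked only if 'ask' approves, and an ACME
// challenge starts only if storage has no certificate.
func onDemandDecision(name string, ask askFunc, inStorage storageFunc) outcome {
	if !ask(name) {
		return rejected
	}
	if inStorage(name) {
		return loaded
	}
	return obtain
}

func main() {
	storageCalls := 0
	ask := func(name string) bool { return name != "denied.example" }
	storage := func(name string) bool {
		storageCalls++
		return name == "cached.example"
	}
	for _, name := range []string{"denied.example", "cached.example", "new.example"} {
		fmt.Printf("%s: %s\n", name, onDemandDecision(name, ask, storage))
	}
	fmt.Println("storage lookups:", storageCalls)
}
```

The counter in main shows the point of the ordering: a name that 'ask' denies never reaches storage, which matters when the storage backend is the more expensive lookup.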

mholt commented 1 month ago

@Zenexer Sorry, one more thing:

10. Caddy at Hosting Provider A goes to Let's Encrypt and starts the challenge process.

11. Let's Encrypt attempts to complete the HTTP-01 challenge by making a request to http://example.com/.well-known/acme-challenge/something.

12. Caddy at Hosting Provider B receives this challenge request.

13. Caddy at Hosting Provider B receives a 200 response from its ask endpoint.

Step 13 is not (or should not be) the case. A request over HTTP does not invoke the ask endpoint because no certificate is required because there is no TLS handshake to complete. In other words, this part does not invoke or utilize on-demand TLS at all. If a plaintext HTTP request is in fact invoking on-demand TLS / your 'ask' endpoint, then there is a bug and we should open a new issue to discuss.

So, if my understanding is correct, your main concern is spammy log entries?

Zenexer commented 1 month ago

This should only be negatively affecting them (and I guess the CAs, which is why they rate limit them). If they initiate an ACME challenge for a domain pointed to you, you will see junk log entries at most. Or what wasted PKI resources are you referring to?

To be clear, the only reason I'm bringing up any of this is because it's made debugging this concurrency issue difficult. If everything worked perfectly as described all the time, it would just be a matter of ignoring the logs--the same as I was already doing.

It's only a problem because I can't tell which log entries and ask requests are a result of my Caddy instances trying to get certs (or encountering concurrency issues), versus which log entries are from someone else's lazy misuse of ACME.

By "wasting PKI resources," I'm referring to the fact that DV isn't free, especially at scale. There's an expectation that integrators should make a reasonable effort to ensure a domain actually points to them (and that an ACME challenge will succeed) prior to trying to get a cert.

I should have mentioned that this is what I was referring to that is a bit unconventional. Typically, users sign up with their domain name(s) they are going to use, then your service knows what the domains are. If setting DNS records is itself the act of "signing up" with your service, then it makes sense that you have to check DNS records. This is the only time I have heard of this being the case... and for a few reasons (complexity, reliability, etc) I don't recommend this... unless, perhaps, you do like what our Caddy homepage demo does, where a user can point a specific subdomain ("caddydemo" in our case) to your IP, and then no signup is required. This still prevents abuse because it only works for one specific identifier per registered domain, hence there's a relatively significant cost barrier to abuse it.

I think we're getting a bit caught up with my specific use case, which isn't too relevant here. Let's use the example I provided in a recent comment with two hosting providers.

To clarify once more, if someone else initiates an ACME challenge for a hostname that fails repeatedly, that doesn't rate limit you. It only rate limits their ACME account, not yours.

Correct. I'm not being rate limited and am not proposing that as a concern.

If I follow the instructions in the docs and stop performing DNS checks, then I become Hosting Provider A in my example, and I will get rate limited.

Ah, I think I see where the confusion is. What I said here is true, and doesn't preclude what I said earlier. 'Ask' is consulted before checking to see if storage has a certificate to satisfy the TLS handshake (I'm talking about normal handshakes here, not TLS-ALPN-01 challenge handshakes which use only a very specific, minimal code path). If 'ask' returns 200, storage is checked. If storage returns no cert, then an ACME challenge is initiated to obtain one. Does that make sense?

Yes, that was my initial impression. That design strongly disincentivizes ask endpoints from performing sufficient checks. That leads to an ecosystem problem as Caddy becomes more popular.

It invites Caddy users to put themselves in the position of Hosting Provider A above.

francislavoie commented 1 month ago

@aaomidi

There is no external database to check in this circumstance. There is no UI. There is no app. The database is DNS.

I just want to reiterate: this is not okay to do. The problem is that any bad actor can point a wildcard domain at your server and then make an endless stream of requests (a.badguy.com, then b.badguy.com, and so on), making your server issue a cert per domain and resulting in a DDoS (rate limits getting hit, or storage exhaustion). That's why you must have an allow-list: you cannot trust DNS because it's not under your control.
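A minimal sketch of the allow-list decision an ask endpoint could make, assuming the hostname arrives in a `domain` query parameter (as in the Caddyfile example later in this thread). `ALLOWED_DOMAINS` and `ask_status` are hypothetical names; a real service would look the name up in its own customer database rather than a hard-coded set.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical allow-list; in practice this would come from your own records.
ALLOWED_DOMAINS = frozenset({"known.example"})


def ask_status(request_target: str, allowed=ALLOWED_DOMAINS) -> int:
    """Return the HTTP status an ask endpoint would send for this request."""
    query = parse_qs(urlparse(request_target).query)
    domains = query.get("domain", [])
    # Exactly one domain parameter is expected; anything else is rejected.
    if len(domains) != 1:
        return 400
    # Normalize case and a trailing dot before the membership check.
    name = domains[0].lower().rstrip(".")
    return 200 if name in allowed else 400
```

Because the check is a set lookup against names the operator already knows, a.badguy.com and b.badguy.com are both rejected no matter where their DNS points.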

@Zenexer

By "wasting PKI resources," I'm referring to the fact that DV isn't free, especially at scale. There's an expectation that integrators should make a reasonable effort to ensure a domain actually points to them (and that an ACME challenge will succeed) prior to trying to get a cert.

Caddy's internal rate limiting should prevent that from being a concern. It slows itself down if too many attempts are happening.

If I follow the instructions in the docs and stop performing DNS checks, then I become Hosting Provider A in my example, and I will get rate limited.

No, because only a TLS handshake that actually reaches your server (meaning the DNS is already configured correctly) can trigger On-Demand TLS. Any domain which doesn't point to your server will not renew when the time comes because only a TLS handshake would trigger a renewal attempt.

I've never seen any evidence of the use case you're describing being a real issue.

Zenexer commented 1 month ago

Step 13 is not (or should not be) the case. A request over HTTP does not invoke the ask endpoint because no certificate is required because there is no TLS handshake to complete. In other words, this part does not invoke or utilize on-demand TLS at all.

I don't understand. Caddy will check storage in this situation without querying the ask endpoint?

There's a misunderstanding or bug here somewhere, because if I go to http://example.com/.well-known/acme-challenge/foobar in this scenario, the error logs indicate that it's doing something.

Zenexer commented 1 month ago

No, because only a TLS handshake that actually reaches your server (meaning the DNS is already configured correctly) can trigger On-Demand TLS. Any domain which doesn't point to your server will not renew when the time comes because only a TLS handshake would trigger a renewal attempt.

Customers misconfigure DNS in such a way that it will hit both hosting providers all the time (e.g., adding Hosting Provider B's nameservers without removing A's). There are also a plethora of automated services out there that will continue making requests to the old IP address, because that's just what they do. Most are security and OSINT companies. Some are malicious actors.

In any case, Hosting Provider A is still going to see TLS handshakes for a long time to come.

mholt commented 1 month ago

@Zenexer

To be clear, the only reason I'm bringing up any of this is because it's made debugging this concurrency issue difficult. If everything worked perfectly as described all the time, it would just be a matter of ignoring the logs--the same as I was already doing.

That's understandable. There's a lot of moving parts (under the hood, at least) that we have to wade through.

It's only a problem because I can't tell which log entries and ask requests are a result of my Caddy instances trying to get certs (or encountering concurrency issues), versus which log entries are from someone else's lazy misuse of ACME.

CertMagic emits logs when it -- and not a third party -- initiates an ACME challenge. Are you referring to log entries for HTTP challenge resources specifically? How are the logs ambiguous? If CertMagic didn't initiate the ACME challenge, you won't see log entries indicating that it is trying to solve a challenge. So if you see other logs for ACME challenges but no "trying to solve challenge" logs, then it's a third party.

By "wasting PKI resources," I'm referring to the fact that DV isn't free, especially at scale. There's an expectation that integrators should make a reasonable effort to ensure a domain actually points to them (and that an ACME challenge will succeed) prior to trying to get a cert.

I see... I suppose you might be using a loose definition of the term "PKI" in that you're talking about CA resources in general, because a failed ACME challenge doesn't allocate any PKI resources (at least, not publicly -- the website still generates a CSR I guess): no certificates, no precertificates, no CT logs, no revocations/CRLs, no OCSP staples, etc.

We've discussed this a lot in the past, including with Let's Encrypt themselves, and with their community, and the overall consensus seems to be that because DNS is a matter of perspective, doing the lookup yourself is seldom helpful. And we've seen from experience (with the lego project, before we diverted from it) that doing DNS lookups oneself got in the way more often than it helped.

I think we're getting a bit caught up with my specific use case, which isn't too relevant here. Let's use the example I provided in a recent comment with two hosting providers.

Fair; but I still care about helping you make your service functional. (We may need to discuss a sponsorship to really dig in and figure out how we can improve the situation for your infrastructure, as it's not a common case at all.) That said, the example with steps 1-14 doesn't seem quite right to me (see last comment).

Correct. I'm not being rate limited and am not proposing that as a concern.

If I follow the instructions in the docs and stop performing DNS checks, then I become Hosting Provider A in my example, and I will get rate limited.

Ah, I see. In that case, when a domain is pointed away from a server but the host still thinks it serves that domain, TLS handshakes naturally stop coming in because DNS lookups resolve somewhere else; and CertMagic's On-Demand TLS lets the certificate expire and then it gets deleted. It does not keep trying to renew the cert -- that only happens if TLS handshakes keep coming in (and ask returns 200).

Yes, that was my initial impression. That design strongly disincentivizes ask endpoints from performing sufficient checks. That leads to an ecosystem problem as Caddy becomes more popular.

It invites Caddy users to put themselves in the position of Hosting Provider A above.

Okay, I think I see what you mean with this. But the implementation of On-Demand TLS should account for this as it naturally lets certs expire that are no longer pointing clients at it.

Step 13 is not (or should not be) the case. A request over HTTP does not invoke the ask endpoint because no certificate is required because there is no TLS handshake to complete. In other words, this part does not invoke or utilize on-demand TLS at all.

I don't understand. Caddy will check storage in this situation without querying the ask endpoint?

It checks storage for a challenge token, not for a certificate. This flow has nothing to do with On-Demand TLS.

There's a misunderstanding or bug here somewhere, because if I go to http://example.com/.well-known/acme-challenge/foobar in this scenario, the error logs indicate that it's doing something.

Well, of course it'll do something 😃 What are the full logs related to such a request? It should be looking up challenge info to see if it has to solve it, and it should find that there isn't any, and respond to the request.

Customers misconfigure DNS in such a way that it will hit both hosting providers all the time (e.g., adding Hosting Provider B's nameservers without removing A's). There are also a plethora of automated services out there that will continue making requests to the old IP address, because that's just what they do. Most are security and OSINT companies. Some are malicious actors.

In any case, Hosting Provider A is still going to see TLS handshakes for a long time to come.

And this problem is orthogonal to Caddy/CertMagic, and it exists across the Web no matter what server or cert automation you're using. Ultimately the CA has to decide whether the DNS configuration authorizes a certificate for the server, mess or no mess.

Zenexer commented 1 month ago

We've discussed this a lot in the past, including with Let's Encrypt themselves, and with their community, and the overall consensus seems to be that because DNS is a matter of perspective, doing the lookup yourself is seldom helpful. And we've seen from experience (with the lego project, before we diverted from it) that doing DNS lookups oneself got in the way more often than it helped.

If I'm not mistaken, Let's Encrypt's rate limiting is much stricter for requests that fail DV--that is, you're likely to run into rate limiting if you're requesting a lot of certs and not doing DNS checks on your own.

It checks storage for a challenge token, not for a certificate. This flow has nothing to do with On-Demand TLS.

I've since confirmed this: the ask endpoint won't receive a request.

I think a self-contained example would be helpful here. It should go without saying, but to anyone coming across this discussion: this example does not use best practices; it's intended to demonstrate a particular issue with minimal code. Do not use this in production.

Caddyfile

```
{
    email invalid@invalid.invalid
    acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
    default_sni invalid
    on_demand_tls {
        ask http://127.0.0.1:8002/ask
    }
}

http://127.0.0.1:8002 {
    log
    route /ask {
        @known query domain=known.example
        respond @known 200
        respond 400
    }
}

http:// {
    log
}

https:// {
    log
    tls {
        on_demand
    }
}
```

docker-compose.yml

```yaml
services:
  caddy:
    image: "library/caddy:latest"
    ports:
      - "127.0.0.1:8080:80"
      - "127.0.0.1:8443:443"
      - "127.0.0.1:8443:443/udp"
    volumes:
      - "./Caddyfile:/etc/caddy/Caddyfile"
```

In my use case, I'm getting a lot of requests to /.well-known/acme-challenge/* from various CAs and vulnerability scanners. Neither Caddy nor anything else in my infrastructure was trying to get a certificate, so Caddy doesn't have any challenge data.

Command:

% curl -sv --connect-to unknown.example:80:127.0.0.1:8080 http://unknown.example/.well-known/acme-challenge/foobar
* Connecting to hostname: 127.0.0.1
* Connecting to port: 8080
*   Trying 127.0.0.1:8080...
* Connected to 127.0.0.1 (127.0.0.1) port 8080
> GET /.well-known/acme-challenge/foobar HTTP/1.1
> Host: unknown.example
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 308 Permanent Redirect
< Connection: close
< Location: https://unknown.example/.well-known/acme-challenge/foobar
< Server: Caddy
< Date: Sat, 28 Sep 2024 16:15:44 GMT
< Content-Length: 0
<
* Closing connection

Caddy logs:

% docker compose logs -fn0 caddy
caddy-1  | {"level":"warn","ts":1727540193.833114,"logger":"http","msg":"looking up info for HTTP challenge","host":"unknown.example","remote_addr":"172.19.0.1:56906","user_agent":"curl/8.7.1","error":"no information found to solve challenge for identifier: unknown.example"}
caddy-1  | {"level":"info","ts":1727540193.8339462,"logger":"http.log.access","msg":"handled request","request":{"remote_ip":"172.19.0.1","remote_port":"56906","client_ip":"172.19.0.1","proto":"HTTP/1.1","method":"GET","host":"unknown.example","uri":"/.well-known/acme-challenge/foobar","headers":{"User-Agent":["curl/8.7.1"],"Accept":["*/*"]}},"bytes_read":0,"user_id":"","duration":0.002686333,"size":0,"status":308,"resp_headers":{"Server":["Caddy"],"Connection":["close"],"Location":["https://unknown.example/.well-known/acme-challenge/foobar"],"Content-Type":[]}}

For comparison, curl -sv --connect-to unknown.example:443:127.0.0.1:8443 https://unknown.example/ will show a query to the ask endpoint, as will curl -sv --connect-to known.example:443:127.0.0.1:8443 https://known.example/. The former will end with a TLS alert, while the latter will try to get a cert from LE's staging endpoint.

The most notable log message here is "error":"no information found to solve challenge for identifier: unknown.example". That's emitted even without the log directive. While trying to debug #303, I didn't really have any way of determining whether these were the result of concurrency issues, storage issues, or third-parties making bogus requests to /.well-known/acme-challenge/. It turned out to be mostly the latter.
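One rough way to separate the two cases is to apply mholt's rule of thumb from earlier in the thread: a challenge lookup for a host with no matching "trying to solve challenge" entry probably came from a third party. The sketch below assumes JSON log lines like the ones shown above. The `host` field is taken from that output; the `identifier` field on the "trying to solve challenge" entry is an assumption, as is the function name.

```python
import json
from collections import defaultdict

LOOKUP_MSG = "looking up info for HTTP challenge"  # as seen in the log above
SOLVE_MSG = "trying to solve challenge"            # message quoted in this thread


def likely_third_party(lines):
    """Map host -> number of challenge lookups with no local ACME attempt."""
    lookups = defaultdict(int)
    initiated = set()
    for line in lines:
        # Skip any prefix such as "caddy-1  | " added by docker compose.
        _, brace, rest = line.partition("{")
        if not brace:
            continue
        try:
            entry = json.loads(brace + rest)
        except json.JSONDecodeError:
            continue
        msg = entry.get("msg")
        if msg == LOOKUP_MSG and "host" in entry:
            lookups[entry["host"]] += 1
        elif msg == SOLVE_MSG and "identifier" in entry:
            # Assumed field name for the hostname being validated.
            initiated.add(entry["identifier"])
    return {host: n for host, n in lookups.items() if host not in initiated}
```

Hosts that appear in both groups are left out of the result, since those lookups may belong to a legitimate attempt that could not see its challenge token yet.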

As you pointed out, ask isn't consulted in this scenario--on-demand TLS doesn't come into play. I wasn't certain of that at the time.

In my case, there was a combination of issues:

1. My underlying storage was flaky. I was using NFS, which should be fine here, but I had incorrect mount options that were preventing locking from working correctly.
2. Because storage was flaky, a small percentage of these error messages were the result of legitimate ACME attempts, but Caddy just couldn't see the challenge token in storage yet.
3. Even with that resolved, #303 meant that locking wasn't working correctly.
4. Even with the fix for #303 applied, there was a lot of corrupt metadata JSON still sitting around.
5. Caddy was sending a lot of requests to the ask endpoint because certificates weren't being synchronized. There seemed to be a correlation between the no information found to solve challenge error messages and the ask endpoint requests. This was probably a coincidence.

I knew there were likely multiple issues, which is why debugging was so complicated.

I do think the PR here resolves #303, although it doesn't fix the already-broken metadata. My only complaint was that the design decisions surrounding ask made it difficult to gain introspection into Caddy's logic and determine what was actually happening. In an ideal world without any bugs, the behavior wouldn't really matter.
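For the already-broken metadata, a scan along these lines might help locate the affected files. It is a sketch, not a CertMagic tool: it walks a directory you pass in, tries to parse every `.json` file, and separates files with bytes after the first complete JSON value (the "after top-level value" symptom from the original report) from files that do not parse at all. It only reports and never modifies anything; the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict


def classify_json(text: str) -> str:
    """Return 'ok', 'trailing-data', or 'invalid' for one file's contents."""
    stripped = text.lstrip()
    try:
        # raw_decode parses one JSON value and reports where it ended.
        _, end = json.JSONDecoder().raw_decode(stripped)
    except json.JSONDecodeError:
        return "invalid"
    # Anything other than whitespace after the first value is corruption.
    return "ok" if stripped[end:].strip() == "" else "trailing-data"


def find_corrupt(root: str) -> Dict[str, str]:
    """Map each problematic .json file under root to its classification."""
    problems = {}
    for path in sorted(Path(root).rglob("*.json")):
        try:
            text = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError):
            problems[str(path)] = "invalid"
            continue
        status = classify_json(text)
        if status != "ok":
            problems[str(path)] = status
    return problems
```

Whether a "trailing-data" file can be repaired by keeping only the first value depends on that value being complete and current, so review the reported files before changing any of them.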

francislavoie commented 1 month ago

If I'm not mistaken, Let's Encrypt's rate limiting is much stricter for requests that fail DV--that is, you're likely to run into rate limiting if you're requesting a lot of certs and not doing DNS checks on your own.

Caddy was designed such that it shouldn't be an issue in practice, both because of the internal rate limiting, and the fallback to staging on first failure attempt. See https://caddyserver.com/docs/automatic-https#errors

I've since confirmed this: the ask endpoint won't receive a request. [...]

Thanks for the detailed writeup! I think that aligns with our view of it as well. I agree changes could probably be made for better introspection. What would you think would help in terms of logging etc?

bracketforward commented 1 month ago

Interesting discussion. I'd like to better understand the consequences of a world full of providers like Hosting Provider A, whereby they only check that their user added the domain and disregard misconfigured DNS, which it sounds like is the vast majority of Caddy users?

Mixing name servers of various providers ("multi-provider DNS"), both accidentally and intentionally, is common enough to warrant a solution for it; the data on its prevalence can be observed across registry zone files, or aggregation tools like DomainTools.com, DNS.Coffee, etc.

Let's assume most of those multi-provider occurrences are by users who have added the domains to each hosting provider that they use (e.g., Wix and Shopify). Perhaps they want to A/B test the two providers without realizing this is the wrong way to do it.

In this example, both Wix and Shopify should fail DV, but they won't realize that if all they're checking is whether a user added the domain to an account with them.

Not a big deal if it's rare, but it's not that rare.

Ultimately the CA has to decide whether the DNS configuration authorizes a certificate for the server, mess or no mess.

@mholt, at what point does the CA decide that too many are failing and that the provider should be taking proactive steps to reduce those failures? What are the consequences of that CA's decision on hosting providers?

Caddy was designed such that it shouldn't be an issue in practice, both because of the internal rate limiting, and the fallback to staging on first failure attempt. See https://caddyserver.com/docs/automatic-https#errors

@francislavoie, what if there are thousands of domains with this issue being visited daily? That's entirely plausible for a medium-sized provider.

Zenexer commented 1 month ago

What would you think would help in terms of logging etc?

Changing the log level for no information found to solve challenge for identifier to info or debug would be my best recommendation. Adding it to the docs would also help, since it's probably not clear what it means to anyone who isn't familiar with ACME.

I don't think it makes sense to query the ask endpoint prior to checking storage. Requests to /.well-known/acme-challenge/* go directly to storage anyway, so it has to be fast.

mholt commented 1 month ago

Catching up after the weekend...

If I'm not mistaken, Let's Encrypt's rate limiting is much stricter for requests that fail DV--that is, you're likely to run into rate limiting if you're requesting a lot of certs and not doing DNS checks on your own.

Somewhat. From Let's Encrypt Rate Limits:

There is a Failed Validation limit of 5 failures per account, per hostname, per hour.

Note that it's per account, per hostname, (and per hour). So, kind of, but if one person's domain fails to verify, that won't block others.
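To make the scoping concrete, here is a toy model of a limit keyed by account and hostname. It uses the 5-per-hour figure quoted above, but the sliding-window mechanics and the class name are assumptions for illustration, not a description of how Let's Encrypt implements the limit.

```python
from collections import defaultdict, deque


class FailedValidationLimit:
    """Toy sliding-window counter keyed by (account, hostname)."""

    def __init__(self, limit=5, window_seconds=3600):
        self.limit = limit
        self.window = window_seconds
        self._failures = defaultdict(deque)

    def record_failure(self, account, hostname, now):
        # Each (account, hostname) pair has its own independent history.
        self._failures[(account, hostname)].append(now)

    def is_blocked(self, account, hostname, now):
        history = self._failures[(account, hostname)]
        # Forget failures that have aged out of the window.
        while history and now - history[0] >= self.window:
            history.popleft()
        return len(history) >= self.limit
```

In this model, five failures by one account for one hostname block only that pair; another account requesting the same hostname, or the same account requesting a different hostname, is unaffected.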

I've since confirmed this: the ask endpoint won't receive a request. ... I knew there were likely multiple issues, which is why debugging was so complicated. I do think the PR here resolves https://github.com/caddyserver/certmagic/issues/303, although it doesn't fix the already-broken metadata. My only complaint was that the design decisions surrounding ask made it difficult to gain introspection into Caddy's logic and determine what was actually happening. In an ideal world without any bugs, the behavior wouldn't really matter.

Thanks so much for investigating this! And thanks for the follow-up. I'll see if we can make things clearer.

@bracketforward Good questions.

at what point does the CA decide that too many are failing and that the provider should be taking proactive steps to reduce those failures? What are the consequences of that CA's decision on hosting providers?

I think Let's Encrypt's rate limits are a good example of this. They have many rate limits, and if you reach them, you could take steps to reduce those failures. On the other hand, they exist to prevent excessive use of resources, so hitting them isn't "bad" per se, especially when the domain is truly misconfigured (i.e. trying again won't help). The only time it's not good is when it's a "false positive" like if DNS records have been set but haven't propagated yet (it's not really a "false positive" either, I just don't know a better term).

I've seen LE staff reach out to integrators that are having notable difficulty. But it's always been reasonable IMO.

My browser is crashing so I'll brb to finish this.

what if there are thousands of domains with this issue being visited daily? That's entirely plausible for a medium-sized provider.

The internal rate limits are pretty generous now, I think it's 1 per second.

The main thing is the CA sets their rate limits, we just want to avoid slamming CAs with thousands of requests per minute.

@Zenexer

Changing the log level for no information found to solve challenge for identifier to info or debug would be my best recommendation. Adding it to the docs would also help, since it's probably not clear what it means to anyone who isn't familiar with ACME.

That's not a bad idea. Maybe DEBUG.

I don't think it makes sense to query the ask endpoint prior to checking storage. Requests to /.well-known/acme-challenge/* go directly to storage anyway, so it has to be fast.

We have specifically had requests -- from large integrators -- to gate this behind the ask endpoint. It is fast, but it's expensive. The well-known lookups are a good point, though I haven't had complaints about that yet (other than, in a way, this thread, I suppose).

francislavoie commented 1 month ago

The internal rate limits are pretty generous now, I think it's 1 per second.

To be exact, 10 per 10 seconds, but effectively the same.
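The two figures give the same long-run average but differ in how large a burst they admit. The sketch below shows this with a generic sliding-window counter. It is not CertMagic's limiter, which may wait rather than reject, and `count_allowed` is a hypothetical helper.

```python
from collections import deque


def count_allowed(timestamps, limit, window):
    """Count how many events a sliding-window limiter would admit."""
    admitted = deque()
    allowed = 0
    for t in timestamps:
        # Drop admitted events that have left the window.
        while admitted and t - admitted[0] >= window:
            admitted.popleft()
        if len(admitted) < limit:
            admitted.append(t)
            allowed += 1
    return allowed


burst = [0.0] * 10                      # ten attempts at the same instant
steady = [float(i) for i in range(10)]  # one attempt per second

print(count_allowed(burst, limit=1, window=1))     # 1 per second
print(count_allowed(burst, limit=10, window=10))   # 10 per 10 seconds
print(count_allowed(steady, limit=1, window=1))
print(count_allowed(steady, limit=10, window=10))
```

For the steady stream both settings admit all ten attempts; for the burst, the 10-per-10-seconds window admits all ten while 1-per-second admits only the first.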
