desec-io / desec-stack

Backbone of the deSEC Free Secure DNS Hosting Service
https://desec.io/
MIT License
380 stars, 48 forks

Inconsistent DNS query results #869

Closed Atemu closed 7 months ago

Atemu commented 8 months ago

Today, I just cannot complete DNS-01 ACME challenges reliably. The challenge always fails with NXDOMAIN on the challenge domain, even though lego waited until the challenge record showed up in its own DNS queries. I suspect this might be related to replication?

I ran while true; do dig @1.1.1.1 _acme-challenge.... TXT ; sleep 1 ; done on the side and noticed something really odd:

  1. When the challenge record appears, it's usually gone again by the next second's query
  2. These short appearances can occur even while lego is still querying (it hasn't yet received a response in which the record is present)
  3. The challenge record can reappear even minutes after lego stopped the challenge (again just for one query, gone the next second). I have typed out this whole report since the last challenge ran, and I'm still sometimes getting ACME challenge records back every couple dozen seconds

Something's not right here..
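One way to narrow this down is to take the public resolver out of the picture. The loop above asks 1.1.1.1, which may cache answers and may reach a different authoritative server on each cache miss. The following is a minimal sketch, not a tested diagnostic tool, that asks each of the zone's nameservers directly instead. The function names check_record_on and poll_once and the example.com names in the usage line are placeholders made up for illustration. If the nameservers are anycast, a direct query only reaches whichever instance is closest, so this cannot show the state of every secondary.

```shell
#!/bin/sh
# Sketch: ask each authoritative nameserver for a TXT record directly,
# bypassing recursive resolvers and their caches.
# check_record_on and poll_once are hypothetical helper names.

# check_record_on NS NAME -> prints "present" or "absent"
check_record_on() {
    # +short prints only record data; +norecurse asks the server for its own data.
    # dig prints TXT data in double quotes, so a line starting with '"' means
    # the record was returned. Error lines (starting with ';') do not count.
    if dig +short +norecurse "@$1" "$2" TXT | grep -q '^"'; then
        echo present
    else
        echo absent
    fi
}

# poll_once ZONE NAME -> one "time nameserver status" line per nameserver
poll_once() {
    for ns in $(dig +short NS "$1"); do
        printf '%s %s %s\n' "$(date +%T)" "$ns" "$(check_record_on "$ns" "$2")"
    done
}
```

It could be run the same way as the original loop, for example: while true; do poll_once example.com _acme-challenge.example.com; sleep 1; done. If one nameserver reports "present" while another reports "absent" for a longer stretch, that would point toward replication rather than resolver caching.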

peterthomassen commented 8 months ago

Indeed, we have been experiencing replication issues related to instabilities of the nameserver software we're using on our secondary servers (context: https://talk.desec.io/t/ns1-desec-io-replication-issues/804/6).

We have identified a solution, which is running in test mode on ams-1.a.desec.io (IPv4 only). Feel free to do tests with this nameserver and report back here.

We are expecting to deploy this into production incrementally, starting tomorrow.

We're very sorry for this!

Atemu commented 8 months ago

Thanks for the quick reply! Good to know I'm not going insane and there's an actual issue.

I'm not sure I can really test this, given that the failure occurs on Let's Encrypt's side, and I assume they query the SOA.

(I'm somehow starting to question whether replication is solving more issues than it's causing...)

Atemu commented 7 months ago

Is there a nameserver I could poll that will always be the last one the record propagates to, so that I can ensure ACME will get NOERROR on the challenge record when it queries the SOA?

peterthomassen commented 7 months ago

Nopes, there is no such logic, unfortunately. The order in which secondary servers pull updates is not deterministic (or rather, the factors are not fully known, including network routing etc., so it's hard to say).
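Without a guaranteed "last" server, a possible client-side workaround is to poll every nameserver of the zone and only continue once all of them have returned the record several rounds in a row. The following is a rough sketch under stated assumptions: wait_until_stable, its default values, and the POLL_INTERVAL variable are hypothetical names, not part of lego or deSEC. It can only see the anycast instance that each query happens to reach, so it lowers the chance of a failed validation rather than ruling it out.

```shell
#!/bin/sh
# Sketch: wait until every nameserver of a zone has answered a TXT query
# in several consecutive polling rounds.
# wait_until_stable and POLL_INTERVAL are hypothetical names.

# wait_until_stable ZONE NAME [ROUNDS] [MAX_TRIES]
# Returns 0 after ROUNDS consecutive rounds in which all nameservers
# returned the record; returns 1 if that does not happen within MAX_TRIES.
wait_until_stable() {
    zone=$1; name=$2; rounds=${3:-5}; max_tries=${4:-120}
    streak=0; tries=0
    servers=$(dig +short NS "$zone")
    [ -n "$servers" ] || return 1
    while [ "$tries" -lt "$max_tries" ]; do
        tries=$((tries + 1))
        ok=1
        for ns in $servers; do
            # dig prints TXT data in double quotes; anything else
            # (empty output, error lines) counts as "record missing".
            dig +short +norecurse "@$ns" "$name" TXT | grep -q '^"' || ok=0
        done
        if [ "$ok" -eq 1 ]; then
            streak=$((streak + 1))
            if [ "$streak" -ge "$rounds" ]; then return 0; fi
        else
            streak=0   # one miss on any server resets the count
        fi
        sleep "${POLL_INTERVAL:-1}"
    done
    return 1
}
```

Requiring a streak rather than a single hit is a deliberate choice here, since the record was observed appearing for one query and vanishing on the next.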

We're planning to implement an API for replication observability though. We might add a compact way of figuring out what the oldest deployed serial is, which I believe would give you what you need. You can track progress of this at https://github.com/desec-io/desec-stack/pull/852. However, we're currently working on more important replication improvements, so work on this PR is delayed a bit -- and once the replication work has finished, you might no longer need it ;-)
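Until such an API exists, the idea of an "oldest deployed serial" can be approximated from the outside by comparing the SOA serial each nameserver reports. This is only a sketch and not the interface proposed in the PR: serial_on and oldest_serial are hypothetical helper names, the comparison is plain numeric and ignores serial number wraparound, and with anycast it only sees the instances that the queries happen to reach.

```shell
#!/bin/sh
# Sketch: find the lowest SOA serial currently served by a zone's nameservers.
# serial_on and oldest_serial are hypothetical helper names.

# serial_on NS ZONE -> prints the SOA serial NS serves for ZONE (empty on failure)
serial_on() {
    # "dig +short ... SOA" prints: mname rname serial refresh retry expire minimum
    dig +short +norecurse "@$1" "$2" SOA | awk 'NF == 7 { print $3; exit }'
}

# oldest_serial ZONE -> prints the lowest serial seen across the zone's NS set
oldest_serial() {
    for ns in $(dig +short NS "$1"); do
        serial_on "$ns" "$1"
    done | sort -n | head -n 1
}
```

If the serial at which the challenge record first became visible is noted, a later oldest_serial result at or above that value would suggest that the nameservers reached by these queries have caught up.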

I'm realizing that this is an issue for https://github.com/desec-io/desec-ns, so I'm closing it here.

Atemu commented 7 months ago

Thanks.

For reference, in case you changed anything: I was able to complete challenges some of the time earlier today, though it was still inconsistent.