StackExchange / dnscontrol

Infrastructure as code for DNS!
https://dnscontrol.org/
MIT License

DnsProvider tries to replace NS records of another provider #2293

Closed daleki closed 1 year ago

daleki commented 1 year ago

In previous versions of dnscontrol we used a config like this for most of our domains:

D("kentik.net", REG_NONE,
    DefaultTTL('1h'),           // default ttl for records
    NAMESERVER_TTL('1h'),       // default ttl for nameservers
    DnsProvider(DNS_GCLOUD, 4), // grab 4 nameservers
    DnsProvider(DNS_NS1, 4),    // grab 4 nameservers

In recent versions (including latest v3.31.2) this now tries to replace NS records of another provider for some reason. Any idea why these records might be affected now?

        +preview | ******************** Domain: kentik.net
            +preview | 1 correction (gcloud)
            +preview | #1: ± MODIFY NS kentik.net: (dns1.p01.nsone.net. ttl=3600) -> (ns-cloud-d1.googledomains.com. ttl=3600)
            +preview | ± MODIFY NS kentik.net: (dns2.p01.nsone.net. ttl=3600) -> (ns-cloud-d2.googledomains.com. ttl=3600)
            +preview | ± MODIFY NS kentik.net: (dns3.p01.nsone.net. ttl=3600) -> (ns-cloud-d3.googledomains.com. ttl=3600)
            +preview | ± MODIFY NS kentik.net: (dns4.p01.nsone.net. ttl=3600) -> (ns-cloud-d4.googledomains.com. ttl=3600)

tlimoncelli commented 1 year ago

Can you help me reproduce this? I have a similar setup but I'm not seeing that problem.

Is this output with --diff2? Do you get different results without --diff2?

Could you try an earlier release like https://github.com/StackExchange/dnscontrol/releases/tag/v3.29.1 ?

daleki commented 1 year ago

Thanks for the reply! The output above is without --diff2. With --diff2 I get a different result:

            +preview | --> RUN --no-cache dnscontrol --diff2 preview
            +preview | [INFO: Diff2 algorithm in use.]
            +preview | ******************** Domain: cloudhelix.com
            +preview | ******************** Domain: kentik.com
            +preview | 1 correction (gcloud)
            ..skip some lines here
            +preview | #1: - DELETE NS kentik.net dns1.p01.nsone.net. ttl=3600
            +preview | - DELETE NS kentik.net dns2.p01.nsone.net. ttl=3600
            +preview | - DELETE NS kentik.net dns3.p01.nsone.net. ttl=3600
            +preview | - DELETE NS kentik.net dns4.p01.nsone.net. ttl=3600

v3.29.1 yields the same results with and without --diff2.

tlimoncelli commented 1 year ago

CC @costasd and @riyadhalnur for assistance

This is odd because nothing was intended to change regarding NS handling recently. That's not to say things didn't change, or that some other change didn't have an unexpected side effect. I'm just saying that at this point I haven't identified the problem.

Next step: Which was the last release that didn't have this problem? Binaries are available here: https://github.com/StackExchange/dnscontrol/tags

tlimoncelli commented 1 year ago

(also: If anyone else has a similar issue, please speak up. Does it involve NS1?)

daleki commented 1 year ago

Next step: Which was the last release that didn't have this problem? Binaries are available here: https://github.com/StackExchange/dnscontrol/tags

We were on the 3.19.0 release for a long time with no issues. Then we started getting the errors below, which prompted the upgrade to v3.31.2:

            +preview | ----- Getting nameservers from: ns1
            +preview | provider code leaves trailing dot on nameserver

tlimoncelli commented 1 year ago

The errors happened even with 3.19.0? i.e. there's a possibility that NS1's API changed?

tlimoncelli commented 1 year ago

Give the tlim_b2293_ns1_nameservers branch a try.

git clone https://github.com/StackExchange/dnscontrol.git
cd dnscontrol
git checkout tlim_b2293_ns1_nameservers
go install

This will install a new binary in $GOBIN if that is set, otherwise in $GOPATH/bin (~/go/bin by default). Give that a try.

daleki commented 1 year ago

The 3.19.0 error is a different one: provider code leaves trailing dot on nameserver, then it just exits. I'll try tlim_b2293_ns1_nameservers.

costasd commented 1 year ago

Hi,

From @daleki's output (thanks!), it looks like dnscontrol picks 4 nameservers instead of 4+4, so it ends up changing them every time. I suspect that's roughly where the bug lies, but I haven't verified anything.

Unfortunately, the trailing-dot bug that was fixed recently won't allow for a lot of bisecting here, at least on NS1's side.

I'm on a trip so it'll take a bit, but I'll try to replicate the setup (got access to gcloud and ns1) and see if I can debug it further.
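
To make the "4 instead of 4+4" hypothesis above concrete, here is a minimal sketch in plain JavaScript. It is not dnscontrol's actual code: `expectedApexNS` and `diffNS` are hypothetical helpers, and the starting state (a zone holding only the NS1 names) is an assumption taken from the first preview output in this thread.

```javascript
// Illustrative sketch only; this is NOT dnscontrol's actual implementation.

// Merge the nameservers each provider contributes to the zone apex.
// `count` caps how many names are taken from a provider (0 = none).
function expectedApexNS(providers) {
  const merged = new Set();
  for (const p of providers) {
    for (const ns of p.nameservers.slice(0, p.count)) {
      merged.add(ns);
    }
  }
  return [...merged].sort();
}

// Compare the NS records a zone currently holds with the desired set.
function diffNS(existing, desired) {
  const have = new Set(existing);
  const want = new Set(desired);
  return {
    add: desired.filter((ns) => !have.has(ns)),
    del: existing.filter((ns) => !want.has(ns)),
  };
}

// Nameserver names copied from the preview output in this thread.
const gcloud = {
  count: 4,
  nameservers: [1, 2, 3, 4].map((i) => `ns-cloud-d${i}.googledomains.com.`),
};
const ns1 = {
  count: 4,
  nameservers: [1, 2, 3, 4].map((i) => `dns${i}.p01.nsone.net.`),
};

// Assumed starting state: the zone holds only the NS1 names.
const existing = ns1.nameservers;

// With the merged 4+4 set, no NS1 record is deleted.
const merged = diffNS(existing, expectedApexNS([gcloud, ns1]));
console.log(merged.del.length, merged.add.length); // prints "0 4"

// With only one provider's 4 names, the other provider's records are removed.
const single = diffNS(existing, expectedApexNS([gcloud]));
console.log(single.del.length, single.add.length); // prints "4 4"
```

If the desired set were built from a single provider, the second case would match the MODIFY/DELETE corrections reported above; whether that is what actually happens inside dnscontrol is unverified.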

costasd commented 1 year ago

So, I created the following setup, with the relevant accounts and zones set up in NS1 and GCLOUD for example.com:

var REG_NONE = NewRegistrar('none');
var DNS_NS1 = NewDnsProvider('ns1');
var DNS_GCLOUD = NewDnsProvider('gcloud');

D("example.com", REG_NONE, 
    DefaultTTL('1h'),           // default ttl for records
    NAMESERVER_TTL('1h'),       // default ttl for nameservers
    DnsProvider(DNS_GCLOUD, 4), // grab 4 nameservers
    DnsProvider(DNS_NS1, 4),    // grab 4 nameservers
    A('@', '1.2.3.4')
);

And I can't seem to reproduce it with the latest master:

 $ ../../oss/dnscontrol/dnscontrol  --diff2 preview --domains example.com
[INFO: Diff2 algorithm in use.]
******************** Domain: example.com
Done. 0 corrections.

$ ../../oss/dnscontrol/dnscontrol preview --domains example.com 
[INFO: Old diff algorithm in use. Please test --diff2 as it will be the default in releases after 2023-05-07. See https://github.com/StackExchange/dnscontrol/issues/2262]
******************** Domain: example.com
Done. 0 corrections.

Is there anything missing from this setup that is needed to trigger the behavior?

costasd commented 1 year ago

Same result (no changes) with a quick build from the v3.31.2 tag.

daleki commented 1 year ago

Hi @costasd! Great to see you here, and thanks for taking a look. I can't think of anything missing other than the state of the already-pushed DNS records on the providers' backends. I ran a few more tests with the 3.31.2 release and --diff2, but for now I can't narrow the bug down any further than saying it seems related to NS1. When I set DnsProvider(DNS_NS1, 0) and try adding other providers like R53, I see the expected output.

tlimoncelli commented 1 year ago

@daleki Maybe it is a problem at NS1? Try removing records and re-adding them. i.e. use DnsProvider(DNS_NS1, 0) and do a "push" to clear things out. Then DnsProvider(DNS_NS1, 4) and push again.
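
As a sketch of that two-step reset, the first push could use a config like the one below. It reuses the names from the original report; the domain's other records are omitted here (as they are in the report) and would need to stay in place in a real dnsconfig.js.

```javascript
// Step 1: take zero nameservers from NS1, then run `dnscontrol push`.
D("kentik.net", REG_NONE,
    DefaultTTL('1h'),           // default ttl for records
    NAMESERVER_TTL('1h'),       // default ttl for nameservers
    DnsProvider(DNS_GCLOUD, 4), // grab 4 nameservers
    DnsProvider(DNS_NS1, 0)     // 0 = contribute no nameservers from NS1
    // ...the domain's other records go here, unchanged...
);

// Step 2: change DnsProvider(DNS_NS1, 0) back to DnsProvider(DNS_NS1, 4)
// and run `dnscontrol push` again.
```

Running `dnscontrol preview` before each push shows which corrections would be applied.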

daleki commented 1 year ago

We fixed it by manually updating the state in NS1 via the UI. We deleted all Google NS records in all zones on NS1 and manually added the NS1 NS records for all zones. Then we ran a normal dnscontrol push with the new --diff2 option and got the desired result.

➜ dig +short kentik.com ns @dns1.p01.nsone.net.
dns1.p01.nsone.net.
dns2.p01.nsone.net.
dns3.p01.nsone.net.
dns4.p01.nsone.net.
ns-cloud-c1.googledomains.com.
ns-cloud-c2.googledomains.com.
ns-cloud-c3.googledomains.com.
ns-cloud-c4.googledomains.com.

Thanks for the help @costasd and @tlimoncelli !
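
For anyone who wants to script the same check, here is a minimal sketch in plain JavaScript that counts the NS names per provider in `dig +short` output. `countNSBySuffix` is a hypothetical helper, and matching on the `.nsone.net.` and `.googledomains.com.` suffixes is an assumption based on the names shown in this thread.

```javascript
// Count NS names per provider suffix in `dig +short <zone> ns` output.
function countNSBySuffix(digOutput, suffixes) {
  const names = digOutput
    .split("\n")
    .map((line) => line.trim().toLowerCase())
    .filter((line) => line.length > 0);
  const counts = {};
  for (const suffix of suffixes) {
    counts[suffix] = names.filter((name) => name.endsWith(suffix)).length;
  }
  return counts;
}

// The dig output reported above, pasted as a string.
const digOutput = `dns1.p01.nsone.net.
dns2.p01.nsone.net.
dns3.p01.nsone.net.
dns4.p01.nsone.net.
ns-cloud-c1.googledomains.com.
ns-cloud-c2.googledomains.com.
ns-cloud-c3.googledomains.com.
ns-cloud-c4.googledomains.com.`;

const counts = countNSBySuffix(digOutput, [".nsone.net.", ".googledomains.com."]);
console.log(counts);
```

A count of 4 for each suffix corresponds to the 4+4 nameservers requested by the two DnsProvider() lines.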