StackExchange / dnscontrol

Infrastructure as code for DNS!
https://dnscontrol.org/
MIT License
3.1k stars 391 forks source link

INWX: support changing from null MX to non-null MX #2800

Open synotna opened 8 months ago

synotna commented 8 months ago

Describe the bug With INWX, when changing from null MX to non-null MX records, the ordering matters because INWX enforces that you cannot add further MX records while you have a null MX record

dnscontrol does not seem to handle this, leading to try to add non-null MX records before the null MX record is modified

To Reproduce Steps to reproduce the behavior:

  1. Have an INWX domain that has null MX
  2. Change to having non-null MX records e.g. the standard set for Google
  3. The additions are attempted before the modification of the existing null MX record, causing them to fail

1: + CREATE MX example.org 10 alt3.aspmx.l.google.com. ttl=3600

FAILURE! (2308) Data management policy violation

2: + CREATE MX example.org 10 alt4.aspmx.l.google.com. ttl=3600

FAILURE! (2308) Data management policy violation

3: + CREATE MX example.org 5 alt1.aspmx.l.google.com. ttl=3600

FAILURE! (2308) Data management policy violation

4: + CREATE MX example.org 5 alt2.aspmx.l.google.com. ttl=3600

FAILURE! (2308) Data management policy violation ...

6: ± MODIFY MX example.org: (0 . ttl=3600) -> (1 aspmx.l.google.com. ttl=3600)

SUCCESS!

Expected behavior The existing null MX record should be modified first so that the subsequent additions do not fail

DNS Provider INWX

tlimoncelli commented 8 months ago

CC @blackshadev (author of dnsgraph, which decides the ordering of updates) CC @patschi (INWX maintainer)

blackshadev commented 8 months ago

I am interested to know if any other provider has the same issue. I do not (yet) have any exceptions in the ordering flow for any provider. And it seems weird to me to make an exception since technically you can depent on empty records and the diff shown below should work in most systems.

I don't know of the INWX maintainer (@patschi) agrees, but it seems INWX specific and thus is should be best handled as an exception in the provider. My suggestion would be to detect the case in the provider : modify-ing an empty MX record. And treat is as a Delete + Create. The TransIP also have some weird edgecases were I ended up deleting and recreating the record instead.

tlimoncelli commented 8 months ago

On one hand....it does seem prudent that an NullMX should have a different weight than regular MX.

On the other hand... how often will this happen? Is there a sufficient workaround? (The workaround would be to delete all MX records in one push then add the Null MX in a second push).

tlimoncelli commented 8 months ago

I'd like to see if its possible to create an integration test that detects this situation.

I created this: https://github.com/StackExchange/dnscontrol/pull/2801

Could someone with an INWX account try running the integration tests from that branch?

Tom

P.S. The providers I do have creds for all passed the test: https://github.com/StackExchange/dnscontrol/actions/runs/7559961364/job/20584943571?pr=2801

patschi commented 8 months ago

I'd like to see if its possible to create an integration test that detects this situation.

Seems to work (the test itself at least):

$ go test -v -verbose -timeout 0 -provider INWX | grep -i fail
--- FAIL: TestDNSProviders (130.29s)
    --- FAIL: TestDNSProviders/dnscontrol1.com (130.09s)
        --- FAIL: TestDNSProviders/dnscontrol1.com/16:NullMX:unnull (0.12s)
        --- FAIL: TestDNSProviders/dnscontrol1.com/17:NullMXApex:unnull (1.27s)
        --- FAIL: TestDNSProviders/dnscontrol1.com/19:complex_TXT:TXT_with_1_dq-1interior (0.27s)
        --- FAIL: TestDNSProviders/dnscontrol1.com/20:TXT_backslashes:TXT_with_backslashs (0.41s)
FAIL
FAIL    github.com/StackExchange/dnscontrol/v4/integrationTest  130.876s

TRANSIP seems to fail the test as well: https://github.com/StackExchange/dnscontrol/actions/runs/7559961364/job/20585007084?pr=2801

If we prefer having some sort of workaround in INWX provider itself, I can try to work something out.

@tlimoncelli Is the CI running basically the integrationTest just like when running them locally? I could then create a new account with INWX with one test domain and add it to the CI pipeline. So that we have this covered as well.

tlimoncelli commented 8 months ago

@patschi Yes, the CI is running the same integrationTest. If you'd like to send me creds, please follow these instructions: https://docs.dnscontrol.org/developer-info/byo-secrets (this is a new doc; if you see something that needs correcting please let me know!)

tlimoncelli commented 8 months ago

Since TRANSIP also fails the test (link) I lean towards fixing this in dnsgraph. If 2 have this issue, that's a trend.

I'm concerned (but not certain!) that fixing it at the dnsgraph level might not help INWX. INWX uses diff.NewCompat(dc).IncrementalDiff(foundRecords) which reorders the changes randomly. Would you be interested in converting to diff2? Looking at the code... the conversion looks relatively straightforward.

patschi commented 8 months ago

Looks like almost all providers seem to use the same? https://github.com/search?q=repo%3AStackExchange%2Fdnscontrol%20%22diff.NewCompat(dc).IncrementalDiff%22&type=code

Shouldn't all providers then be adjusted to the new diff2 code to be consistent? (Was looking for some example code how I can convert it)

tlimoncelli commented 8 months ago

@blackshadev hmm... this is interesting!

The BIND provider outputs this message:

https://github.com/StackExchange/dnscontrol/actions/runs/7559961364/job/20585004068?pr=2801

PASS integrationTest.TestDNSProviders/example.com/17:NullMXApex:renull (0.00s)
  2024/01/17 18:12:10 Warning: Found unresolved records [example.com example.com www.example.com].
  This can indicate a circular dependency, please ensure all targets from given records exist and no circular dependencies exist in the changeset. These unresolved records are still added as changes and pushed to the provider, but will cause issues if and when the provider checks the changes.
  For more information and how to disable the reordering please consolidate our documentation at [https://docs.***.org/developer-info/ordering](https://docs.%2A%2A%2A.org/developer-info/ordering)
      integration_test.go:247: 
          - DELETE example.com MX 8 example.com. ttl=300
          - DELETE example.com A 1.2.3.4 ttl=300
          ± MODIFY example.com MX (4 www.example.com. ttl=300) -> (0 . ttl=300)
          - DELETE www.example.com A 1.2.3.8 ttl=300
  WRITING ZONEFILE: zones/example.com.zone
          --- PASS: TestDNSProviders/example.com/17:NullMXApex:renull (0.00s)

I'm seeing the same message in NAMEDOTCOM, HEXONET, HEDNS, and then I stopped looking.

tlimoncelli commented 8 months ago

Looks like almost all providers seem to use the same? https://github.com/search?q=repo%3AStackExchange%2Fdnscontrol%20%22diff.NewCompat(dc).IncrementalDiff%22&type=code

Shouldn't all providers then be adjusted to the new diff2 code to be consistent? (Was looking for some example code how I can convert it)

The diff.NewCompat() call is a bandaid. I wrote that so that I could get rid of the old diff code and not have to maintain the old and new code in parallel. I would definitely like to see all providers move to the diff2 functions. About half of them have converted:

$ grep diff2.By providers/*/*.go | wc -l
      22
$ grep diff.NewCompat providers/*/*.go | wc -l
      23

I think INWX would use diff2.ByRecord(). Take a look at providers/bunnydns/records.go for a relatively clean example.

blackshadev commented 8 months ago

Since TRANSIP also fails the test (link) I lean towards fixing this in dnsgraph. If 2 have this issue, that's a trend.

I'm concerned (but not certain!) that fixing it at the dnsgraph level might not help INWX. INWX uses diff.NewCompat(dc).IncrementalDiff(foundRecords) which reorders the changes randomly. Would you be interested in converting to diff2? Looking at the code... the conversion looks relatively straightforward.

This is a different "issue" all together, one I cannot fix: TransIP Will not allow an empty MX record. So the whole test setup doesn't even work for TransIP. I suspect most providers will not accept an empty MX record and thus fail the test.

@blackshadev hmm... this is interesting!

The BIND provider outputs this message:

https://github.com/StackExchange/dnscontrol/actions/runs/7559961364/job/20585004068?pr=2801

PASS integrationTest.TestDNSProviders/example.com/17:NullMXApex:renull (0.00s)
  2024/01/17 18:12:10 Warning: Found unresolved records [example.com example.com www.example.com].
  This can indicate a circular dependency, please ensure all targets from given records exist and no circular dependencies exist in the changeset. These unresolved records are still added as changes and pushed to the provider, but will cause issues if and when the provider checks the changes.
  For more information and how to disable the reordering please consolidate our documentation at [https://docs.***.org/developer-info/ordering](https://docs.%2A%2A%2A.org/developer-info/ordering)
      integration_test.go:247: 
          - DELETE example.com MX 8 example.com. ttl=300
          - DELETE example.com A 1.2.3.4 ttl=300
          ± MODIFY example.com MX (4 www.example.com. ttl=300) -> (0 . ttl=300)
          - DELETE www.example.com A 1.2.3.8 ttl=300
  WRITING ZONEFILE: zones/example.com.zone
          --- PASS: TestDNSProviders/example.com/17:NullMXApex:renull (0.00s)

I'm seeing the same message in NAMEDOTCOM, HEXONET, HEDNS, and then I stopped looking.

Hmm very weird, I hope I can take a crack at this issue upcoming friday. It is weird, there shouldn't be a difference in the behavior of providers, but for some reason there is?

tlimoncelli commented 8 months ago

About TransIP: Ooops! My bad. Yes, TransIP is a non-issue as NullMX is not processed.

blackshadev commented 8 months ago

@tlimoncelli I am unable to reproduce your issue with the integration test on the BIND provider:

~ go test -v -verbose -provider BIND -start 17 -end 17                    

=== RUN   TestDNSProviders
=== RUN   TestDNSProviders/example.com
=== RUN   TestDNSProviders/example.com/Clean_Slate:Empty
    integration_test.go:247: 
        - DELETE example.com NS ns1.example.com. ttl=300
        - DELETE example.com NS ns2.example.com. ttl=300
WRITING ZONEFILE: zones/example.com.zone
=== RUN   TestDNSProviders/example.com/17:NullMXApex:create
    integration_test.go:247: 
        + CREATE example.com A 1.2.3.2 ttl=300
        + CREATE example.com MX 0 . ttl=300
        + CREATE www.example.com A 1.2.3.8 ttl=300
WRITING ZONEFILE: zones/example.com.zone
=== RUN   TestDNSProviders/example.com/17:NullMXApex:unnull
    integration_test.go:247: 
        + CREATE example.com MX 8 www.example.com. ttl=300
        ± MODIFY example.com MX (0 . ttl=300) -> (2 example.com. ttl=300)
WRITING ZONEFILE: zones/example.com.zone
=== RUN   TestDNSProviders/example.com/17:NullMXApex:renull
    integration_test.go:247: 
        ± MODIFY example.com MX (2 example.com. ttl=300) -> (0 . ttl=300)
        - DELETE example.com MX 8 www.example.com. ttl=300
WRITING ZONEFILE: zones/example.com.zone
--- PASS: TestDNSProviders (0.02s)
    --- PASS: TestDNSProviders/example.com (0.02s)
        --- PASS: TestDNSProviders/example.com/Clean_Slate:Empty (0.00s)
        --- PASS: TestDNSProviders/example.com/17:NullMXApex:create (0.01s)
        --- PASS: TestDNSProviders/example.com/17:NullMXApex:unnull (0.00s)
        --- PASS: TestDNSProviders/example.com/17:NullMXApex:renull (0.00s)
=== RUN   TestDualProviders
    integration_test.go:369: Clearing everything
    integration_test.go:363: #1:
        - DELETE example.com A 1.2.3.2 ttl=300
        - DELETE example.com MX 0 . ttl=300
        - DELETE www.example.com A 1.2.3.8 ttl=300
WRITING ZONEFILE: zones/example.com.zone
    integration_test.go:376: Adding nameservers from another provider
    integration_test.go:363: #1:
        + CREATE example.com NS ns1.example.com. ttl=300
        + CREATE example.com NS ns2.example.com. ttl=300
WRITING ZONEFILE: zones/example.com.zone
    integration_test.go:379: Running again to ensure stability
--- PASS: TestDualProviders (0.00s)
=== RUN   TestNameserverDots
=== RUN   TestNameserverDots/No_trailing_dot_in_nameserver
--- PASS: TestNameserverDots (0.00s)
    --- PASS: TestNameserverDots/No_trailing_dot_in_nameserver (0.00s)
PASS
ok      github.com/StackExchange/dnscontrol/v4/integrationTest  0.025s

I also ran it without the start parameter. But everything runs fine without warning.

What did you do differently? I am on the main branch commit d926e454d08eb642d99c8710c63cab749d55ff87

tlimoncelli commented 8 months ago

Welp, now I can't reproduce it either. I even re-ran it in GHA: https://github.com/StackExchange/dnscontrol/actions/runs/7586343357/job/20664596836

:confused emoji:

Ok, let's say that was a fluke.

That said, I'm still interested in seeing if the original bug can be solved at the dnsgraph level. It is similar to the fact that when converting an A record to a CNAME yet a challenge because it is different.

blackshadev commented 8 months ago

Wel given that it works as expected on BIND. How do we expect it to work any other way on INWX any differently?

If @patschi uses the diff2 algorithm DNSGraph should sort things out right? If not, we need a way for me to debug and run some test suite.

For now: My understanding is without using diff2 , it will not use DNSGraph . So lets do that first and check for the result afterward. Or am I missing the point completely?

tlimoncelli commented 8 months ago

I think the next step should be to convert INWX to use diff2. If that fixes the NullMX problem, we've killed two birds with one stone.

If @patschi is interested in doing the conversion, that'd be great. If you want to add a test account (https://docs.dnscontrol.org/developer-info/byo-secrets) it would enable me to help out. (Don't let the long doc fool you... the actual work to set this up is rather small.)

If converting to diff2 doesn't work... then we'll loop back to @blackshadev with a repro case. :)

tlimoncelli commented 8 months ago

@tlimoncelli Is the CI running basically the integrationTest just like when running them locally? I could then create a new account with INWX with one test domain and add it to the CI pipeline. So that we have this covered as well.

@patschi To securely send the credentials for use with the project, use this link: https://transfer.secretoverflow.com/u/tlimoncelli (I should be able to get to it in the next week)