cloudfoundry / bosh-dns-release

BOSH DNS release
Apache License 2.0
18 stars 36 forks source link

bosh-dns recursor retry #75

Closed HappyTobi closed 3 years ago

HappyTobi commented 3 years ago

Hi all,

I just implemented the retry mechanism for the ServeDNS requests. The matching issue for that pr was #74 The issue was that it could happen that the client.Exchange call gets en error because of an timeout or any other related network issue.

So with the new implementation it's possible to setup and configure a retry_count with the new recursor_retry_count property at the bosh-dns configuration. With that retry a failed response will be triggered again till the counter was reached.

@mrosecrance What do you think about the implementation? Can you give me some feedback?

I've also tested the new bosh-dns build on a running instance and the log of the fix will look like:

[FailoverRecursor] 2021-01-28T17:26:48.296952065Z DEBUG - dns request error received NXDOMAIN for foo.bar.xyz. from upstream (recursor: XXX.XX.XXX.XX:53) no retry - XXX.XX.XXX.XX:53

and for a network / i/o timeout:

[ForwardHandler] 2021-01-28T17:26:541327202Z ERROR - error recursing for tacoda.net. to "XXX.XX.XXX.XX:53": read udp XX.X.X.X:47601->XXX.XX.XXX.XX: i/o timeout
[FailoverRecursor] 2021-01-28T17:26:541327202Z ERROR - dns request network error read udp XX.X.X.X:47601->XXX.XX.XXX.XX:53: i/o timeout retry [1/3] for recoursor XXX.XX.XXX.XX53
[ForwardHandler] 2021-01-28T17:26:53.541652001Z ERROR - error recursing for tacoda.net. to "XXX.XX.XXX.XX:53": read udp XX.X.X.X:47601->XXX.XX.XXX.XX: i/o timeout
[FailoverRecursor] 2021-01-28T17:26:53.541687501Z ERROR - dns request network error read udp XX.X.X.X:47601->XXX.XX.XXX.XX:53: i/o timeout retry [2/3] for recoursor XXX.XX.XXX.XX53
[ForwardHandler] 2021-01-28T17:26:542092900Z ERROR - error recursing for tacoda.net. to "XXX.XX.XXX.XX:53": read udp XX.X.X.X:47601->XXX.XX.XXX.XX: i/o timeout
[FailoverRecursor] 2021-01-28T17:26:542092900Z ERROR - dns request network error read udp XX.X.X.X:47601->XXX.XX.XXX.XX:53: i/o timeout retry [3/3] for recoursor XXX.XX.XXX.XX53
linux-foundation-easycla[bot] commented 3 years ago

CLA Signed

The committers are authorized under a signed CLA.

cf-gitbot commented 3 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/176681050

The labels on this github issue will be updated when the story is started.

beyhan commented 3 years ago

@HappyTobi thank you for the update. I reviewed the latest version and it looks good for me.

FlorianNachtigall commented 3 years ago

In order to make the recursor_retry_count property configurable from the outside, we created a PR including it in the bosh-dns job spec section of the release: https://github.com/HappyTobi/bosh-dns-release/pull/1

@HappyTobi if you agree, feel free to merge it and thereby include it in this PR

HappyTobi commented 3 years ago

Hi @FlorianNachtigall I will check that, and give you some feedback if needed, else I will merge it into my pr.

HappyTobi commented 3 years ago

We're deviating from my understanding of this doc now with a retry on the same recursor. I think that's probably fine as one advantage I can see is that it will help call out down recursors vs just recursors that don't have the answer.

Yes, there is no part at the rfc that matches 100% so we have to go some small things different here.

Think the implementation was fine at this point, because we are using UDP and have a fast retransmission for the lost packages (thats the most important thing).

The retry is also not the "default" way... so in most of the cases the retry will never be called so we don't loose any time here and didn't add to much complexity.

linux-foundation-easycla[bot] commented 3 years ago

CLA Signed

The committers are authorized under a signed CLA.

mrosecrance commented 3 years ago

Sam and I looked at the changes and it looks good. Currently the pipeline is broken due to a go 1.16 bump so merging is blocked on that being fixed.

friegger commented 3 years ago

We were wondering wether recursor_retry_count is giving away the right meaning or whether recursor_max_retries would be more appropriate?

jpalermo commented 3 years ago

Pipeline is fixed and we can merge this now. Do you want us to wait for a change to recursor_retry_count? Or is it ready to go as it is?

HappyTobi commented 3 years ago

@jpalermo I will rename it and push it again.

FlorianNachtigall commented 3 years ago

@HappyTobi We already renamed the property for our internal fork. I opened a PR with the respective changes: https://github.com/HappyTobi/bosh-dns-release/pull/2/

However, we were discussing whether it makes more sense to wait with the merge until we have collected some experience in production and round off the feature with our learnings.

To give some examples what those could be: Until now we've realized that it could be helpful to include the domain name or some alias uuid in the logs to identify and relate multiple dns resolution retries. Further, our operators found "retry [0/2]" a bit confusing and would have preferred something more like "try [1/3]".

What do you think @jpalermo & @mrosecrance?

HappyTobi commented 3 years ago

Hi @jpalermo & @mrosecrance & @FlorianNachtigall

from my side it's fine now.

I fixed some typos and changed the retry output complete because to switch from [0/2] to [1/3] in the output also doesn't represent whats really happen because the user set max_retry_count to 2 and see something like [1/3] in the output .

The new output for max_retry_count = 2 looks like:

dns request network error lookup : i/o timeout retry [1/1] - request count [2] for recursor retry

Regards

mrosecrance commented 3 years ago

I think the last open thread on this is whether log.Debug(logTag, fmt.Sprintf("dns request error %v no retry - %v\n", err, recursor)) should be switched to INFO or WARN but I don't have a strong opinion on that.