Closed HappyTobi closed 3 years ago
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/176681050
The labels on this github issue will be updated when the story is started.
@HappyTobi thank you for the update. I reviewed the latest version and it looks good to me.
In order to make the `recursor_retry_count` property configurable from the outside, we created a PR including it in the bosh-dns job spec section of the release: https://github.com/HappyTobi/bosh-dns-release/pull/1
@HappyTobi if you agree, feel free to merge it and thereby include it in this PR.
Hi @FlorianNachtigall, I will check that and give you some feedback if needed; otherwise I will merge it into my PR.
We're deviating from my understanding of this doc now with a retry on the same recursor. I think that's probably fine as one advantage I can see is that it will help call out down recursors vs just recursors that don't have the answer.
Yes, there is no part of the RFC that matches 100%, so we have to do some small things differently here.
I think the implementation is fine at this point, because we are using UDP and get a fast retransmission for the lost packets (that's the most important thing).
The retry is also not the "default" path, so in most cases the retry will never be triggered; we don't lose any time here and don't add too much complexity.
The committers are authorized under a signed CLA.
Sam and I looked at the changes and they look good. Currently the pipeline is broken due to a Go 1.16 bump, so merging is blocked on that being fixed.
We were wondering whether `recursor_retry_count` conveys the right meaning, or whether `recursor_max_retries` would be more appropriate.
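For context, either name would be exposed as a property in the bosh-dns job spec. A hypothetical fragment is sketched below — the description text and default value are illustrative assumptions, not the merged release spec:

```yaml
# Hypothetical bosh-dns job spec fragment (names/default are illustrative).
properties:
  recursor_max_retries:
    description: "Number of additional attempts against a recursor after a request fails with a network error"
    default: 0
```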
Pipeline is fixed and we can merge this now. Do you want us to wait for a change to `recursor_retry_count`? Or is it ready to go as it is?
@jpalermo I will rename it and push it again.
@HappyTobi We already renamed the property for our internal fork. I opened a PR with the respective changes: https://github.com/HappyTobi/bosh-dns-release/pull/2/
However, we were discussing whether it makes more sense to hold off on merging until we have collected some production experience and can round off the feature with our learnings.
To give some examples of what those could be: so far we've realized that it could be helpful to include the domain name or some alias UUID in the logs, to identify and correlate the retries belonging to one DNS resolution. Further, our operators found "retry [0/2]" a bit confusing and would have preferred something more like "try [1/3]".
What do you think @jpalermo & @mrosecrance?
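The "[0/2]" vs "[1/3]" question above is just zero-based retry counting versus one-based counting of total tries (the initial request plus the retries). A tiny Go sketch of the two renderings — the function names are hypothetical, not the actual log code:

```go
package main

import "fmt"

// retryLabel renders the zero-based retry counter, e.g. "retry [0/2]"
// for the first retry when max_retry_count = 2.
func retryLabel(retry, maxRetries int) string {
	return fmt.Sprintf("retry [%d/%d]", retry, maxRetries)
}

// tryLabel renders the same attempt one-based, counting the initial
// request as try 1, e.g. "try [1/3]" for max_retry_count = 2.
func tryLabel(retry, maxRetries int) string {
	return fmt.Sprintf("try [%d/%d]", retry+1, maxRetries+1)
}

func main() {
	for retry := 0; retry < 2; retry++ {
		fmt.Println(retryLabel(retry, 2), "==", tryLabel(retry, 2))
	}
}
```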
Hi @jpalermo & @mrosecrance & @FlorianNachtigall
from my side it's fine now.
I fixed some typos and changed the retry output completely, because switching from `[0/2]` to `[1/3]` also doesn't represent what really happens: the user sets `max_retry_count` to `2` and then sees something like `[1/3]` in the output.
The new output for `max_retry_count = 2` looks like:

`dns request network error lookup : i/o timeout retry [1/1] - request count [2] for recursor retry`
Regards
I think the last open thread on this is whether `log.Debug(logTag, fmt.Sprintf("dns request error %v no retry - %v\n", err, recursor))` should be switched to INFO or WARN, but I don't have a strong opinion on that.
Hi all,
I just implemented the retry mechanism for the `ServeDNS` requests. The matching issue for this PR is #74. The issue was that the `client.Exchange` call could get an error because of a timeout or some other related network issue. With the new implementation it's possible to set up and configure a retry count via the new `recursor_retry_count` property in the bosh-dns configuration. With that, a failed request will be retried until the counter is reached.
@mrosecrance What do you think about the implementation? Can you give me some feedback?
I've also tested the new bosh-dns build on a running instance and the log of the fix will look like:
and for a network / i/o timeout: