bosh-dns hides recursor errors

friegger commented 3 years ago

The ForwardRequest handler converts any recursor error into an NXDOMAIN response.

Since the requests are UDP-based, packets sometimes get lost, which results, for example, in an I/O timeout error that is then converted into NXDOMAIN. The application consequently has no chance to handle this properly, as it now assumes that the domain does not exist. DNS resolution becomes unreliable, flapping between successful resolution and NXDOMAIN.

The metrics don't contain this information either: the response is counted as a regular NXDOMAIN, not as an error. The error logs do contain the information, of course.

Desired Behavior

I'm not sure what the correct answer would be according to the DNS spec, but it seems that if the recursor times out, bosh-dns should also just time out, giving the application the chance to act accordingly, e.g. by retrying.

To make bosh-dns more resilient in such scenarios, it could also retry after timing out; in most situations the next request will be successful.

@mrosecrance What do you think?

cf-gitbot commented 3 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/176085593

The labels on this github issue will be updated when the story is started.

beyhan commented 3 years ago

This commit changed the behaviour from SERVFAIL to NXDOMAIN. It also explains why the change was made.

In the case of a recursor timeout, I'm not sure whether bosh-dns can end the request with a timeout, because the timeout between the application client and bosh-dns is different from the timeout bosh-dns uses for recursion.

friegger commented 3 years ago

For lost UDP packets, the SERVFAIL behavior was probably the better option, since clients would then contact the next server, which would again be the IaaS DNS, and that request would probably succeed. For other scenarios, returning NXDOMAIN might be fine.

Without changing the return code, retrying might be a good option to become more resilient here. Providing a config option recursor_retries alongside the existing option recursor_timeout would enable this without changing the default behavior.

mrosecrance commented 3 years ago

To clarify, what we're trying to solve for is the case where only one recursor has the answer. If a UDP packet to that particular server is lost, NXDOMAIN will cause a failover to the other recursors, but in this case they can't resolve the domain. My understanding is that SERVFAIL also tries a different recursor, however, so in effect you're proposing a new way to retry. In cases of UDP packet loss, is there a specific error you see in the logs? Potentially we could toggle on error type here?

friegger commented 3 years ago

Yes, exactly: there is a single recursor, and if it times out, bosh-dns returns NXDOMAIN. And you're right, if we had more than one, it would fail over to another one based on the recursor_selection. This is something that we've tried, by the way. We specifically configured the same IaaS DNS as, e.g., five different recursors to work around this, but unfortunately it also fails over for domains that really don't exist, thus putting unnecessary load on the IaaS DNS. Here I am unsure whether failing over for NXDOMAIN is the right thing to do.

In our scenario we rely on the resolv.conf, which just includes bosh-dns and the IaaS DNS. My comment regarding SERVFAIL related to the idea that if the recursion made by bosh-dns times out, we would return SERVFAIL to the client, which would then decide to ask the next DNS server in resolv.conf, which again would be the IaaS DNS. With that we would get one retry. Here I wonder whether it would also do this if the domain really does not exist. Apparently this was the behavior in an earlier version, which was deliberately changed because it put too much load on the IaaS DNS in some scenarios.

The error we see is:

[ForwardHandler] 2021-01-22T06:01:52.910841000Z ERROR - error recursing for <domain_name> to "<recursor_ip>:53": read udp <vm_ip>:45448-><recursor_ip>:53: i/o timeout

and it comes from: https://github.com/cloudfoundry/bosh-dns-release/blob/0341c9f695308196697be58ccb93a0ddb20e43f8/src/bosh-dns/dns/server/handlers/forward_handler.go#L66

The idea for a retry mechanism would be to retry for this specific error or errors that don't have an exchangeAnswer.

bosh-admin-bot commented 2 years ago

This issue was marked as Stale because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days, it will automatically be closed. To prevent this from happening, remove the Stale label or comment below.

bosh-admin-bot commented 2 years ago

This issue was closed because it has been labeled Stale for 7 days without subsequent activity. Feel free to re-open this issue at any time by commenting below.

cloudfoundry / bosh-dns-release

bosh-dns hides recursor errors #74
