balena-os / meta-balena

A collection of Yocto layers used to build balenaOS images
https://www.balena.io/os

DNSMasq timeouts require a sensible value #881

Closed hedss closed 6 years ago

hedss commented 7 years ago

As part of the work for OnPrem, we noticed that when a domain is not resolvable, operations can time out because the lookup blocks. See this thread https://www.flowdock.com/app/rulemotion/r-supervisor/threads/yW_H1qeMsuuXd0y2zsM-7gliLQ2 for more information. The solution there is to alter the resolver options so that the client drops the request after a set length of time and number of attempts.

It's entirely possible that we will suffer similar issues on resinOS via DNSMasq, including within the Supervisor.

DNSMasq itself does not have a user-configurable option for the timeout length of a query; instead, it is set in src/config.h by the following definition:

#define TIMEOUT 10

DNSMasq drops queries after 4x the TIMEOUT value; the multiplier is hardcoded in the get_new_frec() function in src/forward.c.

Therefore, by default, any lookup that the upstream server cannot resolve takes a full 40 seconds (4 x 10) before it is dropped. This is not ideal.
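To make the arithmetic concrete, here is a minimal sketch in C. TIMEOUT matches the definition quoted from src/config.h; DROP_MULTIPLIER and drop_after_seconds() are hypothetical names used only for illustration and are not part of DNSMasq.

```c
#include <stdio.h>

/* Value quoted above from DNSMasq's src/config.h. */
#define TIMEOUT 10

/* Hypothetical name for the hardcoded 4x factor described above;
 * DNSMasq itself does not define this macro. */
#define DROP_MULTIPLIER 4

/* Seconds before an unanswered query is dropped for a given TIMEOUT. */
static int drop_after_seconds(int timeout)
{
    return DROP_MULTIPLIER * timeout;
}

/* Print the drop time for a candidate TIMEOUT value. */
static void print_drop_time(int timeout)
{
    printf("TIMEOUT=%d -> query dropped after %d s\n",
           timeout, drop_after_seconds(timeout));
}
```

Calling print_drop_time(TIMEOUT) reports the 40-second default. Under the same assumption about the 4x factor, a TIMEOUT of 2 would drop an unanswered query after 8 seconds.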

A sensible value for TIMEOUT needs to be configured.

willnewton commented 7 years ago

What is the problem that is trying to be solved here? Is there a testcase that reproduces the problem?

hedss commented 7 years ago

There have been noticeable timeouts on the Supervisor (see https://github.com/resin-io/resin-api/pull/589#issuecomment-335634114), and Pablo's interested in shortening the DNS timeouts.

The other way to go with this is to alter the resolver config (shorten timeouts and attempts).
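As a rough sketch of the resolver-side alternative: glibc reads "timeout:n" and "attempts:n" from the options line of /etc/resolv.conf, and resolv.conf(5) documents that they are silently capped at 30 and 5 respectively. The helper below only renders such a line. The function names, the lower bound of 1, and the example values in the comments are assumptions for illustration and do not come from this thread.

```c
#include <stdio.h>

/* Caps documented in resolv.conf(5) for the glibc resolver. */
#define MAX_TIMEOUT_SECONDS 30
#define MAX_ATTEMPTS 5

/* Restrict value to the inclusive range [lo, hi]. */
static int clamp_int(int value, int lo, int hi)
{
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}

/* Hypothetical helper: writes an "options" line for /etc/resolv.conf
 * into buf, e.g. "options timeout:1 attempts:2".
 * The lower bound of 1 is an assumption made for this sketch.
 * Returns the result of snprintf(). */
static int format_resolver_options(char *buf, size_t size,
                                   int timeout_seconds, int attempts)
{
    return snprintf(buf, size, "options timeout:%d attempts:%d",
                    clamp_int(timeout_seconds, 1, MAX_TIMEOUT_SECONDS),
                    clamp_int(attempts, 1, MAX_ATTEMPTS));
}
```

The resulting line would go into /etc/resolv.conf alongside the nameserver entries; suitable values depend on the network and would need testing on a device.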

CCing in @pcarranzav for more information, and @petrosagg.

agherzan commented 6 years ago

@pcarranzav @petrosagg @hedss What is the status of this? Do we have a conclusion on this? Do we want to shorten the timeout definition?

Can we have a test case in order to be able to reproduce this issue?

hedss commented 6 years ago

This was actually a symptom of the synchronicity of DNS in libuv (see the issue linked in my earlier comment). The resulting hangs in the Supervisor were fixed by https://github.com/resin-io/resin-supervisor/pull/500, which tunneled Mixpanel requests through the API. This can therefore be closed, as it's no longer relevant.