bentasker / HomeAssistantAddons

30 stars 4 forks source link

HomeAssistant CoreDNS timeout settings #3

Open bentasker opened 2 years ago

bentasker commented 2 years ago

As detailed here some users might still experience delays/lag if their external connectivity drops.

It'll depend on what the upstream timeouts on their local DNS is - obviously if they're using a router, then it may not be configurable (and the router might be a little too dumb to know that it's upstream link is down).

Back within the sphere of HomeAssistant, I was wondering if we can tweak things there so mitigate for users affected by that.

Unfortunately, CoreDNS uses a DialTimeout of 30s (here).

Actually, worse than that, it dynamically adjusts, which probably goes a long way to explain the packet storms that HomeAssistant users have reported when Cloudflare was unreachable.

Unfortunately, timeout isn't a configurable - we'd need to make changes in the source and then recompile coredns with those. Far from ideal.

bentasker commented 2 years ago

The idea of having to rebuild coredns to change this is insane.

However, it is doable, and led me onto a much wider thought.

Actually, if push comes to shove, we've a fair bit of flexibility in terms of what's running on the system despite not having access to the underlying OS (that's achievable too, I've already popped the host from an add-on container, it's just not something I'm gonna productise).

We have access to the docker daemon, which means we can build and tag images to our heart's content. So rather than trying to patch things via exec we can actually go one further:

Then, when supervisor restarts the add-on, we'll be running our modified version. Obviously, after any update the process would need to be re-run.

It also means that an alternative fix for #1 would be to build an entirely new supervisor container with auto-update patched out and calls to CodeNotary in https://github.com/home-assistant/supervisor/blob/main/supervisor/utils/codenotary.py patched out.

That's very much a nuclear option though, we could just as easily push across a rewrite to that file so that exceptions CodeNotaryError, CodeNotaryUntrusted, PwnedError can never be thrown.

Having dug into it, it really is the most pointless "protection" - the code is checking it's own homework, so anyone with the ability to patch the code being protected also has the ability to patch out the checks. It's effectively just security theatre that impacts users most.

alexdelprete commented 2 years ago

Gentlemen (@fenichelar - @Zixim): probably there's a solution to our neverending battle for a proper HA DNS implementation: https://github.com/bentasker/HomeAssistantAddons/tree/master/core-dns-override

@bentasker you have no idea how long we have been waiting for an auto-patcher, we've been patching that configuration manually every time. They didn't even approve a slight modification like this one, let alone all the ones (#56 to #61) that @fenichelar tried to push. It's incredible they're not even able to acknowledge they have a huge DNS problem.

I'm going to install your addon immediately and I hope it solves the problem. Thank you so much. :)

bentasker commented 2 years ago

@alexdelprete Glad I could help.

Let me know if you encounter any issues with it - I've tried to be relatively conservative in terms of what it does, but if that causes an issue then there's another more heavy-handed approach I can take (which does come with the benefit that it won't revert when the container is restarted).

It's incredible they're not even able to acknowledge they have a huge DNS problem.

I've done what I can to feed a detailed analysis of the issue and underlying cause back upstream (https://github.com/home-assistant/plugin-dns/issues/ 64) but based on how lightly other PRs/Issues have been discarded, I'm not going to hold my breath. The idea that they've ignored an issue of this scale for this long is more than a bit bewildering to be honest.

I also have supervisor auto-updates on my radar. #1 implements an initial blocker to them, but it means that the Supervisor "Add on store" doesn't display after a reboot (it blocks trying to pull information by the looks of it). There are a couple of ways I can go about fixing that, I've not quite settled on which I prefer yet.

alexdelprete commented 2 years ago

there's another more heavy-handed approach I can take (which does come with the benefit that it won't revert when the container is restarted).

I read your comments about that, I think it's a little bit too much. :)

I hope this smarter approach works fine, will let you know...thanks a lot.

I'm not going to hold my breath. The idea that they've ignored an issue of this scale for this long is more than a bit bewildering to be honest.

I doubt they will do anything. In our last attempt, I even engaged the big boss, and he said they were not even willing to discuss it anymore. Case closed. I won't even bother to comment anymore, they wanted to suspend me because I was simply showing them how naive their approach was vs the problem: a big elephant in the room, and they don't see it.

fenichelar commented 2 years ago

I started work on a PR to create a config option that allowed completely disabling Cloudflare DOT, however, it became pretty clear that I was not going to be able to get this merged so I gave up.

FYI, I was able to solve all of my problems by NATing 1.1.1.1 and 1.0.0.1 to my private DNS resolver. The NAT is setup for both TCP/UDP port 53 and TCP port 853. TLS certificate verification fails because my DOT server obviously does not have a valid TLS certificate for cloudflare-dns.com, but I do not see any more packet storms, have much faster DNS resolution, and do not see any issues with internal domains. I have been running this setup for at least a month without issue.

bentasker commented 2 years ago

I started work on a PR to create a config option that allowed completely disabling Cloudflare DOT

I think it was one of your PRs I first saw actually - lots of work had gone into it, but then the devs came back with a bunch of requirements no-one sane could accept: "do all this work, and ensure we mark anyone using it as unsupported"

fails because my DOT server obviously does not have a valid TLS certificate for cloudflare-dns.com, but I do not see any more packet storms

That makes sense - the storms happen because of CoreDNS's handling at the TCP level - if the TCP handshake completes then you avoid that (even if SSL verification then fails). The funny thing is that your PR to move it back to UDP would have fixed those (even without redirection to another server).

have much faster DNS resolution

Yep, I've been quite struck by the difference to be honest. I've a few devices ewelink and Tuya devices, and I'd always assumed they were slow/laggy because distant API and didn't bother troubleshooting. Turns out the issue was very much closer to home.

I read your comments about that, I think it's a little bit too much. :)

Agreed, @alexdelprete - the main reason I've been experimenting with it is to hedge against ending up being marked as unsupported/unhealthy. Whilst I've no desire to seek support on something I've customised, I do want to still be able to update addons etc.

Zixim commented 2 years ago

been using the auto-patcher since it was created. It's the bees knees.

alexdelprete commented 2 years ago

been using the auto-patcher since it was created. It's the bees knees.

and you didn't tell us...:)

alexdelprete commented 2 years ago

Agreed, @alexdelprete - the main reason I've been experimenting with it is to hedge against ending up being marked as unsupported/unhealthy.

I perfectly understand, it's the plan B in case they ban your addon or enforce something that neutralizes your fix. You prepared the nuke. I don't think they will bother, they perfectly know there are people like us that in some way will find workarounds. :)

I'm still puzzled by the choice of using CoreDNS vs unbound or dnsmasq, that are much more reliable and widely used...because I can clearly see there are two issues here: a bad configuration + a bad dns resolver/forwarder.

One question about the addon, I'm just installing it: your addon checks every 60s to enforce correct configuration, but that doesn't imply a restart every 60s, right? Because that could raise some issues. Restart needed only if changes are applied, if I understood correctly. That is also if dns-override-template is used, right?

alexdelprete commented 2 years ago

I can confirm the addon is working perfectly. Great job. :)

bentasker commented 2 years ago

One question about the addon, I'm just installing it: your addon checks every 60s to enforce correct configuration, but that doesn't imply a restart every 60s, right? Because that could raise some issues. Restart needed only if changes are applied, if I understood correctly. That is also if dns-override-template is used, right?

Exactly correct - it'll only restart coredns if the config needed to be changed - if everything's the same it won't do the restart.

I'm still puzzled by the choice of using CoreDNS vs unbound or dnsmasq, that are much more reliable and widely used...because I can clearly see there are two issues here: a bad configuration + a bad dns resolver/forwarder.

CoreDNS is actually pretty good where it's commonly used - in Kubernetes clusters with decent connectivity - it's a poor choice (IMO) for embedding in an appliance. As you say, Unbound and dnsmasq are much better candidates.

alexdelprete commented 2 years ago

CoreDNS is actually pretty good where it's commonly used - in Kubernetes clusters with decent connectivity

unbound/dnsmasq work ok in any condition I've had the possibility to see...but I'm no expert of this new containerized world...old-school guy here. :)

thanks a lot for everything, I'm really grateful for your contribution to this problem.

can I contact you privately for an advice on an issue I'm having for another project? let me know what's your preferred way...thanks in advance.

bentasker commented 2 years ago

Yeah, Unbound/DNSMasq are pretty mature and resilient.

There's almost an inherent assumption in coreDNS that the person configuring it has reasonable control over the network - that contract's effectively broken in HomeAssistant because the config was dropped in without the network operator's knowledge/permission.

can I contact you privately for an advice on an issue I'm having for another project?

Sure, do you wanna DM me on the HA community forum and then I'll give you a better route of contact (saves me dropping details into a public GH issue)

alexdelprete commented 2 years ago

Sure Ben, I'll contact you on the forum first. Thanks a lot.