freifunk-gluon / gluon

a modular framework for creating OpenWrt-based firmwares for wireless mesh nodes
https://gluon.readthedocs.io
Other
553 stars 325 forks source link

gluon-core: babel: broken DNS #1500

Closed blocktrron closed 4 years ago

blocktrron commented 6 years ago

On-node DNS resolving is broken as of https://github.com/freifunk-gluon/gluon/commit/b3d7011130a3b6311ed80bdca3a11a356d7d2b2a - quick fix applied by reverting said commit in https://github.com/freifunk-gluon/gluon/commit/78ed75ee2858c93ca30fbea17188eed38ff65614

Solution by @christf is pending.

christf commented 6 years ago

Due to the revert batman-based nodes are unaffected, babel-nodes with ipv6 on their wan-ports may be affected: DNS-requests may leave the node through the wrong interface, leading to not working DNS. As a consequence, clients will not experience a working network and the autoupdater will not work.

Besides this hack in commit b3d7011 we could consider adding src stanzas to all the babel-routes. I am working on a patch for this.

christf commented 6 years ago

A babel patch exists now. This issue will be fixed by using a babel version based on https://github.com/christf/babeld/tree/PREFSRC and an appropriate babeld configuration.

rotanid commented 6 years ago

@christf maybe you can post a short update on your progress and plans, i think we should talk about removing the feature again if there's no chance to get it fixed completely soon?

christf commented 6 years ago

Following a conversation on the babeld mailing list a patch was prepared by myself that is ready and working. A PR is raised upstream in babeld - see here: https://github.com/jech/babeld/pull/18. Unfortunately it is not merged yet. The only symptom by this feature not being merged is that on multi-homed nodes (having ipv6 on WAN) the kernel decides which outbound IP address is used when doiing requests sometimes causing DNS leaks.

imho not a big deal - still it is nice to get this fixed which will happen after prefsrc is merged. in babeld there are tests going on for preparing a merge of the unicast branch on which this is all based on.

There is no need to revert anything in gluon.

rotanid commented 6 years ago

ok, thanks for the update! i removed the "regression" label, as nothing else beside the new feature is broken anymore. i added a release blocker label instead.

rotanid commented 5 years ago

as far as i see it, the fix has been pushed to master branch of babeld

christf commented 5 years ago

The fix was pushed to babeld and the required configuration has been tested for two months now. Waiting for modules update to openwrt master where babeld 1.9.1, containing the patch, has just been included.

rotanid commented 4 years ago

@christf are we now on a openwrt master status which includes this? i think we are, as far as i can see it.

what does this mean, there is no issue anymore?

and what about the initial change which wanted to make DNS requests to leave via specific interfaces? @NeoRaider

rotanid commented 4 years ago

@christf @NeoRaider ping?

neocturne commented 4 years ago

The original change was broken (the 'loopback' interface doesn't even have the primary node address in non-Babel setups), but it is also too specific - all protocols need to use the correct interfaces, DNS is not special in any way.

rotanid commented 4 years ago

@NeoRaider so "someone" needs to work this out "sometime" ? :dagger:

@christf are we now on a openwrt master status which includes this? i think we are, as far as i can see it.

what does this mean, there is no issue anymore?

christf commented 4 years ago

tfor babeld we now have the ability to specify the correct src-address. This was merged in november with #1877. This leads to ipv6 traffic leaving the correct interface without other hacks. For Batman we might still have an issue when a node has ipv6 on its wan interface.

rotanid commented 4 years ago

if you think this affects batman too, do we need a new issue? what exactly is the issue? because this one here is about babel.

do we need something like b3d7011130a3b6311ed80bdca3a11a356d7d2b2a for batman? @NeoRaider

christf commented 4 years ago

We do not need something like b3d7011. If we need something, it should be like https://github.com/freifunk-gluon/gluon/pull/1877/commits/2389679380a96ef0f01dfd5b14715a1c1c94e0ea in a sense that pref-src should be set on the routes if a fix is required at all.

rotanid commented 4 years ago

as @christf and i just weren't able to reproduce a problem with the src address running Gluon v2019.1.1 with batman-adv, maybe there is no issue left. waiting for @NeoRaider opinion though...

The x86 node we tested with had a public IPv6 address on both br-client and br-wan and two default routes similar to this:

default via fe80::7eff:4dff:XXXX:1db1 dev br-wan  metric 512 
default via fe80::4459:c0ff:XXXX:4fd0 dev br-client  metric 512 

The node also had IPv6 nameservers for both br-wan and br-client. We ran tcpdump -pni br-wan src $ipv6-adresse-from-br-client and tcpdump -pni br-client src $ipv6-adresse-from-br-wan in parallel and then issued multiple commands in a loop to trigger local-origin traffic like "autoupdater -f" and restarting sysntpd

neocturne commented 4 years ago

@rotanid Are these default routes in different routing tables? The WAN default route should never appear in the main table.

I'm aware of only one serious issue regarding source address / interface selection in current Gluon (with batman-adv, but likely also affects babel): #1132. Is there still anything to fix for babel?

rotanid commented 4 years ago

@NeoRaider i did not query any routing table on purpose, afair i only did ip ro - does this help with your question?

regarding babel, as far as i understand the comment from @christf it's only about batman now

neocturne commented 4 years ago

@rotanid Hmm, I think that means that the br-wan default route was in the default table as well, weird - but I don't fully trust the busybox ip, so I might be wrong. If things are working as intended, only the mesh routes should end up in the main table (ip -6 r s table 254), while br-wan should be handled in a separate table (ip -6 r s table 1). (For some reason, "table" needs to be written out with busybox ip ...)

rotanid commented 4 years ago

@NeoRaider with busybox ip there are 27 routes. all in the table 254 output, none in table 1 output. with "ip-full" installed, there are 12 routes in table 254, 4 in table 1 and 22 in table "local". one default route is in table 254, one in table 1 all other routes look like they are in the correct table, too. (note: everything with IPv6 parameter -6 only)

any other test cases to be able to say this issue does not exist with batman-adv setups?

neocturne commented 4 years ago

Okay, I think this can be closed (and indeed Busybox ip is useless when it comes to multiple routing tables...).

The remaining weirdness in our routing setup is tracked in #1132.