[IDEA] FFMUC Routed only with B.A.T.M.A.N. inside the local mesh

awlx commented 3 years ago

This is a draft idea of how we could switch FFMUC to a routed approach without losing the functionality of B.A.T.M.A.N. for local meshes.

Problem statement

We want to switch Freifunk Munich to a routed approach towards the gateways, because large layer2 domains pose too many problems. Also we want to get rid of the overhead of VXLAN and B.A.T.M.A.N. towards the gateways.

Idea

Use wireguard to connect to the Freifunk Munich gateways
Inside wireguard use a calculated link-local address which is derived from the public key
v6
- Run radvd on nodes which have an established wireguard tunnel to announce the v6 /64 inside the local network
- the local /64 is assigned via wgkex
- Default route via the wireguard tunnel
v4
- Use a fixed /20 per segment and set the next-hop to the v6 address of the gateway, also NAT on the node itself.
- The node runs DHCP thus it becomes the default gateway for the local network. Also set B.A.T.M.A.N. GW Mode to server.
- We need a transfer network between gateway and node
Meshing
- The node runs B.A.T.M.A.N. for local meshing just the same as on "normal" Gluon
Why not babel?
- We want to stay compatible to old nodes, which can just mesh like before.
- A routing protocol is not needed in this approach, thus we avoid another failure domain.
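
The proposal does not say how the link-local address is derived from the public key. As a minimal sketch of one possible scheme, the Python below hashes the raw key and uses the first 64 bits as the interface identifier inside fe80::/64. The choice of SHA-256, the truncation to 64 bits, and the function name `link_local_from_pubkey` are assumptions for illustration only, not what wgkex or Gluon actually do.

```python
import base64
import hashlib
import ipaddress


def link_local_from_pubkey(pubkey_b64: str) -> ipaddress.IPv6Address:
    """Derive a stable fe80::/64 address from a base64 WireGuard public key.

    Hypothetical scheme: SHA-256 over the raw key, first 8 bytes become the
    interface identifier. The same key always yields the same address.
    """
    raw = base64.b64decode(pubkey_b64, validate=True)
    if len(raw) != 32:
        raise ValueError("a WireGuard public key is 32 bytes long")
    digest = hashlib.sha256(raw).digest()
    interface_id = int.from_bytes(digest[:8], "big")
    prefix = int(ipaddress.IPv6Address("fe80::"))
    return ipaddress.IPv6Address(prefix | interface_id)
```

Because the address depends only on the key, node and gateway can each compute the peer's address without any extra exchange. Truncating a hash to 64 bits means collisions are possible in principle, so a real deployment would have to decide whether that risk is acceptable.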

What needs to be done?

Test setup with that approach (two Raspberry Pis or something similar)
Changes to Gluon (dhcp-server, radvd, NAT)
wgkex needs to get a backend database from which transfer v4 addresses are picked
wgkex also needs to have a database for v6 /64
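
To make the two database items more concrete, here is a minimal Python sketch of a prefix pool that hands out one subnet per public key and returns the same subnet when the key asks again. The class name `PrefixPool`, the in-memory dict standing in for a real database, and the documentation prefixes in the usage lines are all assumptions; wgkex does not necessarily work this way.

```python
import ipaddress


class PrefixPool:
    """Hand out fixed-size subnets from a parent prefix, keyed by public key.

    Hypothetical stand-in for the wgkex backend: assignments live in a dict
    here, where a real broker would persist them in a database.
    """

    def __init__(self, parent: str, new_prefix: int):
        # subnets() returns a generator, so unused subnets are never materialised
        self._free = ipaddress.ip_network(parent).subnets(new_prefix=new_prefix)
        self._by_key = {}

    def allocate(self, pubkey: str):
        """Return the subnet for this key, assigning the next free one if new."""
        if pubkey in self._by_key:
            return self._by_key[pubkey]
        try:
            subnet = next(self._free)
        except StopIteration:
            raise RuntimeError("prefix pool exhausted") from None
        self._by_key[pubkey] = subnet
        return subnet


# Placeholder prefixes from the documentation ranges, not FFMUC address space.
v6_pool = PrefixPool("2001:db8::/48", 64)       # one /64 per node
transfer_pool = PrefixPool("192.0.2.0/24", 32)  # one v4 transfer address per node
```

The same class covers both cases because only the parent prefix and the subnet size differ. Releasing or expiring assignments is deliberately left out of the sketch.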

Possible issues

Kernel of OpenWRT is too old and doesn't support v6 next-hops for v4
Meshing freaks out
IP address conflicts while roaming

Known Issues

No IPv4 Connectivity between clients which are not in the same local mesh
Potential IPv4 collisions in spontaneous meshes

Glossary

Nodes => Freifunk Router
Gateway => Supernode

Discussion

https://chat.ffmuc.net/freifunk/channels/firmware

Comments welcome! 🚀

SmithChart commented 3 years ago

Freifunk Braunschweig has been working on such a network for some time now. The codename is Parker (not BATMAN, but still something with nets): https://freifunk-bs.de/parker.html Our small test network is online and works like a charm: https://freifunk-bs.de/parker.html

We are currently working on making the migration path from classic Gluon to Parker smooth. Afterwards we want to migrate our domain to that new technology.

We have talked a bit about it back in 2018: https://stratum0.org/blog/posts/2018/11/22/freifunk-parker/

If you are interested, we would like to share our results, architecture and code with you. One option would be our weekly meetup (Wednesdays, 19:00, see freifunk-bs.de), or let's schedule a conference :)

Cheers Chris

awlx commented 3 years ago

Hi @SmithChart that sounds great :). I already read about your project in the past.

lemoer commented 3 years ago

Nice and interesting post! (I hope it is okay, that I am posting here)

IP address conflicts while roaming

I have been wondering for some time how relevant this is....

@NeoRaider always emphasized that roaming from any node in the network to any other node in the network must be possible. I don't remember the reasons and his explanations in detail.

I for one wonder if this is even a relevant case. For now, I would guess that devices are only roaming within the local mesh cloud in most cases. But this is only a gut feeling, since I haven't seen any evaluation of the probability of such "long distance" roaming behavior occurring yet. Perhaps @T-X knows of such evaluations? I think we discussed some time ago that such evaluations would be nice (in context of the batman translation table crc bug and batadv-scapy or something?).

However, I would love to see this and am looking forward to hearing your reports.

Cheers lemoer

SmithChart commented 3 years ago

At least for our (legacy) network we usually do not see meshes that do not see each other but allow clients to roam between the SSIDs. Our net just is not dense enough.

As long as both meshes have at least one mesh link, they are a single BATMAN domain and thus traffic will be routed to the correct router. Especially for IPv6, the client will see advertisements from the nearest router (since we use radvfilterd to align the BATMAN IPv4 filter with the IPv6 filters) and traffic will gradually shift to the closer router.

Our last part of the puzzle are really short lease times and frequent RAs.

awlx commented 3 years ago

That's also what we see here: if there is no mesh, roaming basically does not happen anyway. By the time the client arrives at the other AP, it needs to request a new lease anyway (lease time is 5 min in our case). So this is not really a concern for us.

Also, some people have raised the argument of cross-client traffic, which is basically never observed in our meshes or even domains. Most people just want to go to the internet or connect to a Spotify device right next to them, which will still be possible as one mesh shares the broadcast domain.

T-X commented 3 years ago

IP address conflicts while roaming

Regarding this point and adding to @lemoer's comment. IEEE 802.11-2016 defines an ESSID as:

[...] The key concept is that the ESS appears the same to an LLC layer as an IBSS. STAs within an ESS can communicate and mobile STAs might move from one BSS to another (within the same ESS) transparently to LLC. [...]

Or on Wikipedia:

Extended service set: [...] It is a set of one or more infrastructure basic service sets on a common logical network segment (i.e. same IP subnet and VLAN).

https://en.wikipedia.org/wiki/Service_set_(802.11_network)#Extended_service_set

So in practice, a wpa-supplicant for instance will roam from one AP to another with the same ESSID as it thinks fits (e.g. by signal strength). But wpa-supplicant will not re-negotiate DHCP upon roaming, as by definition of an ESSID it assumes it is still the same broadcast domain.

T-X commented 3 years ago

@awlx

That's also what we see here, if there is no mesh roaming basically does not happen anyway.

One counter example is an intersection of two streets:

(a)
======= cl  =======
        ||
        ||
        ||
        || (b)

You might have a node in one street and another one in a crossing street (a+b). These two do not see and therefore do not mesh with each other as there are big buildings in between. However a user (cl) with a mobile device can see both APs when it is at the intersection.

And the client device will be forced to roam when moving from one street into the other and by that moving out of sight of the AP it was initially connected to.

herbetom commented 3 years ago

The question is how often this is really the case. But that is probably very different depending on the community.

Is it possible (and how complex would it be) to signal the premature end of the lifetime to a client if it tries to do something with a wrong address?

T-X commented 3 years ago

Is it possible (and how complex would it be) to signal the premature end of the lifetime to a client if it tries to do something with a wrong address?

For IPv6 sending unsolicited Router Advertisements with a reduced (zero?) lifetime or adjusted default router preference (RFC4191) should be possible. For DHCP there seems to be a "DHCP reconfigure extension" (RFC3203) but not sure how widely it is implemented on the client side. Also that in turn seems to require "Authentication for DHCP Messages" (RFC3118) which might make this a bit more complicated in a distributed, multi-party network.

In theory there is/was also 802.11f (Inter-Access Point Protocol). But it seems it was withdrawn in 2006? (unless I'm reading the Wikipedia and IEEE timeline wrong)

https://tools.ietf.org/html/rfc4191
https://tools.ietf.org/html/rfc3203
https://tools.ietf.org/html/rfc3118
https://en.wikipedia.org/wiki/Inter-Access_Point_Protocol
https://grouper.ieee.org/groups/802/11/Reports/802.11_Timelines.htm
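
To illustrate the Router Advertisement idea, the following Python sketch packs only the fixed 16-byte ICMPv6 RA header, either with a zero router lifetime or with a low RFC 4191 default router preference. The function name `build_ra_header` is hypothetical, the checksum is left at zero on the assumption that whatever sends the packet (for example a raw ICMPv6 socket) fills it in, and no options such as prefix information are included.

```python
import struct

ICMPV6_ROUTER_ADVERTISEMENT = 134

# RFC 4191 default router preference values (2-bit Prf field)
PRF_MEDIUM = 0b00
PRF_HIGH = 0b01
PRF_LOW = 0b11


def build_ra_header(router_lifetime: int,
                    preference: int = PRF_MEDIUM,
                    cur_hop_limit: int = 64) -> bytes:
    """Pack the fixed part of a Router Advertisement (no options)."""
    if not 0 <= router_lifetime <= 0xFFFF:
        raise ValueError("router lifetime is a 16-bit value in seconds")
    if preference not in (PRF_MEDIUM, PRF_HIGH, PRF_LOW):
        raise ValueError("unknown preference value")
    if router_lifetime == 0 and preference != PRF_MEDIUM:
        # RFC 4191 asks senders to use medium preference with a zero lifetime
        raise ValueError("preference must be medium when lifetime is zero")
    flags = preference << 3  # layout: M | O | H | Prf (2 bits) | 3 reserved bits
    return struct.pack(
        "!BBHBBHII",
        ICMPV6_ROUTER_ADVERTISEMENT,  # type
        0,                            # code
        0,                            # checksum, left for the sender to fill in
        cur_hop_limit,
        flags,
        router_lifetime,
        0,                            # reachable time: unspecified
        0,                            # retransmission timer: unspecified
    )


withdraw = build_ra_header(0)               # "stop using me as default router"
deprioritise = build_ra_header(1800, PRF_LOW)  # stay usable, but as last choice
```

Whether clients react quickly enough to such an advertisement while roaming is exactly the open question in this thread; the sketch only shows that both variants are a matter of two header fields.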

goligo commented 3 years ago

Shouldn't it also be possible for an AP to terminate the WiFi connection, to force the client to reconnect and send a new DHCP request? As far as I understand, there is a Disassociation Frame as well as a Deauthentication Frame available in 802.11 to achieve this.

https://mrncciew.com/2014/10/11/802-11-mgmt-deauth-disassociation-frames/

awlx commented 3 years ago

You need to analyse the traffic of the client then and watch out for wrong leases and stuff.

goligo commented 3 years ago

For v4, why do you want to make NAT and DHCP on the node? Why not keep it on the gateway?

goligo commented 3 years ago

To sum up some discussion on chat.ffmuc.net on this topic:

NAT and DHCP must be done on the node if there is no L2 network connection to the gateway. However, the gateway will NAT again, so for v4 the source address of requests will be the gateway.
Regarding a fixed /64 network for nodes when using v6: this means the originating node can be identified based on a logged IP address even weeks or months later. There was a lengthy discussion around this. On the one hand, the arguments were that Freifunk is not an anonymizer service and that nodes can also be identified today if someone connects to the BATMAN mesh. On the other hand, it is part of the promise Freifunk makes to people running nodes that they will not be made liable for things other people do when connected to their node, and being able to identify the originating node based on an IP address alone does not make that promise easier to keep. A possible mitigation for this issue would be either to use NAT66 or to rotate the /64 at regular intervals.

awlx commented 3 years ago

Both NAT66 and rotating prefixes are a bad idea if you want to run your own services on the internet: they then require dyndns, and things could break during rollover. I won't recommend using them, as it's a step backwards and not forward.

I am also not sure what the implication is of saying "a node with key x and address space y was connected" as it's clearly used for Freifunk and nobody knows who was connected. But that's something a lawyer has to decide.

Technically the more NAT the more broken the network and we should treat IPv6 as a first class citizen and not break it just as much as IPv4.

goligo commented 3 years ago

I don't get the "run own services" argument - we are talking about public hotspots to provide internet access, I would not expect and not want anyone to offer services over my Freifunk node.

awlx commented 3 years ago

For v4, why do you want to make NAT and DHCP on the node? Why not keep it on the gateway?

If we NAT on the gateway we need routable prefixes to the Client network ... which means we need to hand out big enough networks and stuff.

awlx commented 3 years ago

I don't get the "run own services" argument - we are talking about public hotspots to provide internet access, I would not expect and not want anyone to offer services over my Freifunk node.

Freifunk was at one time meant to enable such things: you can just provide services and everyone can connect. That's part of the original idea.

goligo commented 3 years ago

I don't get the "run own services" argument - we are talking about public hotspots to provide internet access, I would not expect and not want anyone to offer services over my Freifunk node.

Freifunk at a time was meant to enable such stuff, as you can just provide things and everyone can connect that's part of the original idea.

I think we should differentiate between people actively taking part in the Freifunk network (node owners) who should be able to do so, and random anonymous WiFi users, where I don't see any benefit, but a lot of risk, if they get stable IP addresses and can host stuff on the network.

awlx commented 3 years ago

If someone wants to avoid all risk they shouldn't choose to become part of the Freifunk network. We are not here to enable illegal activity, for that other services exist.

What we want to achieve is to protect people who provide access to the network (node owners) from being sued for copyright infringement and other stuff the users did. And this is still the case, it doesn't mean we protect everyone connecting to the network from this.

goligo commented 3 years ago

What we want to achieve is to protect people who provide access to the network (node owners) from being sued for copyright infringement, which they didn't do. And this is still the case.

I doubt that this is still the case, if a copyright infringement can easily be traced back to the originating node.

awlx commented 3 years ago

But it doesn't say that the person who runs the node did it. As they just provide an open access network. But as said, that's for a lawyer to decide.

goligo commented 3 years ago

Why not leave the decision to the person running the node, whether he wants a stable /64 or not?

awlx commented 3 years ago

Because that will introduce a lot of overhead for us. On what basis should the /64 be chosen? How long does the node keep it? Our server still has the logs of who got which network.

awlx commented 3 years ago

It's also possible to just log the B.A.T.M.A.N. claim table over time and ask for MAC addresses. Those next hops are known to us at any time, because that's how it all works. Then we can also hand out the IP address of the node owner ... because it's clear which one it is.

goligo commented 3 years ago

It's also possible to just log the B.A.T.M.A.N. claimtable over time and ask for mac addresses. Those nexthops are know to us at any time, because that's how it all works. Then we can also hand out the IP address of the none owner ... because it's clear which one it is.

But we don't do this kind of logging - for a good reason.

awlx commented 3 years ago

We don't, but do we know who does? No, we don't. Layer 2 is riskier than anything else: traffic can also just be redirected through other nodes without us even noticing.

goligo commented 3 years ago

I fully agree that we should switch from the giant L2 to a routed approach and use BATMAN only where it belongs, in the local WiFi meshes. All I am asking for is that nodes should be able to decide how often they request a new /64 from the gateway, instead of getting a fixed one.

awlx commented 3 years ago

But which pool? For how long do we mark /64 as stale and not usable for others?

I don't see any benefit here, only operational overhead. Look at Freifunk Franken, who also do fixed /64, and it works. @fblaese

awlx commented 3 years ago

Also we then have to do the same for IPv4, as any WebRTC call will leak all addresses ... the NAT address of the Node as well as external NAT, as well as internal Pool.

goligo commented 3 years ago

Oh, I see a clear benefit, which is protection of node owners from liability for things third parties are doing using their node for internet access.

awlx commented 3 years ago

Should make 0 difference as said ... it's not trackable either.

awlx commented 3 years ago

But before ... we de-rail this thread even more from technical standpoints to only meta discussions, we should first try if this approach would even work ... maybe it makes no sense at all from a technical stand-point and this whole discussion was unnecessary.

awlx commented 3 years ago

So the best thing would be if someone tests the technical aspects and proves that the idea is possible.

awlx commented 3 years ago

Maybe a good thing for @ce-4, @lqb and @goligo to play with ... as it's a good chance to learn and some want to get rid of B.A.T.M.A.N. traffic on the Unifi controller. Also this will lead to a deep understanding of Gluon, Gateways and B.A.T.M.A.N.

fblaese commented 3 years ago

I don't see any benefit here only operational overhead, looking at Freifunk Franken who also do fixed /64 and it works. @fblaese

Up until now we haven't had any issues with static address assignments. For convenience, we assign IPv6 prefixes anonymously (see https://sub.f3netze.de), which are then announced in our babel network by router operators. There is no guarantee that a prefix that is currently announced by a router has always been located there. We also do not log announcements (anybody could, though), so from a liability standpoint this should mostly be equivalent to B.A.T.M.A.N. advanced networks.

lqb commented 3 years ago

@awlx let's meet on meet.ffmuc.net. I want to be sure we talk about the same thing.

awlx commented 3 years ago

We can discuss this in https://chat.ffmuc.net/freifunk/channels/noc, and we should work on this asynchronously.

awlx commented 2 weeks ago

Our super-repo is now up to date: https://gitli.stratum0.org/ffbs/ffbs-gluon Here is our site: https://gitli.stratum0.org/ffbs/ffbs-site The interim state of the packages is here: https://github.com/SmithChart/community-packages/tree/topic/parker and the current gluon-parker base state: https://github.com/ffbs/gluon-parker/

awlx commented 2 weeks ago

The in-person meeting will take place on 19/20 October. FFMUC Parker firmware: https://github.com/freifunkMUC/site-ffm/tree/parker FFBA on Mumble: http://telmir.stratum0.org/ Meeting notes: https://pad.stratum0.org/p/freifunk_20240717_parker

T-X commented 2 weeks ago

Very nice to see the progress and collaboration on this! Some sort of clustering is definitely a great way to increase scalability. Four things I would be very interested in (does not need to be answered / discussed here, but I would love to read more about them in some meeting notes, FAQs or test results in the future)

1) Any plans to integrate DDHCP maybe? Would that allow using smaller IPv4 prefixes for ffmuc? (The pad says ffmuc would need a /10.)

2) What happens if the two nodes with an uplink only have a (temporarily?) bad WiFi mesh connection? Or, if they had for instance a stable 1 MBit/s throughput over the WiFi mesh, would the WiFi then still always be preferred, even if there were in theory a 1 GBit/s fiber available over a mesh-vpn? (I'm wondering if it could make sense to have batman-adv not between all nodes (of a domain), but at least between these uplink nodes that share the same WiFi mesh?)

3) Has anything changed with the roaming situation on modern cellphone operating systems? (Maybe Android and iPhones got more clever regarding when to get a new IP address? Does anyone have any current experience with the roaming behaviour between APs with the same ESSID but differing IP ranges?)

4) This (for now) seems to be a bit incompatible/divergent with the multicast related progress? Though maybe it wouldn't be that difficult to integrate/add later. Maybe just adding pim6sd and enabling it on uplink nodes (or just one uplink node in the local mesh, to avoid redundant multicast streams due to RFC4541 to each uplink node?) would mostly be enough. (I know, this might be a bit of an "opinionated" topic, and I agree that, especially if a protocol uses multicast just like broadcast, it does not scale well. It will also need more field testing. But I think generally there was a lot of progress on this in the last 10 years: more vendors understanding/implementing RFC4541, Linux bridge multicast snooping finally works after ~4 years+ of bugfixing, it seems Android has finally fixed their MLD firewall bug and there is a workaround for it in Gluon, there are now Forward-Error-Correction / RaptorQ RFCs for RTP with support in gstreamer, batman-adv now uses IGMP/MLD snooping and has a new multicast packet type, a low MLD "noise" implementation in Gluon, and routable multicast support in batman-adv (+Gluon, upstreaming WIP). And I still love this concept in general to avoid needing big, central content servers, to enable "the small people" too, to reach many people, without needing to pay directly or via ads to the big servers :-) .)
