Bridge FDB entries get stuck

neocturne commented 7 years ago

Scenario:

Client 1 is connected to node A via ethernet. Client 2 is connected to node A via WLAN. Packets between client 1 and client 2 flow through the ports client0 and eth0 of br-client.

When client 2 roams to another node B, br-client on node A doesn't (always?) update the port mapping; even after many packets have flown from client 2 to client 1 (which are now forwarded from bat0 to eth0), the FDB entry for client 2 is not updated on node A; packets from client 1 to client 2 are still forwarded to client0 instead of bat0, and in consequence don't reach client 2. Only after the FDB timeout (5 minutes?), client 1 and client 2 can communicate in both directions again.

The issue is easily reproducible with the Gluon master; as I didn't notice it before, I assume that v2016.2.x is unaffected (but I didn't check yet). My setup was based on batman-adv compat 15 (I don't think the compat version is relevant though).

neocturne commented 7 years ago

@T-X: You know the bridge code better than me, any ideas?

T-X commented 7 years ago

Sounds like an issue introduced by d5829d87be. Hm, looks like setting "learning = 0" for bat0 not only disables learning new hosts on bat0, but also disables unlearning hosts on other interfaces which roamed to bat0 :-(.

Easiest fix would be to disable learning for the wifi client interfaces, too. Should be a safe change, as mac80211 will be able to drop a few cycles later, if the host is not actually there on the wifi side anymore.

However, it would still leave the same issue for a host roaming from an ethernet port to some other node.

The most clean option with no side effects I can currently think of is making a patch for the Linux bridge to add something like a "learning = 2" option which would disable learning new hosts on the according port but would leave unlearning enabled.

T-X commented 7 years ago

So, for a hotfix, we could either:

A) Disable learning on wifi interfaces -> Drawback: Timeouts still remains for wired roaming

B) Revert this patch for now -> Drawback: Maybe memory issues on devices with a small amount of RAM again?

(And then later go for a patch for the bridge, as described above.)

neocturne commented 7 years ago

Disabling learning, but not unlearning, seems pretty weird to me... how much RAM per FDB entry are we talking? If the implementation is sane, 64 byte per entry should be sufficient (very rough estimate without looking into it in detail), so 64KB for 1000 entries - is this really an actual issue?

I don't think we should further break wired roaming - while we tell people to avoid plugging dumb APs into the client ports of Gluon nodes, it's still done, and with learning disabled, leads to hard to debug roaming issues...

So I'd like to go with B) for now.

rotanid commented 7 years ago

where does it say that you shouldn't plug APs into client ports? that is a very common way to do it among Freifunkas for larger deployments where the meshing doesn't scale and/or isn't necessary

neocturne commented 7 years ago

@rotanid Such clients won't be marked as "wireless" by batman-adv, so I'm not sure how well roaming will work for them when multiple Gluon nodes are involved... (not an issue when clients are just roaming between APs connected to the same Gluon node)

Generally, using a Gluon node with disabled wireless meshing is recommended for such setups; for small setups, using a single Gluon node (and in case roaming to unrelated nodes could happen, a distinct AP ESSID) would also work.

neocturne commented 7 years ago

@T-X A net_bridge_fdb_entry is 40 bytes long; I guess SLAB_HWCACHE_ALIGN will pad it to 64 bytes. Even though it may be redundant with batman-adv, I don't see how these numbers can lead to problematic memory usage in any way.

T-X commented 7 years ago

Those were my test results back then: https://www.open-mesh.org/projects/batman-adv/wiki/Kmalloc-kmem-cache-tests

Disabling fdb entries from bat0 seemed to have increased the "amount-of-hosts-until-oom-crash by 14% (in the raw table, each test variant compared to its "nofdb" variant).

My interpretation back then was similar to the one for the kmalloc() vs. kmem-cache-alloc() patch for batman-adv: Memory fragmentation, because both are still unmovable / non-defragmentable memory regions. Using a fitting kmem-cache probably mitigates the issue but probably does not solve the underlying issue completetly.

So, even if they are just 64 bytes large, even if using kmem-cache-alloc() instead of kmalloc() using many such entries while running on a tight, available memory space (what was our current average size for small-mem devices, something like 70-75%?) might result in memory fragmentation and ultimately OOM issues, I think.

T-X commented 7 years ago

However, dumb APs connected to a node's ethernet port are a good point. That could lead to noticeable issues with suggestion A), right.

Hm, we could give solution B) a shot for now. And see whether we get new OOM reports again. In the mean time, I can work on a solution C), a patch for the bridge, just in case.

freifunk-gluon / gluon

Bridge FDB entries get stuck #1121