freifunkh / ansible

Here we store all Ansible roles and configs used for Freifunk Hannover.

wireguard: bridge fdb entry sometimes missing? #147

Open · lemoer opened this issue 3 years ago

lemoer commented 3 years ago

I just noticed for my router that the bridge fdb entry for 00:00:00:00:00:00 was missing when I ran bridge fdb. Only the following entry is visible:

72:4c:e2:db:6f:37 dev vx-99 dst fe80::247:34ff:fef4:26cc via wg-99 self

We should keep an eye on this.
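
For reference, a sketch (not a root-cause fix): whether the all-zeros entry is present can be checked per interface, and it can be re-added by hand, mirroring the entries that are normally there. The destination below is simply taken from the remaining entry above:

bridge fdb show dev vx-99 | grep '^00:00:00:00:00:00'
bridge fdb append 00:00:00:00:00:00 dev vx-99 dst fe80::247:34ff:fef4:26cc via wg-99 self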

AiyionPrime commented 3 years ago

You did keep in mind that some of our tools filter the 00:00:00:00:00:00 entry?

lemoer commented 3 years ago

No, I didn't. Which tools filter it?

AiyionPrime commented 3 years ago

I thought we did in wg_established; but there the filtering is implicit, as that entry simply never handshakes and is therefore dropped by awk. I will look into this again.
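
For illustration only (a sketch, not necessarily what our script actually does, and wg-14 is just an example interface): a handshake filter along these lines implicitly drops any peer that has never completed a handshake, because its latest-handshake timestamp stays 0:

wg show wg-14 latest-handshakes | awk '$2 != 0'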

CodeFetch commented 3 years ago

E.g. the statistics export https://github.com/freifunkh/ansible/commit/5d22e2418173d2780bfe14e080ed14aff08d1905

AiyionPrime commented 3 years ago

True, but not what I had in mind. Maybe there was a shell script before that filtered it, or something in our netlink.py.

https://github.com/freifunkh/ansible/blob/5d22e2418173d2780bfe14e080ed14aff08d1905/roles/ffh.mesh_wireguard/files/bin/ffh_wg_stats.py#L18

CodeFetch commented 3 years ago

@lemoer Why should there be an fdb entry for 00:00:00:00:00:00? Does it have anything to do with the dummy peer at all?

AiyionPrime commented 3 years ago

Interesting:

for i in 01 07 08 09 10; do echo sn$i; ssh zone.ffh.s.sn$i -C bridge fdb | grep 00:00:00:00:00:00; done

sn01
00:00:00:00:00:00 dev vx-14 dst fe80::2ce:7ff:fe40:5a6c self permanent
00:00:00:00:00:00 dev vx-20 dst fe80::2fa:fcff:feb9:c861 self permanent
00:00:00:00:00:00 dev vx-21 dst fe80::28b:f8ff:fe51:88ee self permanent
00:00:00:00:00:00 dev vx-99 dst fe80::2be:e1ff:feac:7147 self permanent
sn07
sn08
sn09
00:00:00:00:00:00 dev vx-15 dst fe80::231:b7ff:fea4:a410 self permanent
00:00:00:00:00:00 dev vx-15 dst fe80::209:95ff:fe01:9ea9 self permanent
00:00:00:00:00:00 dev vx-16 dst fe80::2a5:a4ff:feec:563a self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::213:18ff:fe6e:f314 self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::252:acff:fee3:caeb self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::21b:e3ff:fe04:4409 self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::256:3cff:fe07:1fce self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::292:3cff:fe3e:21d8 self permanent
sn10
00:00:00:00:00:00 dev vx-19 dst fe80::2d7:55ff:fe3e:dbbc self permanent
00:00:00:00:00:00 dev vx-20 dst fe80::255:cdff:fe56:6f7d self permanent
00:00:00:00:00:00 dev vx-99 dst fe80::2e3:6cff:fe5d:d07c self permanent
00:00:00:00:00:00 dev vx-99 dst fe80::247:34ff:fef4:26cc self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::2b6:3dff:fe32:5577 self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::2ff:96ff:fe41:1b70 self permanent
00:00:00:00:00:00 dev vx-15 dst fe80::25a:5dff:fed9:de19 self permanent
00:00:00:00:00:00 dev vx-13 dst fe80::2af:aeff:fe58:3cdb self permanent

AiyionPrime commented 3 years ago

Maybe we should add a script that allows reproducing this issue with less effort, in order to get more eyes on it? Something like the sketch below, perhaps.
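
Only a rough sketch (the vx-<domain> to bat<domain> mapping and the interface naming are assumed from the outputs in this thread):

#!/bin/sh
# Report VXLAN interfaces that lack the all-zeros fdb entry or that are
# not attached to their batman instance.
for path in /sys/class/net/vx-*; do
    vx=$(basename "$path")
    dom=${vx#vx-}
    if ! bridge fdb show dev "$vx" | grep -q '^00:00:00:00:00:00'; then
        echo "$vx: all-zeros fdb entry missing"
    fi
    if ! batctl meshif "bat$dom" if 2>/dev/null | grep -q "^$vx:"; then
        echo "$vx: not attached to bat$dom"
    fi
done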

lemoer commented 3 years ago

Maybe it's not a bridge fdb problem. I just observed on sn01 that the vx-... interfaces are not added to batman:

sn01:

[root@sn01]:~ # ls -d /sys/class/net/bat* | cut -d '/' -f 5 | xargs -n 1 -I X batctl meshif X if | grep vx
[root@sn01]:~ # 

sn09:

[root@sn09]:~ # ls -d /sys/class/net/bat* | cut -d '/' -f 5 | xargs -n 1 -I X batctl meshif X if | grep vx
vx-10: active
vx-11: active
vx-12: active
vx-13: active
vx-14: active
vx-15: active
vx-16: active
vx-17: active
vx-18: active
vx-19: active
vx-20: active
vx-21: active
vx-22: active
vx-23: active
vx-99: active
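
For comparison, attaching a missing VXLAN interface to its batman instance by hand would look roughly like this (domain 14 is only an example, and the vx-14 to bat14 mapping is an assumption):

batctl meshif bat14 if add vx-14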

lemoer commented 3 years ago

As a quickfix, I fired off:

ls -d /sys/class/net/bat* | cut -d '/' -f 5 | grep -v bat0 | sed 's_bat__g' | xargs -n 1 -I XX systemctl start add-vx-to-batX@XX.service

My node is now properly connected. But I am not sure whether all problems described in this issue are solved.

AiyionPrime commented 3 years ago

I'll look into it tomorrow in the afternoon.

lemoer commented 3 years ago

I added the milestone "Beginn der stabilen Phase" (start of the stable phase), as this is likely a bug. But as it happens only sporadically, I am not sure whether we will resolve this issue before the "stabile Phase".

lemoer commented 3 years ago

I implemented a fix for the mentioned issue in 5fc0673.

But I am not sure whether all problems described in this issue are solved.

AiyionPrime commented 3 years ago

If what you did in 5fc0673f31a435ea27903f475aa57de697f96722 is indeed a fix, we need to rewrite wait_for_iface.sh, as it's then broken, right?

lemoer commented 3 years ago

I think the problem discussed here is the same as #175.

lemoer commented 3 years ago

It does not make sense to have either #175 or #147 (this issue) as a blocker for the infrastructure freeze week, so I'll remove the milestone here.

lemoer commented 3 years ago

Today a similar issue appeared, but this time only the route is missing while the fdb entry is present. Maybe it's related, maybe not...

(Originally reported by @bschelm via Mail.)


I collected some data:

WG is established (router side):

root@NDS-PoE-Test1:~# ubus call wgpeerselector.vpn status
{
    "peers": {
        "sn07": false,
        "sn01": false,
        "sn09": false,
        "sn10": {
            "established": 12262
        },
        "sn05": false
    }
}

WG is established (supernode side, sn10):

[root@sn10]:~ # ffh_wg_established.sh | grep dom14
95  dom14   /etc/wireguard/peers-wg/aiyion-JT-OR750i
1819344 dom14   /etc/wireguard/peers-wg/charon
595543  dom14   /etc/wireguard/peers-wg/nds-esperanto
11446   dom14   /etc/wireguard/peers-wg/nds-fwh-tresckowstr-technik-vorne
3   dom14   /etc/wireguard/peers-wg/nds-poe-test1
2268739 dom14   /etc/wireguard/peers-wg/nds-schwule-sau
1684077 dom14   /etc/wireguard/peers-wg/nds-the-dalek-mothership
1643047 dom14   /etc/wireguard/peers-wg/nds-the-tardis
683281  dom14   /etc/wireguard/peers-wg/wgtest-1043-lemoer

IPv6 addresses of the router:

root@NDS-PoE-Test1:~# ip a s vpn
12: vpn: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none 
    inet6 fe80::2dc:dfff:fecc:981d/128 scope link 
       valid_lft forever preferred_lft forever
root@NDS-PoE-Test1:~# ip a s vx_vpn_wired
15: vx_vpn_wired: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1330 qdisc noqueue master bat0 state UNKNOWN group default qlen 1000
    link/ether 02:29:04:5d:75:e7 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::29:4ff:fe5d:75e7/64 scope link 
       valid_lft forever preferred_lft forever

But no appropriate route is installed:

[root@sn10]:~ # ip -6 route | grep -i wg-14
fe80::213:18ff:fe6e:f314 dev wg-14 proto static metric 1024 pref medium
fe80::/64 dev wg-14 proto kernel metric 256 pref medium
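
The route missing here would be the /128 host route for the router's WireGuard link-local address shown in the ip a s vpn output above (fe80::2dc:dfff:fecc:981d); for a working peer such a route exists (the fe80::213:18ff:fe6e:f314 entry). As a manual workaround, mirroring that existing static route, it could presumably be added with:

ip -6 route add fe80::2dc:dfff:fecc:981d/128 dev wg-14 proto static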

The bridge fdb entries are OK:

[root@sn10]:~ # bridge fdb list | grep wg-14
1e:bd:8f:52:15:d7 dev vx-14 dst fe80::213:18ff:fe6e:f314 via wg-14 self 
02:29:04:5d:75:e7 dev vx-14 dst fe80::2dc:dfff:fecc:981d via wg-14 self 

lemoer commented 3 years ago

Even if we restart the service, the route is not created...

Some analysis follows:

Here we see that we have 91 peers per interface:

[root@sn10]:~ # wg | grep -e '^[^ ]' | cut -d ' ' -f 1 | uniq -c
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:

A small patch applied to netlink.py:

diff --git a/netlink.py b/netlink.py
index 31a1e76..743dfdb 100644
--- a/netlink.py
+++ b/netlink.py
@@ -97,10 +97,13 @@ class ConfigManager:
         with WireGuard() as wg:
             clients = wg.info(self.wg_interface)[0].WGDEVICE_A_PEERS.value

+            print(f"LEN: {len(clients)}, iface={self.wg_interface}")
             for client in clients:
                 latest_handshake = client.WGPEER_A_LAST_HANDSHAKE_TIME["tv_sec"]
                 public_key = client.WGPEER_A_PUBLIC_KEY["value"].decode("utf-8")

+                print(f"A: {public_key}")
+
                 peer = self.find_by_public_key(public_key)
                 if len(peer) < 1:
                     peer = WireGuardPeer(public_key)

It shows only 89 or 90 peers per interface:

[root@sn10]:~ # /usr/bin/python3 /srv/wireguard/vxlan-glue/netlink.py -c /etc/wireguard/netlink_cfg.json | grep LEN
LEN: 90, iface=wg-10
LEN: 90, iface=wg-11
LEN: 90, iface=wg-12
LEN: 90, iface=wg-13
LEN: 89, iface=wg-14
LEN: 89, iface=wg-15
LEN: 89, iface=wg-16
LEN: 90, iface=wg-17
LEN: 89, iface=wg-18
LEN: 90, iface=wg-19
LEN: 90, iface=wg-20
LEN: 90, iface=wg-21
LEN: 90, iface=wg-22
LEN: 90, iface=wg-23
LEN: 90, iface=wg-99

(Even if 90 were only an off-by-one discrepancy, it would not even be consistent, as a few interfaces show only 89.)
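
The mismatch can also be checked directly against the kernel's own view (wg-14 is just an example), which lists all 91 peers while netlink.py sees only 89 here:

wg show wg-14 peers | wc -l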

lemoer commented 3 years ago

This week's finding has now been fixed in https://github.com/freifunkh/wireguard-vxlan-glue/commit/7c876de05a30f4ff946065a34d5f9a63555d316f.

1977er commented 1 year ago

Can this be closed as of the last comment? Or is there anything we can/should regularly test (via monitoring)?

AiyionPrime commented 1 year ago

I think it reads as if only that week's finding, and not the whole issue, was resolved. But maybe it has been regardless.