Open lemoer opened 3 years ago
You did keep in mind that some of our tools filter the 00:00:00:00:00:00
entry?
No, I didn't. Which tools filter it?
I thought we did in wg_established; but that filtering is implicit: the entry never completes a handshake and is therefore dropped by the awk step. Will look into this again.
E.g. the statistics export https://github.com/freifunkh/ansible/commit/5d22e2418173d2780bfe14e080ed14aff08d1905
True, but not what I had in mind. Maybe there was a shell script before that filtered it; or something in our netlink.py.
@lemoer Why should there be an fdb entry for 00:00:00:00:00:00? Does it have to do anything with the dummy peer at all?
Interesting:
for i in 01 07 08 09 10; do echo sn$i; ssh zone.ffh.s.sn$i -C bridge fdb | grep 00:00:00:00:00:00; done
sn01
00:00:00:00:00:00 dev vx-14 dst fe80::2ce:7ff:fe40:5a6c self permanent
00:00:00:00:00:00 dev vx-20 dst fe80::2fa:fcff:feb9:c861 self permanent
00:00:00:00:00:00 dev vx-21 dst fe80::28b:f8ff:fe51:88ee self permanent
00:00:00:00:00:00 dev vx-99 dst fe80::2be:e1ff:feac:7147 self permanent
sn07
sn08
sn09
00:00:00:00:00:00 dev vx-15 dst fe80::231:b7ff:fea4:a410 self permanent
00:00:00:00:00:00 dev vx-15 dst fe80::209:95ff:fe01:9ea9 self permanent
00:00:00:00:00:00 dev vx-16 dst fe80::2a5:a4ff:feec:563a self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::213:18ff:fe6e:f314 self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::252:acff:fee3:caeb self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::21b:e3ff:fe04:4409 self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::256:3cff:fe07:1fce self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::292:3cff:fe3e:21d8 self permanent
sn10
00:00:00:00:00:00 dev vx-19 dst fe80::2d7:55ff:fe3e:dbbc self permanent
00:00:00:00:00:00 dev vx-20 dst fe80::255:cdff:fe56:6f7d self permanent
00:00:00:00:00:00 dev vx-99 dst fe80::2e3:6cff:fe5d:d07c self permanent
00:00:00:00:00:00 dev vx-99 dst fe80::247:34ff:fef4:26cc self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::2b6:3dff:fe32:5577 self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::2ff:96ff:fe41:1b70 self permanent
00:00:00:00:00:00 dev vx-15 dst fe80::25a:5dff:fed9:de19 self permanent
00:00:00:00:00:00 dev vx-13 dst fe80::2af:aeff:fe58:3cdb self permanent
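To automate the check above, a small parser could flag vx interfaces that lack the all-zeros default entry. This is only a sketch: the function names are invented, and it assumes the `bridge fdb` output format shown above.

```python
# Sketch: detect vx-* devices whose "bridge fdb" output lacks the
# 00:00:00:00:00:00 default entry. Function names are hypothetical.

ZERO_MAC = "00:00:00:00:00:00"

def devices_with_default_entry(fdb_output):
    """Return the set of devices that have an all-zeros fdb entry."""
    devs = set()
    for line in fdb_output.splitlines():
        fields = line.split()
        # expected shape: <mac> dev <dev> dst <addr> ...
        if len(fields) >= 3 and fields[0] == ZERO_MAC and fields[1] == "dev":
            devs.add(fields[2])
    return devs

def missing_default_entries(fdb_output, expected_devs):
    """Which of the expected vx devices have no default entry?"""
    return sorted(set(expected_devs) - devices_with_default_entry(fdb_output))
```

One could feed it the output of `bridge fdb` and the list of vx interfaces from /sys/class/net to get a monitoring-friendly result.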
Maybe we should add a script that makes it easier to reproduce this issue, in order to get more eyes on it?
Maybe it's not a bridge fdb problem. I just observed on sn01 that the vx-...
interfaces are not added to batman:
sn01:
[root@sn01]:~ # ls -d /sys/class/net/bat* | cut -d '/' -f 5 | xargs -n 1 -I X batctl meshif X if | grep vx
[root@sn01]:~ #
sn09:
[root@sn09]:~ # ls -d /sys/class/net/bat* | cut -d '/' -f 5 | xargs -n 1 -I X batctl meshif X if | grep vx
vx-10: active
vx-11: active
vx-12: active
vx-13: active
vx-14: active
vx-15: active
vx-16: active
vx-17: active
vx-18: active
vx-19: active
vx-20: active
vx-21: active
vx-22: active
vx-23: active
vx-99: active
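A check like the one below could compare the vx interfaces present on the system with the ones batman actually reports, without the xargs pipeline. Again only a sketch: the helper name is invented, and it assumes the "vx-14: active" output format of `batctl meshif <if> if` shown above.

```python
# Sketch: find vx interfaces that exist in /sys/class/net but are not
# listed by batctl. Helper name is hypothetical.

def vx_missing_from_batman(sys_vx_ifaces, batctl_if_output):
    """vx interfaces present on the system but absent from batman."""
    in_batman = set()
    for line in batctl_if_output.splitlines():
        # expected shape: "vx-14: active"
        name = line.split(":", 1)[0].strip()
        if name.startswith("vx-"):
            in_batman.add(name)
    return sorted(set(sys_vx_ifaces) - in_batman)
```

On sn01 above, batctl listed no vx interfaces at all, so such a check would have reported every vx interface as missing.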
As a quick fix, I ran:
ls -d /sys/class/net/bat* | cut -d '/' -f 5 | grep -v bat0 | sed 's_bat__g' | xargs -n 1 -I XX systemctl start add-vx-to-batX@XX.service
My node is now properly connected. But I am not sure whether all problems described in this issue are solved.
I'll look into it tomorrow afternoon.
I added the milestone "Beginn der stabilen Phase", as this is likely a bug. But since it happens only sporadically, I am not sure whether we will resolve this issue before the "stabile Phase".
I implemented a fix for the mentioned issue in 5fc0673.
But I am not sure whether all problems described in this issue are solved.
If what you did in 5fc0673f31a435ea27903f475aa57de697f96722 is indeed a fix, we need to rewrite wait_for_iface.sh, as it's then broken, right?
I think the problem discussed here is the same as #175.
It does not make sense to have either #175 or #147 (this issue) as a blocker for the infrastructure freeze week, so I'll remove the milestone here.
Today a similar issue appeared, but this time only the route is missing while the fdb entry is there. Maybe it's related, maybe not...
(Originally reported by @bschelm via Mail.)
WG is established:
root@NDS-PoE-Test1:~# ubus call wgpeerselector.vpn status
{
"peers": {
"sn07": false,
"sn01": false,
"sn09": false,
"sn10": {
"established": 12262
},
"sn05": false
}
}
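For monitoring, the ubus JSON above could be evaluated programmatically instead of by eye. A minimal sketch, assuming only the JSON shape shown above (the function name is made up):

```python
import json

def established_peers(status_json):
    """Return {peer: established_seconds} for peers with an active session.

    Peers without a session show up as plain `false` in the ubus output,
    established ones as an object with an "established" field.
    """
    status = json.loads(status_json)
    return {name: info["established"]
            for name, info in status["peers"].items()
            if isinstance(info, dict) and "established" in info}
```

For the output above this would report sn10 as the only established peer.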
WG is established on the supernode side, too:
[root@sn10]:~ # ffh_wg_established.sh | grep dom14
95 dom14 /etc/wireguard/peers-wg/aiyion-JT-OR750i
1819344 dom14 /etc/wireguard/peers-wg/charon
595543 dom14 /etc/wireguard/peers-wg/nds-esperanto
11446 dom14 /etc/wireguard/peers-wg/nds-fwh-tresckowstr-technik-vorne
3 dom14 /etc/wireguard/peers-wg/nds-poe-test1
2268739 dom14 /etc/wireguard/peers-wg/nds-schwule-sau
1684077 dom14 /etc/wireguard/peers-wg/nds-the-dalek-mothership
1643047 dom14 /etc/wireguard/peers-wg/nds-the-tardis
683281 dom14 /etc/wireguard/peers-wg/wgtest-1043-lemoer
IPv6 of the router:
root@NDS-PoE-Test1:~# ip a s vpn
12: vpn: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default qlen 1000
link/none
inet6 fe80::2dc:dfff:fecc:981d/128 scope link
valid_lft forever preferred_lft forever
root@NDS-PoE-Test1:~# ip a s vx_vpn_wired
15: vx_vpn_wired: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1330 qdisc noqueue master bat0 state UNKNOWN group default qlen 1000
link/ether 02:29:04:5d:75:e7 brd ff:ff:ff:ff:ff:ff
inet6 fe80::29:4ff:fe5d:75e7/64 scope link
valid_lft forever preferred_lft forever
But no appropriate route is installed:
[root@sn10]:~ # ip -6 route | grep -i wg-14
fe80::213:18ff:fe6e:f314 dev wg-14 proto static metric 1024 pref medium
fe80::/64 dev wg-14 proto kernel metric 256 pref medium
Bridge fdb entry is ok:
[root@sn10]:~ # bridge fdb list | grep wg-14
1e:bd:8f:52:15:d7 dev vx-14 dst fe80::213:18ff:fe6e:f314 via wg-14 self
02:29:04:5d:75:e7 dev vx-14 dst fe80::2dc:dfff:fecc:981d via wg-14 self
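The mismatch above (fdb entry present, host route missing) can be detected by cross-checking the two outputs. A sketch, assuming the `bridge fdb` and `ip -6 route` formats shown above (the function name is invented):

```python
# Sketch: find fdb destination addresses that have no matching /128
# host route installed. Function name is hypothetical.

def fdb_dsts_without_host_route(fdb_output, route_output):
    """Link-local fdb destinations that lack an installed host route."""
    dsts = set()
    for line in fdb_output.splitlines():
        fields = line.split()
        # expected shape: <mac> dev <dev> dst <addr> ...
        if "dst" in fields:
            dsts.add(fields[fields.index("dst") + 1])
    routed = set()
    for line in route_output.splitlines():
        fields = line.split()
        # host routes are printed without a prefix length ("fe80::1 dev ..."),
        # network routes with one ("fe80::/64 dev ...")
        if fields and "/" not in fields[0]:
            routed.add(fields[0])
    return sorted(dsts - routed)
```

Run against the two outputs above, this would flag fe80::2dc:dfff:fecc:981d, the router's vpn address, as missing a route.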
Even if we restart the service, the route is not created...
Some analysis follows:
Here we see that we have 91 peers per interface:
[root@sn10]:~ # wg | grep -e '^[^ ]' | cut -d ' ' -f 1 | uniq -c
1 interface:
91 peer:
1 interface:
91 peer:
1 interface:
91 peer:
1 interface:
91 peer:
1 interface:
91 peer:
1 interface:
91 peer:
1 interface:
91 peer:
1 interface:
91 peer:
1 interface:
91 peer:
1 interface:
91 peer:
1 interface:
91 peer:
1 interface:
91 peer:
1 interface:
91 peer:
1 interface:
91 peer:
1 interface:
91 peer:
A small patch applied to netlink.py:
diff --git a/netlink.py b/netlink.py
index 31a1e76..743dfdb 100644
--- a/netlink.py
+++ b/netlink.py
@@ -97,10 +97,13 @@ class ConfigManager:
         with WireGuard() as wg:
             clients = wg.info(self.wg_interface)[0].WGDEVICE_A_PEERS.value
+            print(f"LEN: {len(clients)}, iface={self.wg_interface}")
             for client in clients:
                 latest_handshake = client.WGPEER_A_LAST_HANDSHAKE_TIME["tv_sec"]
                 public_key = client.WGPEER_A_PUBLIC_KEY["value"].decode("utf-8")
+                print(f"A: {public_key}")
+
                 peer = self.find_by_public_key(public_key)
                 if len(peer) < 1:
                     peer = WireGuardPeer(public_key)
Shows only 89 or 90 peers:
[root@sn10]:~ # /usr/bin/python3 /srv/wireguard/vxlan-glue/netlink.py -c /etc/wireguard/netlink_cfg.json | grep LEN
LEN: 90, iface=wg-10
LEN: 90, iface=wg-11
LEN: 90, iface=wg-12
LEN: 90, iface=wg-13
LEN: 89, iface=wg-14
LEN: 89, iface=wg-15
LEN: 89, iface=wg-16
LEN: 90, iface=wg-17
LEN: 89, iface=wg-18
LEN: 90, iface=wg-19
LEN: 90, iface=wg-20
LEN: 90, iface=wg-21
LEN: 90, iface=wg-22
LEN: 90, iface=wg-23
LEN: 90, iface=wg-99
(Even though 90 would only be an off-by-one discrepancy, the discrepancy is not even consistent: a few interfaces report only 89.)
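To catch such a discrepancy automatically, one could compare the peer count reported by `wg` per interface with the LEN values printed by the netlink path. A sketch that parses the debug output format shown above (the helper name is invented):

```python
import re

def netlink_count_mismatches(len_output, expected_count):
    """Interfaces whose netlink-reported peer count differs from expected.

    Parses lines of the form "LEN: 90, iface=wg-10" as printed by the
    debug patch above and returns {iface: reported_count} for mismatches.
    """
    mismatches = {}
    for line in len_output.splitlines():
        m = re.match(r"LEN: (\d+), iface=(\S+)", line)
        if m and int(m.group(1)) != expected_count:
            mismatches[m.group(2)] = int(m.group(1))
    return mismatches
```

With the `wg` CLI reporting 91 peers per interface, every line of the output above would be flagged.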
This week's finding has now been fixed in https://github.com/freifunkh/wireguard-vxlan-glue/commit/7c876de05a30f4ff946065a34d5f9a63555d316f .
Can this be closed as of the last comment? Or is there anything we can/should regularly test (via monitoring)?
I think it reads as if only that week's finding was resolved, not the whole issue. But maybe it has been regardless.
I just found for my router that the bridge fdb entry for 00:00:00:00:00:00 was missing when I used bridge fdb. Only
72:4c:e2:db:6f:37 dev vx-99 dst fe80::247:34ff:fef4:26cc via wg-99 self
is visible.

Details: systemctl restart wg_netlink.service didn't help; the 00:00:00:00:00:00 entry still does not exist. We should keep an eye on this.