routers apparently prefer fastd

AiyionPrime commented 3 years ago

@CodeFetch and @bschelm observed that routers tend to prefer connecting via fastd rather than WireGuard.

@CodeFetch further found this to be connected to packet loss in WireGuard.

We need statistics to back these theses up.
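One way to start gathering such numbers is to ping through the tunnel interface at regular intervals and record the loss percentage. The helper below is only a sketch: the function name `loss_percent` is made up for illustration, and it assumes that ping's summary line contains "<number>% packet loss", which is the wording printed by both BusyBox and iputils ping.

```shell
# Read ping output on stdin and print only the packet-loss percentage.
# Assumption: the summary line contains "<number>% packet loss".
# Prints nothing if no such line is found.
loss_percent() {
    sed -n 's/.* \([0-9][0-9.]*\)% packet loss.*/\1/p'
}
```

On a router this could be used as `ping -c 20 -I vpn "$TARGET" | loss_percent`, where `$TARGET` is a placeholder for an address that is reachable only through the tunnel.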

bschelm commented 3 years ago

A router that has a WG connection and several wifi mesh partners seemed to have lost its connection to WG, although the router's status page still showed it as connected to the WG supernode. However, the router did not or could not use that WG connection and instead routed via the wifi mesh.

What I tried was disabling wifi for 5 minutes via "wifi down ; sleep 300 ; wifi" in order to force the router to use the WG connection instead of the wifi mesh path. That didn't work: the router was offline for those 5 minutes.

What helped was restarting WG with "ifdown vpn ; sleep 5 ; ifup vpn".

lemoer commented 3 years ago

Hi Bernd,

thanks for the description. I would like to collect some more information:

When did this happen?
How often do you observe this?
Does wg connect to different supernodes when you run "ifdown vpn ; sleep 5 ; ifup vpn"? (The supernode should be chosen randomly here.)
If it (randomly) picks the same supernode as before, does the problem persist?
Is this happening with all supernodes?
Could you please check with "batctl n" whether the node still sees a batman neighbor on the "vx_vpn_wired" interface (while the error is occurring)?
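The last check can be scripted so that it is easy to repeat or to log. The following is a rough sketch, not an existing tool: `has_neighbor_on` is a hypothetical helper name, and it assumes the B.A.T.M.A.N. IV table layout, in which each neighbor row of "batctl n" starts with the interface name followed by the neighbor's MAC address.

```shell
# Exit 0 if the "batctl n" output on stdin lists at least one neighbor
# on the given interface, exit 1 otherwise.
# Assumption: neighbor rows look like "<interface> <MAC> <last-seen>".
has_neighbor_on() {
    awk -v ifname="$1" '
        $1 == ifname && $2 ~ /^([0-9a-fA-F][0-9a-fA-F]:)+[0-9a-fA-F][0-9a-fA-F]$/ { found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

Usage on the router would be along the lines of `batctl n | has_neighbor_on vx_vpn_wired || echo "no VPN neighbor"`.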


bschelm commented 3 years ago

I would have to wait for another occasion. It happened twice already. I can't tell when it happened because the router, in that case, is still online via mesh. You see it only when you click on the router. After restarting WG, it connected to a different SN.

lemoer commented 3 years ago

I added a graph in the router dashboard in Grafana at the very bottom, which shows the vpn neighbors.

https://stats.ffh.zone/d/000000021/router-fur-meshviewer?orgId=1

@bschelm: Can you check whether the outages are visible there?


bschelm commented 3 years ago

Nope. VPN-Neighbours is always zero. Same on my router.

lemoer commented 3 years ago

Screenshot from 2021-02-27 14-14-54

@bschelm I added another graph to the dashboard. It's quite messy, so I selected some traces and posted a screenshot above. The selected traces contain the rx TQ from and the tx TQ to the supernodes. Do your outages correlate with the gaps in the graph?

lemoer commented 3 years ago

Well, the time range is kinda long. Here is a more detailed screenshot of the recent history:

Screenshot from 2021-02-27 14-24-31

lemoer commented 3 years ago

From all I have heard, this doesn't happen very often. So let's start with our Infrastructure Freeze Week and see whether it occurs again in that week. If it happens again, please do not "fix" it directly, but collect as much data as possible:

output of "batctl n" from the router
output of "batctl meshif bat14 n" from the connected supernode
output of "wg show" from the router
output of "wg show" from the connected supernode
screenshot of the status page of the router
output of "ip -6 route" from the router
output of "ip -6 route" from the supernode
20 seconds of "tcpdump -n -i vpn inbound -w /tmp/test1.pcap" from the router (collect it via scp)
20 seconds of "tcpdump -n -i vpn outbound -w /tmp/test2.pcap" from the router (collect it via scp)
20 seconds of "tcpdump -n -i vx_vpn_wired inbound -w /tmp/test3.pcap" from the router (collect it via scp)
20 seconds of "tcpdump -n -i vx_vpn_wired outbound -w /tmp/test4.pcap" from the router (collect it via scp)
20 seconds of "tcpdump -n -i br-wan inbound -w /tmp/test5.pcap" from the router (collect it via scp)
20 seconds of "tcpdump -n -i br-wan outbound -w /tmp/test6.pcap" from the router (collect it via scp)
20 seconds of "tcpdump -n -i vx-14 inbound -w /root/test7.pcap" from the supernode (collect it via scp)
20 seconds of "tcpdump -n -i vx-14 outbound -w /root/test8.pcap" from the supernode (collect it via scp)
20 seconds of "tcpdump -n -i wg-14 inbound -w /root/test9.pcap" from the supernode (collect it via scp)
20 seconds of "tcpdump -n -i wg-14 outbound -w /root/test10.pcap" from the supernode (collect it via scp)
output of "bridge fdb show | grep vx" from the connected supernode
output of "logread" from the router
output of "uci export" from the router
output of "ip addr show" from the router
Find the exact time when the problem started.

Hopefully this data will be enough to find the issue.
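To make it less likely that something is forgotten in the moment, the router-side part of the list could be wrapped in a small script. This is a sketch under assumptions, not a tested tool: the function names `collect`, `capture` and `collect_router`, the default directory /tmp/wg-debug and the output file names are all made up here. Unlike the commands in the list, it passes "-w" before the "inbound"/"outbound" filter, so that all options precede the filter expression. The supernode-side items and the status page screenshot still have to be gathered by hand.

```shell
# Run a command and store its stdout and stderr in "$OUT_DIR/<label>.txt".
# A failing command is noted in the file but does not abort the collection.
collect() {
    label=$1
    shift
    "$@" > "$OUT_DIR/$label.txt" 2>&1 || echo "exit status: $?" >> "$OUT_DIR/$label.txt"
}

# Run tcpdump with the given arguments for a fixed number of seconds.
# Usage: capture SECONDS tcpdump-arguments...
capture() {
    seconds=$1
    shift
    tcpdump "$@" &
    pid=$!
    sleep "$seconds"
    kill "$pid" 2>/dev/null || true
    wait "$pid" 2>/dev/null || true
    return 0
}

# Router-side items from the list above (hypothetical wrapper).
collect_router() {
    OUT_DIR=${1:-/tmp/wg-debug}
    mkdir -p "$OUT_DIR"
    date > "$OUT_DIR/collected-at.txt"   # when the data was gathered
    collect batctl-n   batctl n
    collect wg-show    wg show
    collect ip6-route  ip -6 route
    collect logread    logread
    collect uci-export uci export
    collect ip-addr    ip addr show
    for ifname in vpn vx_vpn_wired br-wan; do
        for direction in inbound outbound; do
            capture 20 -n -i "$ifname" -w "$OUT_DIR/$ifname-$direction.pcap" "$direction"
        done
    done
}
```

After sourcing the file on the affected router, `collect_router` would leave everything in one directory that can then be fetched via scp in a single step.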

lemoer commented 3 years ago

I think this is the same issue as #147.

lemoer commented 3 years ago

It does not make sense to have either #175 (this issue) or #147 as blocker for the infrastructure freeze week, so I'll remove the milestone here.

AiyionPrime commented 3 years ago

> I think this is the same issue as #147.

I don't remember exactly why, but we came to the conclusion that it wasn't. Maybe @1977er remembers this better, but I think it was because some fixes applied on sn09 did not correlate with this issue being resolved.

lemoer commented 1 year ago

Is this still an issue?

AiyionPrime commented 1 year ago

We still have both WireGuard and fastd nodes and have not yet resolved the issue.

lemoer commented 1 year ago

Is there any setup where we have seen this recently?

CC: @bschelm?


freifunkh / ansible

routers apparently prefer fastd #175