WireGuard / wireguard-vyatta-ubnt

WireGuard for Ubiquiti Devices
https://www.wireguard.com/
GNU General Public License v3.0

Multiple connections hang at times #146

Open danielschonfeld opened 1 year ago

danielschonfeld commented 1 year ago

Package version

1.0.20220627

Firmware version

2.0.9-hotfix.4

Device

EdgeRouter Lite / PoE - e100

Issue description

I have two connections going to a host. Occasionally my connection to that host will stop working. I see the packets going on eth0 outbound from that host with tcpdump towards my machine, but I never receive them. Normally you'd think the problem is with my machine... but... when I delete the wireguard interfaces and use a new port on the remote machine it starts working again, then I delete interfaces again and reload my old configuration and everything works as it should.

Of interest are the following connections shared by these two machines:

My machine = A
Remote machine = B
Some other machine not mentioned above = C

WireGuard tunnels set up:
A->B, A->C
B->A, B->C

B is the problematic machine with the peculiar behavior described above. I am not sure whether the fact that A and B both have a tunnel to C plays a role here, but it is the only feature I can find that distinguishes this setup from my other tunnels from A to remote WireGuard endpoints with similar equipment, none of which exhibit this problem.

Configuration and log output

No response

dulitz commented 1 year ago

While this isn't an exact match for your fact pattern... often when you see packets leaving one machine and not arriving at another, the issue is that some link on the path has a low MTU and fragmentation is not happening.

If changing the port number fixes the issue, that could suggest that multiple routing paths are in use and only one path has the low MTU. The part that isn't a good match for your facts is that switching back to the old port doesn't cause the problem to recur.

Can you make the problem go away by lowering the MTU of the wireguard interface on B? Does ping of large packets (equal to the size of your current MTU) from B to A or C reliably get responses?
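To make the MTU suggestion concrete, here is a small arithmetic sketch. It assumes the usual WireGuard per-packet overhead (outer IP header, 8-byte UDP header, 32 bytes of WireGuard data header plus authentication tag) and an IPv4 ping; the function names are made up for illustration.

```python
# Per-packet overhead, in bytes.
IPV4_HEADER = 20
IPV6_HEADER = 40
UDP_HEADER = 8
WG_DATA_OVERHEAD = 32  # 16-byte data message header + 16-byte auth tag
ICMP_HEADER = 8


def tunnel_mtu(path_mtu: int, underlay_ipv6: bool = False) -> int:
    """Largest inner packet that fits in one underlay packet of path_mtu bytes."""
    ip_header = IPV6_HEADER if underlay_ipv6 else IPV4_HEADER
    return path_mtu - ip_header - UDP_HEADER - WG_DATA_OVERHEAD


def ping_payload_for(packet_size: int) -> int:
    """Value to pass to `ping -s` so the IPv4 ICMP packet is packet_size bytes."""
    return packet_size - IPV4_HEADER - ICMP_HEADER


if __name__ == "__main__":
    print(tunnel_mtu(1500))        # IPv4 underlay
    print(tunnel_mtu(1500, True))  # IPv6 underlay
    print(ping_payload_for(1420))  # ping size that fills a 1420-byte MTU
```

If the underlay path MTU turns out to be smaller than 1500, the same function gives a candidate value to set on the wireguard interface on B.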

danielschonfeld commented 1 year ago

Can you make the problem go away by lowering the MTU of the wireguard interface on B? Does ping of large packets (equal to the size of your current MTU) from B to A or C reliably get responses?

Unfortunately, the problem only starts after a long while of operating fine, so it'll be hard for me to test right away. I will try the next time it happens. I can tell you, though, that when it happens pings don't normally work, as the handshake doesn't seem to occur.

dulitz commented 1 year ago

I see. When I mentioned pings, I meant pings to the tunnel endpoint (in the "underlay network"), not pings inside the tunnel.

If you are diagnosing a potential path MTU issue -- and I don't know that's what this is but I'm suspicious -- you should characterize the path when it's working and then again when it's not, and look for differences. So do a traceroute (outside the tunnel) to show the path in the underlay network which the encrypted packets traverse. Use ping -s to find the largest packet that will pass. Record this info. Then when it's not working, try the traceroute again and the ping -s again, and see whether it's the same path and the same ping size, or not.
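The "find the largest packet that will pass" step can be automated with a binary search. The sketch below is one possible way to do it, not a tested tool: `ping_passes` assumes the Linux iputils `ping` (where `-M do` sets the don't-fragment bit), which may not match the ping shipped on EdgeOS, and both function names are hypothetical.

```python
import subprocess
from typing import Callable


def ping_passes(host: str, payload: int) -> bool:
    """Send one unfragmentable echo request; assumes iputils ping on Linux."""
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def largest_passing_payload(
    probe: Callable[[int], bool], lo: int = 0, hi: int = 1472
) -> int:
    """Largest size in [lo, hi] for which probe() is True, or -1 if none.

    Assumes that if a size passes, every smaller size passes too.
    """
    if not probe(lo):
        return -1
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

For example, `largest_passing_payload(lambda n: ping_passes("Machine-A", n))` would probe the underlay path toward A. A single lost packet reads as a failure here, so it is worth repeating the run and recording the result alongside the traceroute, both when the tunnel works and when it hangs.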

Good luck.

danielschonfeld commented 3 months ago

I have made some progress in gaining insight into this problem. I still don't fully grasp it, but it's not an MTU issue.

It appears that the problem manifests itself when conntrack opens a translation that uses the same ports as the listening ports on both ends.

Concrete example:

Machine A listens on 56018. This is some Linux distro, set to persist the connection and hit the endpoint Machine-B:56019.
Machine B listens on 56019. This is UBNT EdgeOS, also set to persist the connection and hit the endpoint Machine-A:56018.

If Machine B's conntrack opened the following translation, it all works fine:
udp src=Machine-B dst=Machine-A sport=56019 dport=56018 src=Machine-A dst=Machine-B sport=56019 dport=(random port)

If Machine B's conntrack opened the following translation, it hangs every now and then, and once it hangs it does not recover:
udp src=Machine-B dst=Machine-A sport=56019 dport=56018 src=Machine-A dst=Machine-B sport=56019 dport=56018
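To spot the second kind of entry without reading conntrack output by eye, something like the following could be used. It is only a sketch: it assumes each entry is a single line carrying two `src/dst/sport/dport` tuples (original direction first, then reply), as in the examples above. The addresses and the port 40123 are placeholders standing in for Machine A, Machine B and the random port, and the function names are made up.

```python
import re
from typing import Dict, Set, Tuple

FIELD = re.compile(r"\b(src|dst|sport|dport)=(\S+)")


def parse_tuples(line: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split one conntrack line into (original, reply) tuples."""
    fields = FIELD.findall(line)
    if len(fields) < 8:
        raise ValueError("expected two src/dst/sport/dport tuples")
    return dict(fields[:4]), dict(fields[4:8])


def only_listen_ports(line: str, listen_ports: Set[int]) -> bool:
    """True if every port in both tuples is one of the listening ports."""
    original, reply = parse_tuples(line)
    ports = [original["sport"], original["dport"], reply["sport"], reply["dport"]]
    return all(p.isdigit() and int(p) in listen_ports for p in ports)


# Placeholder lines modelled on the two cases above (192.0.2.1 = A, 192.0.2.2 = B).
WORKING = ("udp src=192.0.2.2 dst=192.0.2.1 sport=56019 dport=56018 "
           "src=192.0.2.1 dst=192.0.2.2 sport=56019 dport=40123")
HANGING = ("udp src=192.0.2.2 dst=192.0.2.1 sport=56019 dport=56018 "
           "src=192.0.2.1 dst=192.0.2.2 sport=56019 dport=56018")
```

With `listen_ports={56018, 56019}`, `only_listen_ports` is false for the working entry and true for the hanging one, so it could be run over saved conntrack listings to flag the suspect translation when the tunnel stalls.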