Tunnel won't re-establish after one of the nodes reboots

vnxme commented 4 months ago

Hi @database64128,

I've been using swgp-go for 8 months now. I have 5+ nodes with multiple proxied WireGuard tunnels between them. Sometimes the same node acts as both client and server. I'm quite happy with how swgp-go works in general, but I've recently noticed the following problem:

If I recreate a server node (i.e. the one with proxyListen) with the same config file in place (that is, I switch it off, wait a few minutes, then switch it back on), the client node (i.e. the one with wgListen) pointing to that server node appears to keep sending proxied packets to it, but the tunnel won't re-establish (the destination WG interface on the server node receives no packets) until I reboot the server/client/both.
It seems to happen sporadically. I can't reproduce it on demand or determine which factors are relevant other than the recreation/reboot, so I can't give you a list of steps to reproduce it. What is more, the server and client node logs (info level) show nothing at all.

Could you please think whether there is any mechanism in the code prohibiting the tunnel from re-establishing in the circumstances I described? Or are there any debug steps I could follow to tell you more?

Regards,

database64128 commented 4 months ago

> until I reboot the server/client/both.

Does it really take rebooting the whole system to fix it? What about only restarting the swgp-go client?

> Could you please think whether there is any mechanism in the code prohibiting the tunnel from re-establishing in the circumstances I described?

It's a really simple protocol and I can't think of anything that would have caused this.

> Or are there any debug steps I could follow to tell you more?

You could use tcpdump to check if the server actually received any packets from the client.

I suspect there are external factors at play here. You mentioned that you need to wait for a few minutes before starting the server again. During the downtime, the server system likely responds to client packets with ICMP destination port unreachable messages. Hypothetically, there could be some firewall on the path that blocks the client after seeing a certain number of such messages. But I've never seen any setup like this IRL, so I'm not really sure if this is even a thing network administrators do.
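
The ICMP behaviour described above can be observed from the client side. The following is a minimal Python sketch, not part of swgp-go; `probe_udp` is a hypothetical helper name. It assumes an IPv4 target and a Linux-like host, where a connected UDP socket surfaces an ICMP port unreachable reply as `ConnectionRefusedError`.

```python
import socket


def probe_udp(host: str, port: int, timeout: float = 1.0) -> str:
    """Send one datagram and classify what comes back.

    Returns "reply" if any datagram is received, "refused" if the OS reports
    an ICMP port unreachable error, and "timeout" if nothing arrives in time.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        # connect() on a UDP socket only fixes the peer address; it is what
        # lets the kernel deliver ICMP errors for this flow to the socket.
        sock.connect((host, port))
        try:
            sock.send(b"probe")
            sock.recv(2048)
            return "reply"
        except ConnectionRefusedError:
            return "refused"
        except socket.timeout:
            return "timeout"
```

A "refused" result while the server is down would match the behaviour described above. A "timeout" result is ambiguous: it cannot distinguish a path that silently drops packets from a listener that simply ignores the probe payload.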

vnxme commented 4 months ago

Thank you for the swift reply!

> Does it really take rebooting the whole system to fix it? What about only restarting the swgp-go client?

Actually, I only restart the Docker container running swgp-go, not the whole system.

> You could use tcpdump to check if the server actually received any packets from the client.

I ran tcpdump on the server host system, and I can see the incoming proxied packets from the client swgp-go instance. I don't have a Docker image with both tcpdump and swgp-go, so I can't check whether the packets make it into the container. But thank you for the idea.
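
Where tcpdump isn't available inside a container, a plain UDP listener can stand in for it. The following is a minimal Python sketch under stated assumptions: swgp-go is temporarily stopped, the stand-in container publishes the same UDP port, and `count_datagrams` is a hypothetical helper, not something swgp-go provides.

```python
import socket
import time


def count_datagrams(bind_addr: str, port: int, duration: float) -> int:
    """Listen on a UDP port for `duration` seconds and count datagrams."""
    count = 0
    deadline = time.monotonic() + duration
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((bind_addr, port))
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            # Never wait past the overall deadline.
            sock.settimeout(remaining)
            try:
                _, peer = sock.recvfrom(65535)
            except socket.timeout:
                break
            count += 1
            print(f"datagram {count} from {peer[0]}:{peer[1]}")
    return count
```

If the host-side capture shows packets arriving while this listener counts zero, the loss is happening between the host and the container rather than on the path from the client.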

> You mentioned that you need to wait for a few minutes before starting the server again.

Yes, it takes some time when I terminate Docker containers and recreate them (downloading layers, etc.). I also experimented with a simple container restart: it is fast, and I haven't managed to reproduce the problem that way.

> Hypothetically, there could be some firewall on the path that blocks the client after seeing a certain number of such messages.

My instances are basically Linux VPSes, and I have a src-nat rule for outgoing traffic and a dst-nat rule to forward certain ports to the swgp-go containers, nothing else relevant to the proxied WireGuard traffic. If there were any firewall rules prohibiting the traffic, the tunnels wouldn't re-establish after a simple reboot either.

database64128 commented 4 months ago

Well, I'm not very familiar with container networking and custom NAT rules, but it might still be helpful if you post the related configurations so more people can help with this.

vnxme commented 4 months ago

I continued my experiments and noticed that the problem occurs when, for a reason I haven't identified yet, some proxied WireGuard packets forwarded by the dst-nat rule on the host can't reach the swgp-go container after it is recreated, due to some failed/invalid connection state.
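
The thread does not say how the connection state was inspected. One possibility, assumed here and not confirmed by the thread, is that the host's netfilter connection tracking holds an entry for the UDP flow that has never seen a reply. The following minimal Python sketch filters conntrack-style lines for such entries. The sample lines, addresses, and port 20220 are made up for illustration, and `unreplied_udp_entries` is a hypothetical helper. On a real host the lines might come from `conntrack -L -p udp` or `/proc/net/nf_conntrack`.

```python
import re

FIELD = re.compile(r"(\w+)=(\S+)")


def unreplied_udp_entries(lines, dport):
    """Return original-direction tuples of UDP entries to `dport` marked [UNREPLIED]."""
    hits = []
    for line in lines:
        tokens = line.split()
        if "udp" not in tokens or "[UNREPLIED]" not in tokens:
            continue
        # The first src/dst/sport/dport group is the original direction;
        # later duplicates describe the expected reply and are ignored here.
        orig = {}
        for key, value in FIELD.findall(line):
            orig.setdefault(key, value)
        if orig.get("dport") == str(dport):
            hits.append((orig["src"], orig["sport"], orig["dst"], orig["dport"]))
    return hits


# Illustrative sample input only (documentation address ranges, invented ports).
SAMPLE = [
    "udp      17 28 src=203.0.113.7 dst=198.51.100.2 sport=51820 dport=20220 "
    "[UNREPLIED] src=198.51.100.2 dst=203.0.113.7 sport=20220 dport=51820 mark=0 use=1",
    "udp      17 117 src=203.0.113.9 dst=198.51.100.2 sport=40000 dport=20220 "
    "src=172.17.0.2 dst=203.0.113.9 sport=20220 dport=40000 [ASSURED] mark=0 use=1",
]

print(unreplied_udp_entries(SAMPLE, 20220))
```

In the sample, only the first line is reported: its reply tuple still points at the host address rather than a container address, which is what a flow that never matched the dst-nat rule would look like.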

So, you were right in supposing it's a host/firewall problem. Thank you for your help!

database64128 / swgp-go

Tunnel won't re-establish after one of the nodes reboots #48