Really weird routing issue (related to closed issue #13)

mobiletradingpartners commented 5 years ago

I am having a really strange issue on our test vpncloud. I can provide more data of course but am starting with a brief summary.

I'm trying to build a network on a bunch of bare metal hosts each containing a handful of VMs. It's a variation on the dial-in example, I guess.

Each host has a permanent connection to the internet, and is visible to all other hosts. The VPNcloud configuration for each host lists all the other hosts as peers.

This works fine with one proviso(* below). For example, given the hosts A, B, C:

A is in London
B is in Washington DC
C is in New York

We can ping between these hosts without problems. For example it's a steady 74.5ms between A and B: 64 bytes from 10.224.224.1: icmp_seq=348 ttl=64 time=74.0 ms 64 bytes from 10.224.224.1: icmp_seq=349 ttl=64 time=74.0 ms 64 bytes from 10.224.224.1: icmp_seq=350 ttl=64 time=74.0 ms

However as soon as I start to add the VMs into the vpncloud, it goes pear shaped. I add 5 VMs on the New York server C, each of these is configured so that it's only peer is the vpncloud instance on the bare metal C, using the private address range that I gave to that vpncloud instance. (I originally tried using the local libvirt-allocated addresses, but realised that wasn't smart in case those addresses leaked).

As soon as I make any traffic between these New York VMs across vpncloud (ie local to server C, NOT across the WAN), the ping times between A and B suddenly increase, and they keep increasing. Initially to around 1s, and after 7-8 hours use it looks like this: 64 bytes from 10.224.224.1: icmp_seq=27926 ttl=64 time=8166 ms 64 bytes from 10.224.224.1: icmp_seq=27929 ttl=64 time=8149 ms 64 bytes from 10.224.224.1: icmp_seq=27930 ttl=64 time=7143 ms

Note that this is traffic via hosts A and B; nothing to do with the VMs on server C.

Now as an experiment, and after a fair bit of head scratching, I ran a script to stop vpncloud on the VMs on server C. Watch this ping record, captured while I did this: 64 bytes from 10.224.224.1: icmp_seq=11 ttl=64 time=8537 ms 64 bytes from 10.224.224.1: icmp_seq=12 ttl=64 time=7530 ms 64 bytes from 10.224.224.1: icmp_seq=13 ttl=64 time=6522 ms 64 bytes from 10.224.224.1: icmp_seq=14 ttl=64 time=8520 ms 64 bytes from 10.224.224.1: icmp_seq=15 ttl=64 time=7513 ms <-- began stopping vpncloud on server C VMs here 64 bytes from 10.224.224.1: icmp_seq=16 ttl=64 time=6505 ms 64 bytes from 10.224.224.1: icmp_seq=17 ttl=64 time=5511 ms 64 bytes from 10.224.224.1: icmp_seq=18 ttl=64 time=4504 ms 64 bytes from 10.224.224.1: icmp_seq=19 ttl=64 time=3496 ms 64 bytes from 10.224.224.1: icmp_seq=20 ttl=64 time=5493 ms 64 bytes from 10.224.224.1: icmp_seq=21 ttl=64 time=4486 ms 64 bytes from 10.224.224.1: icmp_seq=22 ttl=64 time=3478 ms 64 bytes from 10.224.224.1: icmp_seq=23 ttl=64 time=2549 ms 64 bytes from 10.224.224.1: icmp_seq=24 ttl=64 time=1542 ms 64 bytes from 10.224.224.1: icmp_seq=25 ttl=64 time=534 ms <-- finished here 64 bytes from 10.224.224.1: icmp_seq=26 ttl=64 time=74.1 ms 64 bytes from 10.224.224.1: icmp_seq=27 ttl=64 time=73.9 ms 64 bytes from 10.224.224.1: icmp_seq=28 ttl=64 time=74.1 ms 64 bytes from 10.224.224.1: icmp_seq=29 ttl=64 time=74.2 ms 64 bytes from 10.224.224.1: icmp_seq=30 ttl=64 time=74.1 ms 64 bytes from 10.224.224.1: icmp_seq=31 ttl=64 time=74.0 ms

and like magic the ping times between A and B are back to normal.

What is relevant is that while the problem exists, the peers list on server B contains entries for the VMs on server C:

SERVER-C:1163 (ttl: 3597 s)
SERVER-C:1158 (ttl: 3597 s)
SERVER-C:1162 (ttl: 3597 s)
SERVER-A:3210 (ttl: 3393 s)
SERVER-B:3210 (ttl: 1830 s)
SERVER-C:1157 (ttl: 3597 s)
SERVER-C:1160 (ttl: 3600 s)
SERVER-C:1161 (ttl: 3597 s) but these peers on SERVER-C (apart from the bare metal instance of vpncloud) are not externally accessible because we only allow inbound udp on port 3210.

The problem does not affect all the bare metal hosts consistently (as we add others to the network). As we turn on the vpncloud on the server C VMs, some point to point links work fine and others suffer the increased latency.

With a nod to issue #13 I note that:

this problem exists whether we use tap/switch or tun/router modes: changing the whole network between these modes, I can replicate the issue either way, so using tun doesn't seem to stop the bad things happening
I am pretty sure that I cannot disable ip forwarding on the hosts (doing so busts libvirt's internal networking) but the problem will happen even if it's not been explicitly enabled.

I know I can probably resolve this by creating separate vpnclouds within each host for it's VMs, but then I would have to manage the routing myself; our fleet is large enough that I prefer not to do that.

So I am thinking that a possible solution would be to add a config setting along the lines of "don't advertise me to other peers"? Or even better, have a two level classification of "local peers" (advertised between one another) and "normal peers" (advertised globally)??

I realise that those are probably not easy changes to make, of course.

= Some of our bare metal hosts are 200-300ms away. Because of the long distances involved, I found that it was necessary to have every bare metal host ping every host across vpncloud on a script, once a minute, to stop them forgetting about one another; increasing the timeouts in the settings file wasn't enough

dswd commented 5 years ago

This is indeed very strange. What version are you using? Some previous versions had a bug that could result in continued packages being sent between nodes as their handshake never ended. Maybe that happened to you if you are using version 0.8. It should be fixed in 0.9.

mobiletradingpartners commented 5 years ago

This is version 0.9.1 on ubuntu 16.04, the hosts are fairly simple, minimal installation except of course libvirt/kvm is running.

It does appear to be engaging in a dialog with unreachable peers, but it's strange that A & B are having this problem between themselves, while not actively needing to communicate with the VMs on C.

dswd commented 5 years ago

It seems that some packets are running in a loop and being sent between sites infinitely. I think there is no packet multiplication happening, as that would cause am immediate breakdown of you network. I guess some kind of packet is running in a circle through your nodes being forwarded from one to the others. As new packets join in periodically, the traffic gets more and more and the latency higher and higher. This would perfectly explain the rampup on latency that you are seeing.

Please attach a tcpdump to the vpncloud interfaces to check what kind of packets are flowing between those nodes, especially those ones on different sites that should not see any traffic. This will help us narrow down the problem to either protocol traffic or payload traffic.

Also, I do not really understand your use case. If you have a set of VMs and you only want them to be able to communicate with the host, a bridge would be much easier than vpncloud. If you do not want those VMs to communicate with other hosts, I would suggest using two separate networks so that those nodes never meet.

mobiletradingpartners commented 5 years ago

Sure, I'll run it up again in a day or two and extract the dumps.

Regarding your use case question, you are right of course. We got to this point trying to solve two unrelated problems. Firstly that we want a reliable (meshed so no single point of failure) backhaul network for management traffic between the VMs, and secondly that the default IP ranges for what we already have in the field overlap ie the VMs on separate hosts share the same IP ranges, so I had the smart idea to use vpncloud to overcome both of those. But we can achieve what we want with just a mesh between the hosts and letting the VMs communicate over the virt bridge, with a bit of extra work on our part (which may include adjusting the default virt bridge IP ranges), so we won't be running into the problem documented above in practice.

dswd commented 5 years ago

Still it would be interesting for me to know the root cause for the problem. So many thanks for helping to debug it.

I currently see the following potential causes in decreasing likelihood:

A misconfiguration of the system routing/bridging causes packets to never reach their destination and instead be looped through the network
An internal error in the vpncloud protocol could cause vpncloud to send infinite chains of messages and replies (e.g. init messages or peer list messages).
A misconfiguration of vpncloud (e.g. switch mode for TUN devices) in combination with system routing might be able to cause those problems

dswd commented 5 years ago

Did you have time to look into the issue? Does the problem still exists?

dswd / vpncloud

Really weird routing issue (related to closed issue #13) #30