OpenNebula / one-apps

Toolchain to build OpenNebula appliances
Apache License 2.0

Improve Default Gateway selection when dual NICs #23

Open Th0masL opened 2 years ago

Th0masL commented 2 years ago

Hi,

I ran into an unfortunate bug related to how the one-context.d/loc-10-network initialization scripts configure the Default Gateway on an Ubuntu VM.

First, let me explain the setup a bit.

I'm running an OpenNebula cluster of 10 Physical Hosts that was installed 1 year ago and has been working normally.

Each VM running in this cluster is connected to 2 Virtual Networks:

Because the Private Network does not have internet connectivity, I had to delete the GATEWAY attribute of the Private Network from OpenNebula Sunstone UI > Networks > Virtual Networks > MyPrivateNetwork. This ensured that the one-context.d/loc-10-network initialization scripts would only see the Public Network's Gateway, and therefore use it as the Default Gateway in the VMs, giving them correct internet connectivity.

Everything has been working perfectly fine for 1 year, until this week.

What happened this week is that I had to do maintenance on a Physical Host, so I restarted it.

When the Physical Host restarted, for some reason, the GATEWAY attribute that I had originally deleted 1 year ago from the Private Virtual Network came back, and became visible again in OpenNebula Sunstone UI > Networks > Virtual Networks > MyPrivateNetwork.

Basically, it looks like this Physical Host had never been restarted since the installation of OpenNebula 1 year ago. It seems that the deletion of the GATEWAY attribute during the original cluster setup did not get saved correctly on this specific Physical Host, so the attribute came back after the reboot.

This is a totally unexpected bug that happened randomly and is very hard to explain. I have opened a ticket in the OpenNebula/one repository (OpenNebula/one/5910), but I doubt I have enough material to reproduce the problem.

Anyway, the presence of two ETHx_GATEWAY attributes in the VM Context Variables made the one-context.d/loc-10-network network initialization scripts pick the Private Gateway IP as the default gateway instead of the usual Public Gateway. This led to a total loss of connectivity on the VM and a serious network outage on my infrastructure.

So my question is: could we implement better logic for selecting the Default Gateway in the one-context.d/loc-10-network network initialization scripts, so that it always defaults to the best possible Gateway when multiple Gateways are provided in the Context Variables?

For example, ideally it would try the following, in order of highest likelihood of being the right configuration:

If only one NIC:

If two or more NICs:

Are there any other ways we can ensure a better choice of Default Gateway?

Thanks

Thomas

baby-gnu commented 2 years ago

I think you should set the default gateway, in order:

  1. for each interface listed in a context variable like GATEWAY_INTERFACES, or maybe GATEWAY_V4_INTERFACES and GATEWAY_V6_INTERFACES (we have VMs with 4 or 5 interfaces)
  2. for the first interface, ETH0_GATEWAY, as a fallback default value.
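
To make the order above concrete, here is a minimal Python sketch of the selection logic. It is only an illustration, not the actual loc-10-network code: GATEWAY_INTERFACES is the variable proposed here and does not exist today, its space-separated format is an assumption, and the function name select_default_gateways is made up for this example.

```python
def select_default_gateways(context):
    """Return a list of (interface, gateway) pairs to use as default routes.

    `context` is a dict of context variables, e.g. {"ETH0_GATEWAY": "10.0.0.1"}.
    GATEWAY_INTERFACES is the hypothetical variable proposed in this thread,
    assumed here to be a space-separated list such as "ETH1 ETH3".
    """
    chosen = []

    # 1. Use the gateways of the interfaces explicitly listed by the admin.
    for name in context.get("GATEWAY_INTERFACES", "").split():
        name = name.upper()
        gateway = context.get(f"{name}_GATEWAY")
        if gateway:
            chosen.append((name, gateway))
    if chosen:
        return chosen

    # 2. Fall back to the first interface's gateway, if it has one.
    fallback = context.get("ETH0_GATEWAY")
    return [("ETH0", fallback)] if fallback else []


if __name__ == "__main__":
    # Dual-NIC VM: ETH0 is private, ETH1 is public (addresses are examples).
    ctx = {
        "ETH0_GATEWAY": "10.0.0.1",
        "ETH1_GATEWAY": "203.0.113.1",
        "GATEWAY_INTERFACES": "ETH1",
    }
    print(select_default_gateways(ctx))
```

With GATEWAY_INTERFACES set to ETH1, a stray ETH0_GATEWAY would be ignored; without the variable, the behaviour falls back to ETH0_GATEWAY. In this sketch the fallback also applies when none of the listed interfaces has a gateway, which is a design choice that could go either way.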

Regards.