Open Th0masL opened 2 years ago
I think you should set the default gateway, in order:
1. GATEWAY_INTERFACES, or maybe GATEWAY_V4_INTERFACES and GATEWAY_V6_INTERFACES (we have VMs with 4 or 5 interfaces)
2. ETH0_GATEWAY as a fallback default value.
Regards.
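To make the suggested order concrete, here is a minimal sketch in Python (the real one-context.d scripts are shell, so this only illustrates the logic). GATEWAY_INTERFACES is the variable proposed in this comment, not an existing context attribute; the space-separated "eth1 eth0" value format, the function name and the example addresses are all assumptions made for illustration.

```python
def pick_default_gateway(context):
    """Return (interface, gateway) using the order suggested above, or None.

    `context` is a dict of VM context variables (all values are strings).
    """
    # 1. Explicit preference list. GATEWAY_INTERFACES is hypothetical here,
    #    assumed to hold space-separated interface names such as "eth1 eth0".
    for name in context.get("GATEWAY_INTERFACES", "").split():
        gateway = context.get(f"{name.upper()}_GATEWAY")
        if gateway:
            return name.lower(), gateway

    # 2. Fallback: ETH0_GATEWAY as the default value.
    gateway = context.get("ETH0_GATEWAY")
    if gateway:
        return "eth0", gateway

    # No usable gateway in the context at all.
    return None


# Hypothetical context: eth0 is the private network, eth1 the public one.
example = {
    "ETH0_GATEWAY": "10.0.0.1",
    "ETH1_GATEWAY": "192.0.2.1",
    "GATEWAY_INTERFACES": "eth1",
}
print(pick_default_gateway(example))
```

With this approach, a stray gateway on another interface would not change the default route as long as the preference list names the intended interface.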
Hi,
I've hit an unfortunate bug related to Default Gateway configuration by the one-context.d/loc-10-network initialization scripts on an Ubuntu VM.
First, let me explain the setup a bit.
I'm running an OpenNebula cluster of 10 Physical Hosts that was installed 1 year ago and has been working normally.
Each VM running in this cluster is connected to 2 Virtual Networks: a Public Network and a Private Network.
Because the Private Network does not have internet connectivity, I had to delete the GATEWAY attribute of the Private Network from OpenNebula Sunstone UI > Networks > Virtual Networks > MyPrivateNetwork, to ensure that the one-context.d/loc-10-network initialization scripts would only see the Public Network's Gateway and therefore use it as the Default Gateway in the VMs, giving them correct internet connectivity.
Everything worked perfectly fine for 1 year, until this week.
What happened this week is that I had to do maintenance on a Physical Host, so I restarted it.
When the Physical Host restarted, for some reason, the GATEWAY attribute that I had deleted 1 year ago from the Private Virtual Network came back and became visible again in OpenNebula Sunstone UI > Networks > Virtual Networks > MyPrivateNetwork.
Basically, it looks like this Physical Host had never been restarted since the installation of OpenNebula 1 year ago, and it seems that the deletion of the GATEWAY attribute during the original cluster setup was not saved correctly on this specific Physical Host, so the attribute came back after the reboot.
This is a totally unexpected bug that happened randomly and is very hard to explain. I have opened a ticket on the OpenNebula/one/5910 repository, but I doubt I have enough material to reproduce the problem.
Anyway, the presence of two ETHx_GATEWAY attributes in the VM Context Variables made the one-context.d/loc-10-network network initialization scripts pick the Private Gateway IP as the default gateway instead of the usual Public Gateway, leading to a total loss of connectivity on the VM and a serious network outage on my infrastructure.
So my question is: could we implement better logic for selecting the Default Gateway in the one-context.d/loc-10-network network initialization scripts, so that it always defaults to the best possible Gateway when multiple Gateways are provided in the Context Variables?
For example, ideally it would pick, in order of highest likelihood of being the right configuration:
If only one NIC:
If two or more NICs:
ETHx_METRIC value, if provided (obvious)
Is there any other way we can ensure a better choice of Default Gateway?
Thanks
Thomas
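As an illustration of the ETHx_METRIC rule mentioned above, here is a rough sketch in Python (the actual one-context.d scripts are shell, so this only shows the selection logic). The function names and example addresses are made up. The tie-breaking choices are also assumptions and not something the current scripts are known to do: the lowest metric wins, interfaces without a metric are only considered when no interface has one, and the lowest interface index is the final fallback.

```python
import re


def gateways_from_context(context):
    """Collect (index, gateway, metric) for every ETHx_GATEWAY in the context."""
    found = []
    for key, value in context.items():
        match = re.fullmatch(r"ETH(\d+)_GATEWAY", key)
        if match and value:
            index = int(match.group(1))
            raw_metric = context.get(f"ETH{index}_METRIC", "")
            # A missing or non-numeric metric is treated as "not provided".
            metric = int(raw_metric) if raw_metric.isdigit() else None
            found.append((index, value, metric))
    return found


def select_default_gateway(context):
    """Return (interface, gateway) for the preferred default route, or None."""
    candidates = gateways_from_context(context)
    if not candidates:
        return None
    # Assumption: if any interface provides a metric, only those compete.
    with_metric = [c for c in candidates if c[2] is not None]
    pool = with_metric or candidates
    # Lowest metric first, then lowest interface index as a tie-breaker.
    index, gateway, _ = min(
        pool, key=lambda c: (c[2] if c[2] is not None else 0, c[0])
    )
    return f"eth{index}", gateway


# Hypothetical context: the private gateway reappeared on eth0,
# but the public NIC (eth1) carries the lower metric.
example = {
    "ETH0_GATEWAY": "10.0.0.1",
    "ETH0_METRIC": "200",
    "ETH1_GATEWAY": "192.0.2.1",
    "ETH1_METRIC": "100",
}
print(select_default_gateway(example))
```

In a setup like the one described, an explicit metric on the public NIC would keep the default route on the Public Gateway even if a second ETHx_GATEWAY showed up unexpectedly.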