OpenNebula / one-apps

Toolchain to build OpenNebula appliances
Apache License 2.0

VNF on OneKE appliance doesn't NAT (or get any NAT related info) #89

Open kCyborg opened 2 months ago

kCyborg commented 2 months ago

Description Once our team tries to instantiate the OneKE appliance (both the normal and the airgapped version) available from the public OpenNebula marketplace, the VNF doesn't get any NAT rules, making communication from the public network to the VNF and then on to the private k8s cluster impossible :-(

To Reproduce

  1. Download the Appliance
  2. Create 2 networks: one public (in our case with a real public IP) and one private (in this example we will use 192.168.10.1/24)
  3. Instantiate the appliance under the Service tab using:

(screenshots of the instantiation settings omitted)
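Roughly the CLI equivalent of what the screenshots show (app name, datastore and template ID are illustrative, not the exact values we used):

# export the OneKE appliance from the public marketplace into the local datastores
onemarketapp export 'Service OneKE 1.29' OneKE --datastore default
# instantiate the resulting OneFlow service template, filling in the same
# network and VIP user inputs as in the screenshots
oneflow-template instantiate <oneke_service_template_id>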

Please note that we configured MetalLB, but we also tried without MetalLB and with some other simple tweaks, and still got no NAT on the VNF. We also tried different OpenNebula versions (6.4.X and 6.8.2).

Expected behavior A working k8s cluster

Details If we SSH into the VNF and check the log at /var/log/one-appliance/one-failover.log we get:

(screenshot of one-failover.log omitted)

It tells us that the VRouter failed, but it doesn't say why :-(

But if we check /var/log/one-appliance/configure.log, we get:

(screenshot of configure.log omitted)

It informs us that the /etc/iptables/rules-save file was created, but if we open the file, it is indeed empty:

(screenshot of the empty /etc/iptables/rules-save omitted)

And if we check the iptables:

vrouter:~# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

And:

vrouter:~# iptables -L -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination

And if we try with the recommended command:

iptables -t nat -vnL NAT4
iptables: No chain/target/match by that name.

We got nothing :-(
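For reference, even a minimal working NAT setup would show at least a masquerade rule in the POSTROUTING chain, something generic like this (purely illustrative, not necessarily the exact rules the appliance generates):

iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE   # SNAT private traffic out of the public NIC

In our case there is nothing at all.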

If we try to reach the public network from the master, storage or worker nodes (which have their DNS server in /etc/resolv.conf pointing to the private IP of the VNF node), we get no answer, meaning those k8s nodes can't reach anything on the internet.
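For example, checks like these from the master node fail (commands shown only to illustrate what we tested):

ping -c 1 8.8.8.8        # no reply, packets are never NATed by the VNF
nslookup opennebula.io   # times out, DNS goes through the VNF's private IP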

Additional context We don't really know if the problem is indeed in the VNF or if we are using a wrong configuration, as the documentation doesn't say much :-(


Franco-Sparrow commented 2 months ago

I confirm the issue

sk4zuzu commented 2 months ago

Hi,

the one-apps repo is the correct place to report VR- and OneKE-related issues. :point_up: :relieved:

iptables -t nat -vnL NAT4
iptables: No chain/target/match by that name.

Thanks, I've corrected the command in the docs; it was a simple typo.

In general, your OneFlow configuration looks OK and something similar to it seems to be working in my environments.

When the one-failover service "fails", it always tries to bring down every VR module possible, hence NAT (and everything else) is disabled. There must be a reason keepalived returned the FAULT state through the VRRP FIFO. If you examine /var/log/messages, maybe you can find some hint about what is going on with keepalived. You could also take a look at the /etc/keepalived/conf.d/*.conf files to see if everything looks OK, something like:

vrrp_sync_group VRouter {
    group {
        ETH1
    }
}
vrrp_instance ETH1 {
    state             BACKUP
    interface         eth1
    virtual_router_id 2
    priority          100
    advert_int        1
    virtual_ipaddress {
        192.168.10.2/32 dev eth0
        192.168.10.1/32 dev eth1
    }
    virtual_routes {
    }
}
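For example, something along these lines might surface the reason for the FAULT transition (commands are just a suggestion):

grep -iE 'keepalived|vrrp' /var/log/messages | grep -iE 'fault|error|fail'
cat /etc/keepalived/conf.d/*.conf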

:thinking:

kCyborg commented 2 months ago

Hi there @sk4zuzu, thank you for your answer, mate.

First, sorry for asking in the wrong repo, bro; it won't happen again.


Answering you:

  1. My VNFs (across several tries) all have the same keepalived info messages in /var/log/messages:
vrouter:~# cat /var/log/messages | grep keep
Apr 29 23:34:41 vrouter local3.debug one-contextd: Script loc-15-keepalived: Starting ...
Apr 29 23:34:41 vrouter local3.debug one-contextd: Script loc-15-keepalived: Finished with exit code 0
Apr 29 23:34:44 vrouter daemon.info Keepalived[2704]: WARNING - keepalived was built for newer Linux 6.3.0, running on Linux 6.1.78-0-virt OpenNebula/one#1-Alpine SMP PREEMPT_DYNAMIC Wed, 21 Feb 2024 08:19:22 +0000
Apr 29 23:34:44 vrouter daemon.info Keepalived[2704]: Command line: '/usr/sbin/keepalived' '--dont-fork' '--use-file=/etc/keepalived/keepalived.conf'
Apr 29 23:34:44 vrouter daemon.info Keepalived[2704]: Configuration file /etc/keepalived/keepalived.conf
Apr 29 23:34:44 vrouter daemon.info Keepalived[2704]: Script user 'keepalived_script' does not exist
Apr 29 23:34:44 vrouter daemon.info Keepalived[2704]: NOTICE: setting config option max_auto_priority should result in better keepalived performance
Apr 29 23:34:44 vrouter daemon.info Keepalived[2704]: Configuration file /etc/keepalived/keepalived.conf
Apr 29 23:34:44 vrouter daemon.info Keepalived[2704]: Script user 'keepalived_script' does not exist
Apr 29 23:34:44 vrouter daemon.info Keepalived_vrrp[2818]: Script user 'keepalived_script' does not exist

It's somehow screaming at me that the user 'keepalived_script' does not exist. But I guess this is not needed, since I understand the 'keepalived_script' user is only needed for the pre/post scripts keepalived runs.
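Just to rule it out, I guess the warning could be silenced by creating that user on the Alpine-based VNF (probably unrelated to the actual failure):

adduser -D -H -s /sbin/nologin keepalived_script   # no password, no home dir, no login shell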

  2. If I check /etc/keepalived/conf.d/vrrp.conf I get:
cat /etc/keepalived/conf.d/vrrp.conf
vrrp_sync_group VRouter {
    group {
        ETH1
    }
}
vrrp_instance ETH1 {
    state             BACKUP
    interface         eth1
    virtual_router_id 17
    priority          100
    advert_int        1
    virtual_ipaddress {
        10.1.0.11/26 dev eth0
        10.1.0.10/24 dev eth1
    }
    virtual_routes {
    }
}

Note: the IPs may be different from the last example I sent you, since I have tried more than one network configuration.

This differs from the /etc/keepalived/conf.d/*.conf you sent, if you look at these lines:

     virtual_ipaddress {
        10.1.0.11/26 dev eth0
        10.1.0.10/24 dev eth1
    }

Regarding the CIDR prefixes: in yours both NICs use /32, while in my case it's /26 for eth0 and /24 for eth1.
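For completeness, this is how we check which addresses actually end up on the NICs, including any VIPs keepalived manages (output omitted here):

ip -4 addr show   # lists every IPv4 address on each interface, including VIPs added by keepalived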

kCyborg commented 2 months ago

Hi there @sk4zuzu, I think I have found the problem (not the solution tho, sorry)

When we instantiate the cluster we define the Control Plane Endpoint VIP (IPv4) and Default Gateway VIP (IPv4):

(screenshot of the VIP user inputs omitted)

Note: I used to work with OneKE about a year ago, and those variables needed to be set manually.

If I leave those variables blank (empty), the cluster runs without a problem, regardless of the network I use, i.e.:

(screenshot omitted: a simple cluster with just a master, a worker and the aforementioned VNF)


Let me explain myself:

  1. I created a network using a simple private network template (rough CLI equivalent after this list):
AUTOMATIC_VLAN_ID = "YES"
CLUSTER_IDS = "100"
PHYDEV = "bond0"
VN_MAD = "802.1Q"

The already created private network:

(screenshots of the created private network omitted)

  2. Then, at instantiation time, I set the above-mentioned variables:

(screenshot of the instantiation variables omitted)
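Rough CLI equivalent of creating that private network (file name illustrative; the address range itself was added in Sunstone as shown in the screenshots):

onevnet create private-oneke.tmpl   # template with the attributes listed above
onevnet show <vnet_id>              # verify the VLAN and the address ranges (AR)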


Am I doing it wrong?

rsmontero commented 2 months ago

@kCyborg + @sk4zuzu issue has been transferred to the right repo

sk4zuzu commented 2 months ago

Hi @kCyborg

First, sorry for asking in the wrong repo, bro; it won't happen again.

It's fine, man :) @rsmontero already saved us. :point_up: :relieved:

As for the example with the 10.0.0.0 subnet, it seems you set the first VIP to 10.0.0.2, and then the same 10.0.0.2 address was used to create the master node. That can't work.

The first VIP address should preferably be from the public VNET (should work with the private VNET as well), but it has to be from outside the AR you use to deploy cluster nodes so there is no conflict on the IP protocol level.
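For example (addresses purely illustrative): with an address range like

AR = [ TYPE = "IP4", IP = "10.0.0.10", SIZE = "50" ]   # leases 10.0.0.10 - 10.0.0.59

the cluster nodes get their IPs from 10.0.0.10-10.0.0.59, so something like 10.0.0.200 would be a conflict-free VIP: same subnet, but never leased to a node.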