MusicDin / kubitect

Kubitect provides a simple way to set up a highly available Kubernetes cluster across multiple hosts.
https://kubitect.io
Apache License 2.0

Bridge mode fails #11

Closed asimpleidea closed 2 years ago

asimpleidea commented 2 years ago

Hi,

thank you so much for this project, it is really a life saver.

Recently I have been trying to create a bridged network and assign static IPs to all machines, but it keeps failing with the following message:

Error: couldn't retrieve IP address of domain id: 7c5e009c-443d-4ba4-a61a-0d9bbf3f61d3. Please check following:
1) is the domain running proplerly?
2) has the network interface an IP address?
3) Networking issues on your libvirt setup?
 4) is DHCP enabled on this Domain's network?
5) if you use bridge network, the domain should have the pkg qemu-agent installed
IMPORTANT: This error is not a terraform libvirt-provider error, but an error caused by your KVM/libvirt infrastructure configuration/setup
 timeout while waiting for state to become 'all-addresses-obtained' (last state: 'waiting-addresses', timeout: 5m0s)

  with module.worker_module["1"].libvirt_domain.vm_domain,
  on modules/vm/vm.tf line 58, in resource "libvirt_domain" "vm_domain":
  58: resource "libvirt_domain" "vm_domain" {

Basically, it keeps waiting for IPs even though they are assigned statically, and it times out after 5 minutes:

2022-03-01T10:34:39.842Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] waiting for network address for iface=52:54:00:6C:3C:01
2022-03-01T10:34:39.842Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] qemu-agent used to query interface info
2022-03-01T10:34:39.843Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] Interfaces info obtained with libvirt API:
2022-03-01T10:34:39.843Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: ([]libvirt.DomainInterface) <nil>
2022-03-01T10:34:39.843Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14:
2022-03-01T10:34:39.843Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] ifaces with addresses: []
2022-03-01T10:34:39.843Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] 52:54:00:6C:3C:01 doesn't have IP address(es) yet...
2022-03-01T10:34:39.843Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] IP address not found for iface=52:54:00:6C:3C:01: will try in a while
2022-03-01T10:34:39.843Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [TRACE] Waiting 10s before next try
2022-03-01T10:34:39.880Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] waiting for network address for iface=52:54:00:6C:3C:02
2022-03-01T10:34:39.880Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] qemu-agent used to query interface info
2022-03-01T10:34:39.881Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] Interfaces info obtained with libvirt API:
2022-03-01T10:34:39.881Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: ([]libvirt.DomainInterface) <nil>
2022-03-01T10:34:39.881Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14:
2022-03-01T10:34:39.881Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] ifaces with addresses: []
2022-03-01T10:34:39.881Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] 52:54:00:6C:3C:02 doesn't have IP address(es) yet...
2022-03-01T10:34:39.881Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] IP address not found for iface=52:54:00:6C:3C:02: will try in a while
2022-03-01T10:34:39.881Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [TRACE] Waiting 10s before next try
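
When the debug log is long, it can help to summarize which interfaces are still waiting for an address. The following is a minimal sketch that assumes the log lines look exactly like the ones above; the function name `pending_interfaces` is hypothetical and not part of the provider or of Kubitect.

```python
import re
from collections import Counter

# Matches the "IP address not found" retry message shown in the log above
# and captures the MAC address of the interface.
_NOT_FOUND = re.compile(r"IP address not found for iface=([0-9A-Fa-f:]{17})")


def pending_interfaces(log_text):
    """Count, per MAC address, how often the provider reported a missing IP."""
    return Counter(match.group(1).upper() for match in _NOT_FOUND.finditer(log_text))
```

Passing the full contents of the Terraform debug log to `pending_interfaces` gives one counter entry per MAC address, with the number of failed attempts as the value.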

Does anyone know why this is happening?

MusicDin commented 2 years ago

Hi,

is the bridge interface preconfigured on the host machine?

Even if that is not the case, I will add a check for the bridge device to detect the error earlier.

asimpleidea commented 2 years ago

Yes, I tried different approaches, both with netplan and by following the guide in the example folder.

This is the output of networkctl status -a:

● 2: ens160
       Link File: /lib/systemd/network/99-default.link
    Network File: /run/systemd/network/10-netplan-ens160.networ
            Type: ether
           State: routable (configuring)
            Path: pci-0000:03:00.0
          Driver: vmxnet3
          Vendor: VMware
           Model: VMXNET3 Ethernet Controller
      HW Address: 00:0c:29:19:84:1c (VMware, Inc.)
         Address: 192.168.1.160
         Gateway: 192.168.1.1
             DNS: 8.8.8.8

[...]

● 4: br18
       Link File: /lib/systemd/network/99-default.link
    Network File: n/a
            Type: ether
           State: routable (unmanaged)
          Driver: bridge
      HW Address: ba:8e:02:dc:4f:b9
         Address: 192.168.1.160
                  fe80::b88e:2ff:fedc:4fb9
         Gateway: 192.168.1.1

and this is /etc/systemd/network/br0-static-ip.network:

[Match]
Name=br18

[Network]
Address=192.168.1.160/24
Gateway=192.168.1.1
DNS=192.168.1.1     # Router's DNS
# DNS=8.8.8.8       # Additional DNS if required

Thanks for the help!

MusicDin commented 2 years ago

I'm not sure what the relationship between these two interfaces (br18 and ens160) is, as ens160 is created by VMware and is not enslaved to the bridge interface.

First, make sure that the created bridge is active and that it has been given an IP address (from the given snippet, it seems that it has).

It seems to me that the bridge device is misconfigured and, as a consequence, the libvirt provider cannot gather IP addresses for the virtual machines, but I may be wrong.

For example, I would create my bridge interface using netplan as follows:

network:
  version: 2
  bridges:
    br18:
      interfaces:
      - ens160
      dhcp4: true
      dhcp6: false
  ethernets:
    ens160: {}
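
One way to confirm that an interface is actually enslaved to the bridge is to look at the bridge's port list, which Linux exposes under /sys/class/net/<bridge>/brif. The sketch below assumes a Linux host; the helper names `bridge_ports` and `is_enslaved` are hypothetical, and the `sysfs_root` parameter exists only so that the functions can be pointed at another directory.

```python
from pathlib import Path


def bridge_ports(bridge, sysfs_root="/sys/class/net"):
    """Return the sorted names of the interfaces attached to `bridge`.

    Raises FileNotFoundError if `bridge` does not exist or is not a bridge.
    """
    brif = Path(sysfs_root) / bridge / "brif"
    if not brif.is_dir():
        raise FileNotFoundError(f"{bridge} is not a bridge device under {sysfs_root}")
    # Each entry in brif/ is named after one enslaved interface.
    return sorted(entry.name for entry in brif.iterdir())


def is_enslaved(interface, bridge, sysfs_root="/sys/class/net"):
    """Return True if `interface` is a port of `bridge`."""
    return interface in bridge_ports(bridge, sysfs_root)
```

On the host in question, `is_enslaved("ens160", "br18")` should return True once the netplan configuration has been applied correctly.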

Can you also provide the network section from terraform.tfvars?

asimpleidea commented 2 years ago

One question: do I need to have dhcp4: true in the bridge even though I am assigning static IPs in the network section? With netplan, I created the bridge like this:

network:
  version: 2
  renderer: networkd
  ethernets:
    ens33:
      addresses: [ 192.168.1.26/24 ]
      gateway4: 192.168.1.1
      nameservers:
          addresses:
              - "192.168.1.1"
    ens160: 
      dhcp4: false
      dhcp6: false
  bridges:
    br18:
      dhcp4: false
      dhcp6: false
      nameservers:
          addresses:
              - "192.168.1.1"
      addresses: [ 192.168.1.160/24 ]
      interfaces:
      - ens160

Some relevant parts of terraform.tfvars:

# Network mode (nat, route, bridge) #
network_mode = "bridge"

# Network CIDR (example: 192.168.113.0/24) #
network_cidr = "192.168.1.0/24"

# Network (virtual) bridge #
# Note: For network mode 'bridge', bridge on host needs to be preconfigured (example: br0) #
network_bridge = "br18"

# Network gateway (example: 192.168.113.1) #
# Note: If not provided, it will be calculated as first host in network CIDR. #
#       +-> first host of 192.168.113.0/24 is 192.168.113.1 #
#network_gateway = "192.168.113.1"

# Network DNS list (if empty, network gateway is set as a DNS) #
network_dns_list = [
  "192.168.1.1",
  "8.8.8.8"
]

# Other stuff...

master_nodes = [
  {
    id  = 1
    ip  = "192.168.1.150"
    mac = "52:54:00:00:00:10"
  }
]

# Other stuff...
worker_nodes = [
  {
    id  = 1
    ip  = "192.168.1.151"
    mac = "52:54:00:00:00:11"
  }
]

If dhcp4: true is needed even with static IPs, then I will give that one more try, but I am sure I am making some other mistake somewhere.

Thank you so much for your help @MusicDin.

MusicDin commented 2 years ago

You don't need to enable dhcp4 if you don't use it.

Otherwise, both configurations seem valid to me.

How long did you let the script run before you stopped it? If you stop the script too early, the qemu agent may not yet have reported the IP address it received. For example, you can sometimes see all VMs receive their IP addresses only after exactly 2 minutes. For this reason, I recommend letting the script run until it terminates on its own (max. 5 minutes).

Please let me know whether this solves your problem or, if not, what error is reported at the end.

asimpleidea commented 2 years ago

I always let it run; it terminates on its own after 5 minutes, and the error that I posted in my first post appears.

asimpleidea commented 2 years ago

Anyway, I think this has more to do with the terraform libvirt-provider. I will try to follow some of the related issues in their repository (e.g. https://github.com/dmacvicar/terraform-provider-libvirt/issues/924) and will let you know if I find anything. Thank you! :)

MusicDin commented 2 years ago

I was able to recreate this issue.

For example, I have my network configured as follows:

CIDR (for LAN network): 10.10.0.0/20
GW (router's IP): 10.10.0.1

If I enter the following values when creating the cluster, the cluster is created successfully:

# terraform.tfvars

network_mode = "bridge"
network_bridge = "br0"
network_cidr = "10.10.0.0/20"
network_gateway = "10.10.0.1" # In this case, GW can be omitted
...
master_nodes = [
  {
    id  = 1
    ip  = "10.10.6.5"
  }
]
...
worker_nodes = [
  {
    id  = 1
    ip  = "10.10.6.6"
  }
]

If I enter the wrong GW IP, the addresses are not retrieved and I get the same error message as you. The same thing happens if the wrong network CIDR is specified. For example, if I enter network_cidr = "10.10.0.0/22", I again get the same error as you.

Can you verify that you entered the correct CIDR and GW?
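
Part of this check can be automated. The sketch below uses Python's standard `ipaddress` module to test that the node IPs and the gateway fall inside `network_cidr`, and it derives the default gateway as the first host of the network, as described in the terraform.tfvars comments. The function names are hypothetical and not part of Kubitect. Note that it only checks that the values are consistent with each other; it cannot tell whether they match the real LAN, so a wrong gateway that still lies inside the CIDR would go unnoticed.

```python
import ipaddress


def default_gateway(cidr):
    """Return the first host address of `cidr` as a string."""
    network = ipaddress.ip_network(cidr)
    return str(next(iter(network.hosts())))


def check_network(cidr, node_ips, gateway=None):
    """Return a list of problems; an empty list means the values are consistent."""
    network = ipaddress.ip_network(cidr)
    # Fall back to the first host of the network when no gateway is given.
    gw = ipaddress.ip_address(gateway or default_gateway(cidr))
    problems = []
    if gw not in network:
        problems.append(f"gateway {gw} is outside {network}")
    for ip in node_ips:
        address = ipaddress.ip_address(ip)
        if address not in network:
            problems.append(f"node IP {address} is outside {network}")
        elif address == gw:
            problems.append(f"node IP {address} collides with the gateway")
    return problems
```

With the values above, `check_network("10.10.0.0/20", ["10.10.6.5", "10.10.6.6"], "10.10.0.1")` reports no problems, while the same call with "10.10.0.0/22" flags both node IPs, because that range ends at 10.10.3.255.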

asimpleidea commented 2 years ago

Will check this out asap, thank you! :)

MusicDin commented 2 years ago

Hi,

can you let me know if the above solved your problem? Thanks.

asimpleidea commented 2 years ago

Hi @MusicDin, so sorry for not replying sooner. I double-checked everything and the values are indeed correct, but I still had the problem. After what you wrote, though, I am more convinced that the problem is a misconfiguration of mine somewhere else rather than the script itself.

BTW, I have a proxy server and modified your scripts to inject proxy environment variables in the cluster appropriately, and so in nat mode everything works fine. My guess -- but I may be wrong -- is that maybe the qemu agent cannot contact the node because the proxy, at that point, is not configured in the guest yet, and so communication to the host is blocked. Do you think this could be the case?

Anyway, I have reverted to using nat mode, as it is still acceptable for my use case for now :)

MusicDin commented 2 years ago

In general, I don't think the proxy is the problem, because if your bridge interface gets its own IP address, so should the virtual machines. This is just a guess though, as I have no idea how your network is implemented.

I'm still not able to reproduce the issue other than with incorrect values, so it seems to me that it needs further investigation on your end. If the NAT mode is sufficient for your needs, that should do for now.

Please let me know if you have any more questions or information about this problem.

MusicDin commented 2 years ago

One more question @asimpleidea - can you please tell me which hypervisor you're using and which OS image you're installing on the nodes?

asimpleidea commented 2 years ago

I am using ESXi and, if I remember correctly, the images were Ubuntu 16.04. I may try again with 20.04, though.

So to conclude, I agree with you that I will have to investigate further and will let you know if I have other news :) Thanks so much @MusicDin !

MusicDin commented 2 years ago

Thanks again for the information provided and for opening the issue.

I'll close it for now, but feel free to reopen it if you have something new.