apache / cloudstack

Apache CloudStack is an opensource Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

Network issue with VM running on second host. #7364

Closed Atiqul-Islam closed 1 year ago

Atiqul-Islam commented 1 year ago
ISSUE TYPE
COMPONENT NAME
VPN on newly added hosts.
CLOUDSTACK VERSION
4.17.1
CONFIGURATION
OS / ENVIRONMENT
SUMMARY

I have a zone on CloudStack with the network configuration specified above. I also have an isolated network (10.8.0.0/16) in the zone, and a VM in the zone connected to that isolated network via a virtual router running on Host 1.

I added a second host to the zone as well. However, VMs running on the second host and connected to the same network cannot access the internet. VMs on the first host do have internet access and work as expected.

STEPS TO REPRODUCE

Note: The Management Server and Host 1 are running on Device A; Host 2 is running on Device B.

Device A network configuration

network:
  version: 2
  renderer: networkd
  ethernets:
    enp107s0:
      dhcp4: false
    enp108s0:
      dhcp4: false
  bridges:
    cloudbr0:
      addresses: [10.4.1.10/16]
      routes:
        - to: 0.0.0.0/0
          via: 10.4.1.1
          metric: 100
      nameservers:
        addresses: [10.4.1.1]
      interfaces: [enp107s0]
      dhcp4: false
      dhcp6: false
    cloudbr1:
      addresses: [10.6.1.10/16]
      routes:
        - to: 0.0.0.0/0
          via: 10.6.1.1
          metric: 1000
      nameservers:
        addresses: [10.4.1.1]
      interfaces: [enp108s0]
      dhcp4: false
      dhcp6: false

Device B network configuration

network:
  version: 2
  renderer: networkd
  ethernets:
    enp107s0:
      dhcp4: false
    enp108s0:
      dhcp4: false
  bridges:
    cloudbr0:
      addresses: [10.4.1.11/16]
      routes:
        - to: 0.0.0.0/0
          via: 10.4.1.1
          metric: 100
      nameservers:
        addresses: [10.4.1.1]
      interfaces: [enp107s0]
      dhcp4: false
      dhcp6: false
    cloudbr1:
      addresses: [10.6.1.11/16]
      routes:
        - to: 0.0.0.0/0
          via: 10.6.1.1
          metric: 1000
      nameservers:
        addresses: [10.4.1.1]
      interfaces: [enp108s0]
      dhcp4: false
      dhcp6: false
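
As a quick sanity check of the two netplan files above, the following minimal Python sketch confirms that each bridge sits in the same subnet on both devices and that the configured gateway is reachable from that subnet. The addresses and gateways are copied from the configurations; the `HOSTS` and `GATEWAYS` dictionaries and the `check_bridges` helper are hypothetical names introduced only for this illustration, and the check says nothing about the switch or the isolation method.

```python
import ipaddress

# Bridge addresses copied from the two netplan files above.
HOSTS = {
    "Device A": {"cloudbr0": "10.4.1.10/16", "cloudbr1": "10.6.1.10/16"},
    "Device B": {"cloudbr0": "10.4.1.11/16", "cloudbr1": "10.6.1.11/16"},
}
# Default-route gateways copied from the same files.
GATEWAYS = {"cloudbr0": "10.4.1.1", "cloudbr1": "10.6.1.1"}


def check_bridges(hosts, gateways):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    for bridge, gateway in gateways.items():
        networks = set()
        for host, bridges in hosts.items():
            iface = ipaddress.ip_interface(bridges[bridge])
            networks.add(iface.network)
            # The gateway must be inside the bridge's own subnet.
            if ipaddress.ip_address(gateway) not in iface.network:
                problems.append(
                    f"{host} {bridge}: gateway {gateway} is outside {iface.network}"
                )
        # All hosts should agree on the subnet used by a given bridge.
        if len(networks) != 1:
            subnets = sorted(str(n) for n in networks)
            problems.append(f"{bridge}: hosts use different subnets {subnets}")
    return problems


if __name__ == "__main__":
    issues = check_bridges(HOSTS, GATEWAYS)
    print("no addressing problems found" if not issues else "\n".join(issues))
```

With the values from this report the check finds no addressing problems, which suggests the host-level IP configuration is consistent and the cause lies elsewhere.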
EXPECTED RESULTS
Both VM1 and VM2 should have internet connectivity. 
ACTUAL RESULTS
VM1 has internet connectivity; however, VM2 does not.
weizhouapache commented 1 year ago

@Atiqul-Islam since you use GRE as the isolation method, have you configured openvswitch correctly?

Atiqul-Islam commented 1 year ago

@weizhouapache Really appreciate the response.

We are using Ubuntu Bridges. I wasn't aware that OpenVSwitch is recommended for GRE.

I updated the post with my network configuration for the respective devices.

Based on the article below, it seems like Ubuntu netplan doesn't support OpenVSwitch. https://www.shapeblue.com/networking-kvm-for-cloudstack-2018-revisit-for-centos7-and-ubuntu-18-04/

Will any of the alternatives to GRE work without OpenVSwitch (with Ubuntu bridges)?

weizhouapache commented 1 year ago

@Atiqul-Islam netplan supports openvswitch I think.

You can use vlan or vxlan instead of GRE.

Atiqul-Islam commented 1 year ago

@weizhouapache

Thank you for the response. My apologies for my late response.

Currently, the system running the management server and the first host is connected to the switch via two Ethernet interfaces, A (10.4.1.1/16) and B (10.6.1.1/16). Note: Both interface A and interface B are connected to untagged ports on the switch (we are using 802.1Q VLANs on the switch, and it only has options for tagged and untagged ports). Interface A is used to create cloudbr0 and interface B is used to create cloudbr1. cloudbr0 is used by the Management network, which uses VLAN isolation. cloudbr1 is used by the Guest and Public networks, whose isolation type is what we are discussing.

Given this scenario, and correct me if I am wrong: my understanding is that if I use VLAN as the isolation type for the Public and Guest networks, I will need to connect interface B (associated with cloudbr1) to a trunk or tagged port on the switch. I already tried VLAN isolation for the Public and Guest networks and had the same issue, where the VM on the second host couldn't communicate with the virtual router on the first host. So I am assuming the issue was that interface B is connected to an untagged port on the switch.

However, if I use VXLAN isolation, which encapsulates guest traffic in UDP packets carried over layer 3, interface B (associated with cloudbr1) can stay connected to an untagged or access port.

If my assumptions are correct, VXLAN is more likely to work with the existing switch configuration (interfaces A and B connected to untagged switch ports).

Additionally, is there any further configuration I have to do on the system running the management server and the host, or any additional software I need to install on it? I did try VXLAN isolation for the Public and Guest networks with VLAN isolation for the Management network, and I seemed to have internet connectivity issues on the VMs (regardless of the host they are on). I tried installing Ubuntu 20.04 Server; however, fetching and downloading updates from the server was very slow and the installation would eventually crash. I am really confused as to why I was getting such behavior. I checked the management server log but didn't notice any error messages.
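
The VLAN versus VXLAN reasoning above can be illustrated with a minimal Python sketch of the two markers involved. An 802.1Q tag is inserted into the Ethernet frame itself, so the switch port has to accept tagged frames. A VXLAN header (layout per RFC 7348) travels inside a UDP payload, so the outer frame the switch sees is an ordinary untagged one. The function names `dot1q_tag` and `vxlan_header` are hypothetical and only build the raw bytes; they are not part of CloudStack or any library.

```python
import struct


def dot1q_tag(vlan_id: int, priority: int = 0) -> bytes:
    """Build the 4-byte 802.1Q tag: TPID 0x8100, then PCP(3) | DEI(1) | VID(12)."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID must be in 1..4094")
    if not 0 <= priority <= 7:
        raise ValueError("priority must be in 0..7")
    tci = (priority << 13) | vlan_id  # DEI bit left at 0
    return struct.pack("!HH", 0x8100, tci)


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: I flag (0x08), reserved bits, 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # First word: flags byte followed by 24 reserved bits.
    # Second word: 24-bit VNI followed by 8 reserved bits.
    return struct.pack("!II", 0x08 << 24, vni << 8)


if __name__ == "__main__":
    print(dot1q_tag(601).hex())     # tag for VLAN 601, the switch VLAN mentioned in this thread
    print(vxlan_header(100).hex())  # header for an example VNI of 100
```

The 12-bit VLAN ID versus the 24-bit VNI is also why VXLAN offers far more isolated segments, although that is a side point for this issue.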

weizhouapache commented 1 year ago

@Atiqul-Islam from what I understand (maybe wrong),

Atiqul-Islam commented 1 year ago

@weizhouapache

Thank you for confirming my assumption. Your suggestion aligns with my assumptions.

I am not sure I fully understand what you mean by the VLAN/VXLAN tag. The VLAN ID associated with interface B on the switch (connected to the guest and public networks) is 601.

While configuring the advanced zone, I left the VLAN ID empty for the public network, which has an IP range of 10.4.2.1 to 10.4.2.255, and chose a VLAN ID range of 100 to 200 for the guest network.

Note: Physical devices on the 10.4.1.1/16 and 10.6.1.1/16 networks use the ranges 10.4.1.2 to 10.4.1.255 and 10.6.1.2 to 10.6.1.255, respectively.
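
The address ranges listed above can be checked for overlap with a minimal Python sketch. The range values come from this comment; the helper names `ip_range`, `overlaps` and `inside` are hypothetical and introduced only for this illustration.

```python
import ipaddress


def ip_range(first: str, last: str):
    """Return an inclusive (low, high) pair of IP address objects."""
    low, high = ipaddress.ip_address(first), ipaddress.ip_address(last)
    if low > high:
        raise ValueError(f"range start {first} is after range end {last}")
    return low, high


def overlaps(a, b) -> bool:
    """True if two inclusive ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]


def inside(rng, network: str) -> bool:
    """True if both ends of the range fall inside the given network."""
    # strict=False accepts host-style notation such as 10.4.1.1/16.
    net = ipaddress.ip_network(network, strict=False)
    return rng[0] in net and rng[1] in net


PUBLIC = ip_range("10.4.2.1", "10.4.2.255")      # CloudStack public IP range
PHYSICAL_A = ip_range("10.4.1.2", "10.4.1.255")  # physical devices, network A
PHYSICAL_B = ip_range("10.6.1.2", "10.6.1.255")  # physical devices, network B

if __name__ == "__main__":
    print("public overlaps physical A:", overlaps(PUBLIC, PHYSICAL_A))
    print("public inside 10.4.1.1/16:", inside(PUBLIC, "10.4.1.1/16"))
    print("public inside 10.6.1.1/16:", inside(PUBLIC, "10.6.1.1/16"))
```

With these values the public range does not collide with the physical devices, and it belongs to the 10.4.0.0/16 network rather than 10.6.0.0/16.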

Management network - tagged cloudbr1 - isolation type: VLAN
Public and Guest network - tagged cloudbr0 - isolation type: VXLAN

Hope that answers your question, let me know if you have any more questions or if I missed something.

My physical virtual router is configured to connect 10.4.1.1/16 and 10.6.1.1/16 to the internet via the external network 10.1.1.1/24. I am able to access the internet and ping google.com from the CloudStack Virtual Router in the advanced zone isolated network, and also from the VMs running on the isolated network, when I use VXLAN isolation.

The issue I am having: Trying to install Ubuntu Server 20.04 (https://releases.ubuntu.com/20.04.5/ubuntu-20.04.5-live-server-amd64.iso?_ga=2.143809077.1213128030.1678795300-1783458840.1675265647 ) results in all kinds of connection issues during the installation. In most cases updates fail, or the download speed is so slow that the installation crashes. I tried installing the same image on another system and it works fine.

To me it seems like the issue is caused by unstable internet connectivity with VXLAN, as I am able to ping google.com from another VM running a CentOS template. This leads me to suspect that I am probably not satisfying some prerequisite for VXLAN isolation, or that something in my advanced zone configuration is wrong.

Note: I have all egress and firewall rules open on the CloudStack isolated network and its virtual router, but the issue persists.

weizhouapache commented 1 year ago

My physical virtual router is configured to connect 10.4.1.1/16 and 10.6.1.1/16 to the internet via the external network 10.1.1.1/24. I am able to access the internet and ping google.com from the CloudStack Virtual Router in the advanced zone isolated network, and also from the VMs running on the isolated network, when I use VXLAN isolation.

@Atiqul-Islam your network configuration looks correct if, as you said,

Do you use nested virtualization (virtual instances as KVM hosts)?

Atiqul-Islam commented 1 year ago

@weizhouapache

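Appreciate the help. Listed below are answers to your questions.

  • I am able to ping Google from the CloudStack Virtual Router.
  • I am able to ping Google from the VM I quickly spun up using the CentOS template that comes with CloudStack.
  • One of the KVM hosts and the Management Server are running on one bare-metal physical Ubuntu system; the second host is running on another bare-metal physical Ubuntu system.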

Additional Tests I performed

Based on what I have found so far, the MTU of interface B (the interface associated with the VXLAN isolation) is 1500, the same as the other interface, which is associated with the VLAN. Do you think the MTU of 1500 is causing the problem?

Also, what is the recommended MTU for VXLAN isolation in CloudStack?

I tried setting the MTU of cloudbr1 to 1550, and that seems to have resolved the issue for the VM running on the same host as the virtual router; I was able to install Ubuntu there without any issues. However, I am now having issues with the VM on the second host.
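
The 1500 versus 1550 numbers above follow from the VXLAN encapsulation overhead, which the minimal Python sketch below works through. It assumes an IPv4 underlay and no VLAN tag on the inner frame; the constant and function names are hypothetical and only illustrate the arithmetic, not a CloudStack setting.

```python
# Bytes added in front of the guest's IP packet when it is carried over VXLAN.
# These all count against the MTU of the underlay interface (assumed IPv4).
OUTER_IPV4_HEADER = 20
OUTER_UDP_HEADER = 8
VXLAN_HEADER = 8
INNER_ETHERNET_HEADER = 14  # the guest's own Ethernet header travels as payload

VXLAN_OVERHEAD_IPV4 = (
    OUTER_IPV4_HEADER + OUTER_UDP_HEADER + VXLAN_HEADER + INNER_ETHERNET_HEADER
)  # 50 bytes


def max_guest_mtu(underlay_mtu: int) -> int:
    """Largest guest MTU that fits without fragmentation on the underlay."""
    return underlay_mtu - VXLAN_OVERHEAD_IPV4


def required_underlay_mtu(guest_mtu: int) -> int:
    """Smallest underlay MTU that carries guest packets of the given MTU."""
    return guest_mtu + VXLAN_OVERHEAD_IPV4


if __name__ == "__main__":
    print(max_guest_mtu(1500))          # guest MTU that fits a 1500-byte underlay
    print(required_underlay_mtu(1500))  # underlay MTU needed for 1500-byte guests
```

Under these assumptions a 1500-byte underlay leaves room for 1450-byte guest packets, while guests that keep the default 1500 need an underlay MTU of at least 1550 on every hop between the hosts, including the physical NICs, the bridges and the switch.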

Issues I am having right now: The Ubuntu installation process on the VM running on the host without the virtual router is unable to connect to the internet and as a result is stuck at the curtin command. I find the behavior really weird and unexplainable, as I am able to ping google.com and curl google.com from another VM with the CentOS template running on the same host.

Note: I am using the same ISO to install Ubuntu on both VMs. On the VM on the host that runs the virtual router, the installation runs fine and pulls updates from the internet as required. On the second VM (on the host without the virtual router), the installation is unable to connect to the internet and gets stuck. Both VMs are on the same network.

Something I noticed: even though I didn't explicitly configure a storage network, the storage network was added to cloudbr0 along with the management network. This makes me wonder whether that is causing any issue, as cloudbr0 is connected to an access (untagged) port on the switch, yet it seems to be used for both the management and the storage network.

Let me know if you have any more questions, or would like to look at any network details.

weizhouapache commented 1 year ago

@Atiqul-Islam thanks for the update.

see my replies inline.

@weizhouapache

Appreciate the help. Listed below are answers to your questions.

  • I am able to ping Google from the CloudStack Virtual Router.
  • I am able to ping Google from the VM I quickly spun up using the CentOS template that comes with CloudStack.

From what I understand, the CentOS template VM runs on a different host than the CloudStack Virtual Router, right?

  • One of the KVM hosts and the Management Server are running on one bare-metal physical Ubuntu system; the second host is running on another bare-metal physical Ubuntu system.

Additional Tests I performed

Based on what I have found so far, the MTU of interface B (the interface associated with the VXLAN isolation) is 1500, the same as the other interface, which is associated with the VLAN. Do you think the MTU of 1500 is causing the problem?

Yes, that is my suspicion. Nested environments might have some issues caused by MTU size. From what you said below, it seems the MTU size caused the Ubuntu installation to get stuck.

Also, what is the recommended MTU for VXLAN isolation in CloudStack?

I tried setting the MTU of cloudbr1 to 1550, and that seems to have resolved the issue for the VM running on the same host as the virtual router; I was able to install Ubuntu there without any issues. However, I am now having issues with the VM on the second host.

Issues I am having right now: The Ubuntu installation process on the VM running on the host without the virtual router is unable to connect to the internet and as a result is stuck at the curtin command. I find the behavior really weird and unexplainable, as I am able to ping google.com and curl google.com from another VM with the CentOS template running on the same host.

Can you check the virtual router during the Ubuntu installation?

Since the KVM hosts are running on top of physical Ubuntu servers, have you configured the MTU of the virtual NICs of the KVM hosts on the physical servers?

Note: I am using the same ISO to install Ubuntu on both VMs. On the VM on the host that runs the virtual router, the installation runs fine and pulls updates from the internet as required. On the second VM (on the host without the virtual router), the installation is unable to connect to the internet and gets stuck. Both VMs are on the same network.

Something I noticed: even though I didn't explicitly configure a storage network, the storage network was added to cloudbr0 along with the management network. This makes me wonder whether that is causing any issue, as cloudbr0 is connected to an access (untagged) port on the switch, yet it seems to be used for both the management and the storage network.

That's expected. If a storage network is not specified, the management network will be used as the storage network.

Let me know if you have any more questions, or would like to look at any network details.

Atiqul-Islam commented 1 year ago

@weizhouapache

My apologies for the late response. I was able to resolve the issue with VLAN isolation. The issue with my previous VLAN configuration was that my switch ports were not trunked.

On a separate note, we recently came across the homogeneous-CPU requirement for hosts in CloudStack. Does this imply that an existing cluster can become unscalable if the matching hardware is no longer available on the market? For example, can a cluster running 10th-gen i7 CPUs not be scaled with 11th-gen i7 hosts? Is there any way to mitigate this?

DaanHoogland commented 1 year ago

@Atiqul-Islam I am closing this issue. If you feel this is invalid, please reopen it or create a new one.