bilelmsekni / OpenStack-Grizzly-Install-Guide

A full install guide for OpenStack Grizzly

Very slow and unreliable network on instance creation from snapshot #143

Closed mtasic85 closed 10 years ago

mtasic85 commented 10 years ago

I installed everything as explained in https://github.com/mseknibilel/OpenStack-Grizzly-Install-Guide/blob/OVS_MultiNode/OpenStack_Grizzly_Install_Guide.rst and additionally downgraded the kernel to 3.2 so that OVS works with it. I then launched dozens of virtual machines, mostly one after another and in some cases a batch at once just to test the system. It was mostly OK, but sometimes an instance was not assigned any address and no address was visible in Horizon.

Most of the time the system is responsive, but in some situations, such as when I create a snapshot of a running machine or create an instance from a snapshot, the whole network seems dead. On the controller node I run "watch quantum agent-list" to monitor whether the compute nodes are reachable, and then chaos starts. First, "quantum agent-list" becomes very delayed. Second, compute nodes become unreachable. Third, some of them never come back. When I diagnosed the issue on each individual compute node, I found that the network interfaces had gone down, so I had to bring them up with "ifconfig ethX up".
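To make the "interfaces went down" diagnosis repeatable, here is a minimal sketch that reads each interface's state from sysfs. The function names and the SYSFS_NET override are my own and not part of the original setup; the override exists only so the functions can be tried without real NICs.

```shell
#!/bin/bash
# Directory holding one sub-directory per network interface on Linux.
# SYSFS_NET is a hypothetical override, useful only for dry runs.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print the kernel-reported state ("up", "down", ...) of one interface,
# or "missing" if the interface does not exist.
iface_state() {
    local f="$SYSFS_NET/$1/operstate"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "missing"
    fi
}

# Print "<iface> <state>" for every interface name passed in.
report_ifaces() {
    local iface
    for iface in "$@"; do
        echo "$iface $(iface_state "$iface")"
    done
}

# Example on a compute node: report_ifaces eth0 eth1
```

Running report_ifaces from cron or under "watch" on each compX would show which interface drops first, which the ping-based check cannot tell you.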

I got very tired of doing all this by hand, so I wrote a script to do the job for me.

Each compute node has the hostname "compX", where X is 0, 1, 2, ..., 19, depending on which compute node I'm connected to.

On compX, edit /root/nurse-network.sh:

#!/bin/bash
# Address pinged to decide whether the network is still alive.
TEST_IP=10.10.10.51

while true; do
    # If a single ping gets no reply, bounce both NICs.
    if ! ping -q -c1 "$TEST_IP" > /dev/null
    then
        ifconfig eth0 down
        ifconfig eth1 down
        ifconfig eth0 up
        ifconfig eth1 up
    fi

    sleep 10
done

On compX:

# chmod +x /root/nurse-network.sh

On compX, edit /etc/rc.local (the trailing & keeps rc.local from blocking on the script's endless loop):

...
/root/nurse-network.sh &
...

This resolved the issue of network interfaces not coming back.

This is definitely not normal, expected behavior. I have 1 controller node with 2 NICs, 1 network node with 3 NICs, and 20 compute nodes (heterogeneous, but with similar hardware configurations) with 2 NICs each, and the network has 2 very good 1Gb switches.

As far as I understand, https://github.com/mseknibilel/OpenStack-Grizzly-Install-Guide/blob/OVS_MultiNode/OpenStack_Grizzly_Install_Guide.rst explains a multi-node installation, but the guide is not complete. I will definitely contribute to it once I'm done experimenting and have confirmed that the system is reliable.

I personally believe that one network interface gets overwhelmed by all the network traffic. I suspect the one on the 10.10.10.0/24 network.
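The script reacts to a single lost ping, so one dropped packet is enough to bounce both NICs. A possible variant, sketched below, waits for several consecutive failures first. MAX_FAILS, the 2-second ping timeout, and the PROBE/BOUNCE hooks are my assumptions and are not from the setup described here; the hooks exist only so the logic can be dry-run without touching real interfaces.

```shell
#!/bin/bash
TEST_IP=10.10.10.51   # same probe address as the script above
MAX_FAILS=3           # hypothetical threshold: bounce after 3 misses in a row
IFACES="eth0 eth1"

# Succeeds when TEST_IP answers one ping within 2 seconds.
probe() {
    ping -q -c1 -W 2 "$TEST_IP" > /dev/null 2>&1
}

# Take all interfaces down, then bring them all back up.
bounce_ifaces() {
    local i
    for i in $IFACES; do ifconfig "$i" down; done
    for i in $IFACES; do ifconfig "$i" up; done
}

# Run one check. $1 is the previous consecutive-failure count;
# the new count is printed on stdout.
# PROBE and BOUNCE are optional hooks naming replacement commands.
check_once() {
    local fails="$1"
    if "${PROBE:-probe}"; then
        echo 0
        return
    fi
    fails=$((fails + 1))
    if [ "$fails" -ge "$MAX_FAILS" ]; then
        echo "nurse-network: no reply from $TEST_IP, bouncing $IFACES" >&2
        "${BOUNCE:-bounce_ifaces}" > /dev/null 2>&1
        fails=0
    fi
    echo "$fails"
}

# Endless loop; start it in the background with: nurse_loop &
nurse_loop() {
    local fails=0
    while true; do
        fails=$(check_once "$fails")
        sleep 10
    done
}
```

With these values an interface stays down for roughly 30 seconds before it is bounced, so the threshold is a trade-off between false alarms and recovery time.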

Any suggestions?

mtasic85 commented 10 years ago

I have resolved the unreliable network issue. The problem was the Realtek driver! r8169 was loaded automatically for both the eth0 and eth1 interfaces. This driver has given me so many headaches. For some reason it completely stops every network interface on the compute nodes, which is why the script from the post above has to restart them on failure. This is unacceptable.

Unfortunately, r8168 does not ship with the Ubuntu 12.04.3 kernel 3.2, so I had to build and install it manually following this tutorial: http://djlab.com/2010/10/fixing-rtl8111-8168b-driver-debian-ubuntu/

Now I have r8168 loaded for eth0 and r8169 for eth1. The script from the previous post is no longer required.

I still need to test the reliability of the whole system. I will run more stress tests over the next couple of days and let you know how it goes.
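To confirm which driver is actually bound to each interface, the sysfs symlink for the device can be read directly. This is a minimal sketch; nic_driver and the SYSFS_NET override are names of my own, the latter only for dry runs. "ethtool -i eth0" reports the same information if ethtool is installed.

```shell
#!/bin/bash
# Print the kernel driver bound to a network interface, e.g. "r8169".
# Prints "unknown" for interfaces without a backing device (lo, bridges)
# or for names that do not exist.
# SYSFS_NET is a hypothetical override, useful only for dry runs.
nic_driver() {
    local link="${SYSFS_NET:-/sys/class/net}/$1/device/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo "unknown"
    fi
}

# Example on a compute node:
#   for i in eth0 eth1; do echo "$i $(nic_driver "$i")"; done
```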

mtasic85 commented 10 years ago

Unfortunately, in my experience r8168 and r8169 are terrible drivers/modules. People have been complaining about them all over the Internet for the last 5 years. Many suggest upgrading to the latest kernel, but since only OVS 1.4.0 and kernel 3.2 are available for Ubuntu 12.04.3 at the moment, we are forced to stay on them. The script from the original post works very well, so anyone can use it. We use it.

In the script from the initial post, we even tried "ip link set mtu 1400 dev eth0" and "ip link set mtu 1600 dev eth0" in place of "ifconfig eth0 down" and "ifconfig eth0 up", but after some time that stops working.

My next step will be to install an isolated Grizzly setup with only 3 machines, try the latest kernel available for 12.04.3, and make OVS work with it.

I have read almost every available guide for the last 4 versions of OpenStack, and I cannot find any OpenStack setting that could cause a network interface to go down, so I conclude that the problem is not in OpenStack.

I know this has nothing to do with the official guide, but I see complaints all over the forums, and the discussions always reach a dead end. I will keep you posted anyway until we find a solution to this problem.

mtasic85 commented 10 years ago

Finally, good news.

I have managed to resolve this issue. The r8168/r8169 kernel module was indeed causing all the problems. To resolve it, update the Linux kernel to 3.8 and OVS to 1.9.0. There is a catch, though, so follow the instructions below.

Kernel and OVS

On the Controller, Network, and all Compute nodes, do the following:

# apt-get -y update
# apt-get -y upgrade
# apt-get -y dist-upgrade

# apt-get install linux-image-generic-lts-raring linux-headers-generic-lts-raring

# update-grub
# update-grub2

# reboot

# apt-get install -y openvswitch-datapath-lts-raring-source
# apt-get install -y module-assistant
# module-assistant prepare
# cd /lib/modules/`uname -r`/build/include/linux
# ln -s ../generated/uapi/linux/version.h .
# module-assistant auto-install openvswitch-datapath-lts-raring

# reboot
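After the second reboot it is worth confirming that each node really runs the new kernel and OVS userspace. The sketch below is one way to check. The helper names are my own, and it assumes GNU "sort -V" is available and that the first line of "ovs-vsctl --version" ends with the version number.

```shell
#!/bin/bash
# True when version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Read "ovs-vsctl --version" output on stdin and print the version,
# i.e. the last word of the first line.
parse_ovs_version() {
    awk 'NR == 1 { print $NF }'
}

# Report whether this node meets the versions named in this post.
check_upgrade() {
    local kernel ovs
    kernel="$(uname -r | cut -d- -f1)"
    ovs="$(ovs-vsctl --version | parse_ovs_version)"
    if version_ge "$kernel" 3.8; then
        echo "kernel $kernel: ok"
    else
        echo "kernel $kernel: older than 3.8"
    fi
    if version_ge "$ovs" 1.9.0; then
        echo "ovs $ovs: ok"
    else
        echo "ovs $ovs: older than 1.9.0"
    fi
}

# Example: run check_upgrade on ctrl0, net0 and every compX.
```

This only checks the userspace tools. "lsmod | grep openvswitch" shows whether the datapath module built by module-assistant is loaded.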


OVS and DHCP

Once net0 and all compX machines are installed and configured, stop the quantum services on net0 and on the compX machines.

On net0:

# cd /etc/init.d/; for i in $( ls quantum-* ); do sudo service $i stop; done

On all compX:

# cd /etc/init.d/; for i in $( ls quantum-* ); do sudo service $i stop; done

On ctrl0:

# mysql -u root -p
> use quantum;
> select * from ovs_tunnel_endpoints;
> delete from ovs_tunnel_endpoints;

On net0:

# ovs-vsctl list-br

# ovs-vsctl del-br br-ex
# ovs-vsctl del-br br-int
# ovs-vsctl del-br br-tun

# ovs-vsctl add-br br-int
# ovs-vsctl add-br br-ex
# ovs-vsctl add-port br-ex eth1

# ovs-vsctl list-br

# cd /etc/init.d/; for i in $( ls quantum-* ); do sudo service $i restart; done

# ovs-vsctl list-br

On all compX machines:

# cd /etc/init.d/; for i in $( ls quantum-* ); do sudo service $i restart; done

Then hard reboot all instances.
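With many instances, this last step can be scripted from the controller instead of clicking through Horizon. The sketch below is untested against this deployment and the function names are my own. It assumes admin credentials are already exported in the shell (for example from a sourced rc file) and that "nova list" prints its usual table with the instance ID in the first column.

```shell
#!/bin/bash
# Read "nova list" table output on stdin and print one instance ID per line.
# A row counts as an instance when its first column is a 36-character ID
# made of hex digits and dashes; header and border lines are skipped.
instance_ids() {
    awk -F'|' 'NF > 2 {
        id = $2
        gsub(/[ \t]/, "", id)
        if (length(id) == 36 && id ~ /^[0-9a-f-]+$/) print id
    }'
}

# Hard reboot every instance visible to the admin user, one at a time.
hard_reboot_all() {
    nova list --all-tenants | instance_ids | while read -r id; do
        echo "hard rebooting $id"
        nova reboot --hard "$id"
    done
}

# Example on ctrl0, after loading admin credentials: hard_reboot_all
```

It may be safer to print the list with "nova list --all-tenants | instance_ids" first and review it before rebooting anything.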
