cloudius-systems / osv

OSv, a new operating system for the cloud.
osv.io

Adding two interfaces, only one getting DHCP-address #740

Open chrisbetz opened 8 years ago

chrisbetz commented 8 years ago

Hi,

I'd like to start my OSv image using two interfaces: eth0 for public network and eth1 for private networking.

I don't think that's possible with Capstan alone, so I worked out (borrowing from run.py and Capstan's code) how to start qemu manually with the proper settings. Things work well if I start with only the public or only the private network (with a single interface it ends up as eth0, obviously): the interface comes up and gets an IP address from the configured range. Fine.

As soon as I start with both networks, only one ('private') is assigned an IP address; the other ('public') gets "0.0.0.0". I believe this is because DHCP on the public network takes quite a long time to actually assign an IP, so the private interface gets its address first and OSv proceeds with boot, skipping the other interface.

This matches the behaviour during boot, when only the private IP address is reported on the console:

    sudo \                                  
    >       /usr/bin/qemu-system-x86_64 \
    >       -enable-kvm \
    >       -nographic \
    >       -m 1024 \
    >       -smp 2 \
    >       -cpu host,+x2apic \
    >       -device virtio-blk-pci,id=blk0,bootindex=0,drive=hd0 \
    >       -drive file=/root/.capstan/instances/qemu/elasticsearch.osv/disk.qcow2,if=none,id=hd0,aio=native,cache=none \
    >       -chardev stdio,mux=on,id=stdio,signal=off \
    >       -device isa-serial,chardev=stdio \
    >       -netdev bridge,id=hn0,br=br0,helper=/usr/libexec/qemu-bridge-helper \
    >       -device virtio-net-pci,netdev=hn0,id=nic0,mac=1a:3c:05:e9:57:08 \
    >       -netdev bridge,id=hn1,br=br-priv0,helper=/usr/libexec/qemu-bridge-helper \
    >       -device virtio-net-pci,netdev=hn1,id=nic1,mac=d6:d7:04:8b:f0:90 \
    >       -chardev socket,id=charmonitor,path=/root/.capstan/instances/qemu/elasticsearch.osv/osv.monitor,server,nowait \
    >       -mon chardev=charmonitor,id=monitor,mode=control
    OSv v0.24
    eth1: 10.0.10.35
    pthread_setcancelstate() stubbed
    [2016-04-29 09:20:06,083][INFO ][node                     ] [Spider-Man] version[1.5.2], pid[0], build[62ff986/2015-04-27T09:21:06Z]

Any help on this would be really, really appreciated, because we're planning to use OSv as the basis for our production setup (basically Clojure services, Cassandra, Riemann, and Elasticsearch, all running on OSv), and because in general I really like the idea (and most of the execution). Thanks a lot!!!

chrisbetz commented 8 years ago

Oh, and I'm able to start with static IPs, but I'm having trouble connecting (through 'public'). Any help on that would also be greatly appreciated, but I'm sure you'd need way more info on my setup.

nyh commented 8 years ago

Code in loader.cc (OSv's startup code) calls dhcp_start() once. That function calls dhcp_worker::init() once, and that loops over all interfaces and requests a separate IP address for each. This should work...

Could you please add more printouts in dhcp.cc, dhcp_worker::init() and other functions, to see why things are not working as expected?

One thing which looks suspicious to me (I'm not familiar with this code) is that dhcp_worker::init() stops retransmitting discovery packets as soon as one of the interfaces got an IP address. This part looks wrong. However, this can't explain the problem you're seeing when no retransmission is needed (?).

nyh commented 8 years ago

By the way, I didn't try your exact command, but I tried the following (which uses qemu's emulated network, not the bridged networking you used), and it did work:

    qemu-system-x86_64 -nographic -m 2G -smp 4 \
        -device virtio-blk-pci,id=blk0,bootindex=0,drive=hd0,scsi=off \
        -drive file=/home/nyh/osv/build/last/usr.img,if=none,id=hd0,cache=none,aio=native \
        -netdev user,id=un0,net=192.168.122.0/24,host=192.168.122.1 \
        -device virtio-net-pci,netdev=un0 \
        -netdev user,id=un1,net=192.168.123.0/24,host=192.168.123.1 \
        -device virtio-net-pci,netdev=un1 \
        -redir tcp:2222::22 \
        -device virtio-rng-pci -enable-kvm -cpu host,+x2apic \
        -chardev stdio,mux=on,id=stdio,signal=off \
        -mon chardev=stdio,mode=readline,default \
        -device isa-serial,chardev=stdio

The result:

    OSv v0.24-88-ga082362
    eth0: 192.168.122.15
    eth1: 192.168.123.15
    This image has an empty command line. Nothing to run.

Or with additional debugging messages (run "scripts/run.py -V" once before your qemu command):

    [I/42 dhcp]: Waiting for IP...
    [I/250 dhcp]: Server acknowledged IP for interface eth1
    eth1: 192.168.123.15
    [I/250 dhcp]: Configuring eth1: ip 192.168.123.15 subnet mask 255.255.255.0 gateway 192.168.123.1 MTU 1500
    [I/250 dhcp]: Server acknowledged IP for interface eth0
    eth0: 192.168.122.15
    [I/250 dhcp]: Configuring eth0: ip 192.168.122.15 subnet mask 255.255.255.0 gateway 192.168.122.1 MTU 1500

So it seems OSv's DHCP for two interfaces does work correctly... Maybe something was wrong in your bridged networking setup, somehow?

chrisbetz commented 8 years ago

Great info, I'll look into it on Monday. :) Thanks.

chrisbetz commented 8 years ago

Ok, with verbose output I see this happening:

    BSD shrinker: unlocked, running
    [I/33 dhcp]: Waiting for IP...
    [I/33 dhcp]: Waiting for IP...
    [I/216 dhcp]: Server acknowledged IP for interface eth1
    eth1: 10.0.10.35
    [I/216 dhcp]: Configuring eth1: ip 10.0.10.35 subnet mask 255.255.255.0 gateway 0.0.0.0 MTU 1500
    Running from /init/00-cmdline: /usr/mgmt/cloud-init.so --file /usr/mgmt/local-init.yaml;

I'm trying to dig deeper, but my setup uses Capstan, building on the base image "cloudius/osv-openjdk8" (v0.24), so in order to add more output or anything else, I'd need to switch my complete setup to custom-built images. Not my favored option, but I'll walk that road if necessary.

One thing that comes to mind: if there is no lag in one of the DHCP servers (e.g. when using qemu's built-in DHCP for both interfaces), you might not experience the problem at all. Just a guess.

@nyh Thanks for looking into that. I really appreciate that :)

nyh commented 8 years ago

On Mon, May 2, 2016 at 11:07 AM, chris_betz notifications@github.com wrote:

Ok, with verbose output I see this happening:

    BSD shrinker: unlocked, running
    [I/33 dhcp]: Waiting for IP...
    [I/33 dhcp]: Waiting for IP...

I'm curious why we see two of these messages. Did the first one time out after 3 seconds?

    [I/216 dhcp]: Server acknowledged IP for interface eth1
    eth1: 10.0.10.35
    [I/216 dhcp]: Configuring eth1: ip 10.0.10.35 subnet mask 255.255.255.0 gateway 0.0.0.0 MTU 1500
    Running from /init/00-cmdline: /usr/mgmt/cloud-init.so --file /usr/mgmt/local-init.yaml;

Hmm, is anything running after cloud-init? Or does the VM shut down at this point?

As I noted in https://github.com/cloudius-systems/osv/issues/740, the current code incorrectly stops waiting for IP addresses after receiving one. This is why we start running the command after having just one IP address. But if my guess is correct, you will see the second IP address being set later. Do you?

If this is what actually happens (we do get a second IP address later), then probably the wait logic in dhcp_worker::init() is the only thing broken.

One thing coming to my mind: If you do not have a lag in one DHCP (but using qemus included DHCP for both interfaces), you might not experience the problem at all. Just a guess.

Yes, this is why I think the problem might be in the code that waits for these IP addresses to be set. In my setup, both were set so quickly that, regardless of how buggy the wait code might have been, it probably wouldn't have mattered.

chrisbetz commented 8 years ago

Ok, short answer: the second DHCP answer is never received/processed, and the IP defaults to 0.0.0.0.

Sorry, on the run.

chrisbetz commented 8 years ago

Ok, here's some more info: the same setup is totally fine using a CoreOS VM. I get two IPs, and everything's fine. So it's not a host-side problem.

Answering stuff from above: Yes, my system is running after cloud-init, I'm not shutting down.

Yes, there might be a timeout on the "Waiting for IP..." step.

I'm not getting an IP afterwards. Never ever.