Cant boot vm using upstream autotest when nics number get larger

ypu commented 11 years ago

As now autotest always get vm object from env and virtnet is not updated. We will met problem when we run tests like this:

run boot (which will start a guest with one nic)
keep the env file and run boot again but change the cfg file to use two nics(add nics += " nic2" in cfg file)

we will meet this problem as in the second test the prepare work for nics get virtnet object from the pre vm object from last case. :

cmdline for second case: -device virtio-net-pci,mac=9a:e3:e4:e5:e6:e7,id=idfKDrOm,netdev=idhlb9is,bus=pci.0,addr=06 \ -netdev tap,id=idhlb9is,vhost=on,vhostfd=21,fd=20 \ -device virtio-net-pci,mac=9a:cb:cc:cd:ce:cf,id=idXVv625,netdev=idBRKPax,bus=pci.0,addr=07 \ -netdev tap,id=idBRKPax,vhost=on,ifname='t1-PGkoOH',downscript='no' \

and qemu-kvm will failed to start the guest: 20:09:21 INFO | [qemu output] /etc/qemu-ifup: could not launch network script 20:09:22 INFO | [qemu output] qemu: -netdev tap,id=idBRKPax,vhost=on,ifname=t1-PGkoOH,downscript=no: Device 'tap' could not be initialized 20:09:22 INFO | [qemu output](Process terminated with status 1)

But what we expect is guest start with two nics both using tapfd.

ypu commented 11 years ago

@cevich this is the problem of #978 try to resolve We also met another problem with nic in guest_s4 and @humanux will update the details later.

cevich commented 11 years ago

Just to make sure I'm not missing anything, you're starting out clean, no env file and no /tmp/address_pool file correct? All config files are default, except you changed base.cfg: nettype=user to`nettype=bridge``, correct?

Then, you run boot test, a VM instance is created, /tmp/address_pool is created, env is created. Test finishes as PASS.

Next, you change base.cfg: nics += " nic2". Leaving env and /tmp/address_pool in place.

Finally, you run boot test, and it breaks as described above.

Can you confirm the MAC address of the first nic in the first test is exactly the same as the first nic in the second test? After the first boot test finishes, is the VM left running or does it shutdown? Does the second test start or restart the VM? Any idea why the second nic seems to have different parameters than the first, are those being set in config somewhere?

humanux commented 11 years ago

@cevich your step is correct

Can you confirm the MAC address of the first nic in the first test is exactly the same as the first nic in the second test?

yes, the first nic mac is same

After the first boot test finishes, is the VM left running or does it shutdown?

no, the vm is running

Does the second test start or restart the VM?

yes, the second test will restart the vm.

Any idea why the second nic seems to have different parameters than the first, are those being set in config somewhere?

in our env_process preprcocess_vm will using the vm.virtnet directly when create new vm, the virnet of the new vm dosen't do any init.

humanux commented 11 years ago

another issues: @cevich

issue1:

add queues =4 in cfg file, then run 'boot' test
after test finished, modify the cfg file, just delete the line 'queues =4', then run boot test again you will find the vm also boot with queues=4 in second tests.

issue2:

run guest s4 test with queues =4 will raise error like "tap can not init" when resume guest

FengYang commented 11 years ago

On 11/12/2013 08:14 PM, Yiqiao Pu wrote:

As now autotest always get vm object from env and virtnet is not updated. We will met problem when we run tests like this:

run boot (which will start a guest with one nic)

keep the env file and run boot again but change the cfg file to use two nics(add nics += " nic2" in cfg file)

we will meet this problem as in the second test the prepare work for nics get virtnet object from the pre vm object from last case. :

cmdline for second case: -device virtio-net-pci,mac=9a:e3:e4:e5:e6:e7,id=idfKDrOm,netdev=idhlb9is,bus=pci.0,addr=06 \ -netdev tap,id=idhlb9is,vhost=on,vhostfd=21,fd=20 \ -device virtio-net-pci,mac=9a:cb:cc:cd:ce:cf,id=idXVv625,netdev=idBRKPax,bus=pci.0,addr=07 \ -netdev tap,id=idBRKPax,vhost=on,ifname='t1-PGkoOH',downscript='no' \

and qemu-kvm will failed to start the guest: 20:09:21 INFO | [qemu output] /etc/qemu-ifup: could not launch network script 20:09:22 INFO | [qemu output] qemu: -netdev tap,id=idBRKPax,vhost=on,ifname=t1-PGkoOH,downscript=no: Device 'tap' could not be initialized 20:09:22 INFO | qemu output <Process%20terminated%20with%20status%201>

But what we expect is guest start with two nics both using tapfd.

I remember we used to fix similar issue before. VM always try to use old parameter.

Now it happens in network part.

— Reply to this email directly or view it on GitHub https://github.com/autotest/virt-test/issues/1058.

ypu commented 11 years ago

@FengYang The old fix only makes the command line generate function works in the right way, but the prepare work for tapfds are not covered.

cevich commented 11 years ago

On 11/12/2013 09:36 PM, Yunping Zheng wrote:

@cevich your step is correct

Can you confirm the MAC address of the first nic in the first test is exactly the same as the first nic in the second test? yes, the first nic mac is same After the first boot test finishes, is the VM left running or does it shutdown? no, the vm is running Does the second test start or restart the VM? yes, the second test will restart the vm. Any idea why the second nic seems to have different parameters than the first, are those being set in config somewhere? in our env_process preprcocess_vm will using the vm.virtnet directly when create new vm, the virnet of the new vm dosen't do any init.

Excellent, I see how this problem is happening, it is caused by two things:

The local persistent cache /tmp/address_pool is updated by incoming params. Any items removed from params, but still present in cache will not be noticed. Instead, both will be mixed together which is not correct behavior.
Much of the virtnet setup is dependant on code in vm.__init__() but un-pickling does not automatically call it - so we have an invariant problem (something that should be changing but is not).
This is also a problem affecting libvirt testing, but worse-so: The vm's XML definition is never checked, and will additionally clash with address_pool and params.

The bad news is, the problem is embedded deep inside virtnet code and bound up with incorrect responsibility delegation (i.e. virtnet code tries to be "smart" in some areas). The good news is I believe I have a good understanding of what is causing these problems, and am far along into fixing it at the root:

1) Un-tangle responsibility delegation between virtnet and vm classes (done) 2) Update virtnet unittests to verify state is not mixed up (working on this now) 3) Update vm classes & env_process to account for changes (this is where I'll need #978)

You can follow my changes here: https://github.com/cevich/virt-test/compare/autotest:next...virtnet_modernize

cevich commented 11 years ago

@ypu Status: fixed bug from second unittest class, and finished writing third class. Found a few more bugs and fixed all except one. Also remembered I'll need a fourth unittest class for libvirt, but hopefully that one will be easier after having written the first three. Hopefully I can finish up all four by the end of tomorrow, then it's onto making them all work in the VM classes (which hopefully means deleting a lot of crusty code).

High-level of what I'm doing: Re-writing all the virtnet stuff to be more "dumb" and not try to make decisions. Instead, the logic of which networking data to use will be ONLY in vm class. This will make virtnet much simpler to use and easier for vm classes to take the correct actions.

ypu commented 11 years ago

@cevich You are so kind to let me know the status of this problem. Will following the virtnet_modernize branch and do some tests with it. Thanks a lot.

cevich commented 11 years ago

@ypu You said this was critical problem, no? Today I fixed many bugs and finished forth unittest class. However, I run tools/run_unittests.py to check, and found I have broken some other things on accident. I will investigate those, then tomorrow as I hoped, I can start getting all these virtnet changes into the VM classes w00t!

cevich commented 11 years ago

Okay, I just figured out the unittest failures are a timing issue uncovered by my un-related changes. Opened this issue #1074 Now I can start integrating my changes into virt_vm, libvirt_vm, and elsewhere :D

cevich commented 10 years ago

@ypu Sorry, forgot to update this yesterday. I've taken a bunch of bug fixes and small enhancements our of my patchset, and separated them to their own PR #1075 and opened an issue for the unittest concurrency problem I found as #1074. Assuming those can be handled in parallel, now I can focus on integration w/ a smaller set of changes on my branch.

cevich commented 10 years ago

@ypu Oh! I just found the cause for all this! A long while ago when I wrote virtnet, I made vm.__init__() to always call super(VM, self).__init__(...). That was required for the needs_restart() mechanism to work properly. I see now, at some point, somebody removed/reverted that. Though anyway, this code hasn't received much love recently, so I've got out the hatchet and am deleting all the cobwebs I can find :) Then I'll see what breaks and fix it more sensibly. If that approach fails, I've got a backup plan.

cevich commented 10 years ago

@ypu Today I learned of a tricky circular dependency that was causing unrelated unittests to fail, but only when run in parallel. Turns out, it was related to some changes I made in my branch. I had to revert those and re-implement a small section + it's unittest. Still, good to catch such a nasty problem early, yay unittests!

My work from yesterday was committed upstream, simplifying my branch. I started working through the virt_vm class, cleaning and getting it ready for the new virtnet. Tomorrow I will continue that work.

cevich commented 10 years ago

@ldoktor I'm looking at updating virtnet in qemu_vm and libvirt_vm to fix a state-mismatch problem @ypu is running into. However, I've come across a few uses of your qemu_devices, and it reads like it is solving a very similar problem.

That got me thinking, reading through your code, and pondering why qemu_vm should have two completely different object models for devices? I don't see a network-device implementation, but IIRC, you said qemu_devices were fully pickle-able, no? What's your time/feeling/outlook on writing an interface device, then replacing virtnet with it in qemu_kvm and env_process?

N/B: On the libvirt-side, we don't really even need virtnet since our virsh and libvirt_xml modules are mature enough now to take over.

I guess I'm trying to look at the big picture, if both qemu_vmand libvirt_vm could have their own specialized and consolidated models, instead of me trying to make virtnet into a "does-everything" for all vm types. I also think this could mean deleting a lot of old code, and replacing it with a smaller amount of fresh code. If I'm right, this may also be a faster way to get @ypu a working fix. What do you think?

@lmr you remember how tricky this networking stuff is, what's your opinion on dumping virtnet, and using qemu_devices for this?

(I'll keep plugging away at what I'm doing in case the above won't work or isn't a good idea).

ldoktor commented 10 years ago

Hi @cevich, yes, qemu_devices is fully pickable and I'd like to see all related code inside of it. The problem is that I still have a lot to cover in disk/devices area before I can move to other/networking stuffs. Anyway if someone else here would like to pick this subject, I'd support him/her. Anyway if this should work we have to prepare some basics of what is required on autotest and hypervizor-related networking code. You said that libvirt is almost prepared to takeover the control, so it could be a good starting point (please apology my ignorance in case it's already described somewhere, I'm avoiding networking as much as possible)

cevich commented 10 years ago

@ldoktor hahahaha, but will networking avoid you! It is very tricky with many corner cases, so I understand. Though by my reading of what you have already in vm.create and elsewhere, it seems "almost there". The problems we have are:

1) Networking-related params must be parsed in uniform way (for all of virt-tests) 2) Multiple VM's must have different & multiple networking setups including hardware and configs. 3) There must not be accidental clashes of allocated resources (FD's, MAC addresses, etc). 4) When VM instance is recovered from Env, it must use previous networking setup from address_pool (db). 5) Detect when a test's networing params are same or different from existing qemu process / VM instance.

The problem @ypu is having comes from current solutions to No. 2 and 5 (however, these have dependencies on the other items). As I see it currently, qemu_devices already has ability for almost all of these. The only parts that seem to be missing in qemu-devices is No 4. and then using it in vm.needs_restart().

In other words, saving the qemu_devices networking info (outside of Env), then using it again in vm.needs_restart() to check if incoming params match saved representation. Using your more qemu-specific code, instead of my more generic code to do the saving and comparing. Since in the end, this is the failing part for @ypu (virtnet is failing to spot a difference between params and saved-representation).

Do you think you could manage stripping out VirtNetDB, moving that functionality into qemu_devices, and update vm.needs_restart()?

cevich commented 10 years ago

@ldoktor yeah, looking at this more, it really seems to me that qemu_vm doesn't need VirtNetDB at all. It can just parse params into qdevice. Then needs_restart can just check self.devices == other.devices. The main job of VirtNetDB was this and making sure any generated mac addresses aren't handed out twice to different VM instances.

lmr commented 10 years ago

@ldoktor and @cevich, I've decided to push the interim solution by @humanux while you guys debate what is the best solution. After looking at the time line of this issue, I found not admissible that we've been with virt-test broken for so long for anything as trivial as running tests in sequence that change network configuration.

Feel free to revert the change when you have a proper solution.

ypu commented 10 years ago

@lmr thanks a lot for your efficient for helping with this problem. @cevich and @ldoktor Thanks for your discuss and time for resolve this problem. Still looking forward to your solution.

cevich commented 10 years ago

@ypu Okay, after discussing much with @lmr, he is on-board with having qemu-devices take over for a the address_pool part of virtnet in qemu_vm (formerly VirtNetDB). However, I just finished a first pass at migrating qemu_vm to use the new virtnet (+ minimal address_pool db stub). I noticed a lot of side-stepping around virtnet. This is worrying because the more things happen "on the side" the more we risk libvirt params interoperability problems (i.e. the whole point behind having common virtnet interface). So next time, just ask me for help, I'm nearly always willing.

Basically what I've done is simplify virtnet into just a params parser and collection of "helpers" for the vm and qemu-devices to consume. Once this reduction is working, @ldoktor can set about migrating the minimal virtnet address_cache db use over/info qemu-devices (should be relatively simple task).

Here's my latest work (likely broken in some spots): https://github.com/cevich/virt-test/compare/virtnet_reduction

I'll keep hammering away at this effort, to account for just a few small virtnet interface changes. There's bound to be some breakage, but not a lot that I haven't dealt with in the past. I'll keep y'all posted on my progress.

cevich commented 10 years ago

@ypu I've got most of the basic qemu-kvm tests "working", including multi-nic. However, as I know from experience there are likely some corner-cases and odd situations I need to test for. Then I'll need to go over libvirt and get it updated as well (should be much quicker than qemu_vm though). Finally, I'll start going through and polishing up my commits so we can test, test, test, test, test and maybe merge them soonish :)

cevich commented 10 years ago

@ypu Amazingly, after I got past a few unrelated problems, I am running qemu-kvm tests for boot, ping.multi_nics, reboot, and shutdown with no problem and new networking code. I decided to polish up commit messages today, and get PR ready so more testing can be done by other people. While that testing happens, I will work on getting libvirt tests to work again, with new code.

cevich commented 10 years ago

@ypu @lmr @ldoktor Yay! First draft: https://github.com/autotest/virt-test/pull/1217 Please check it out when you have time. I know it's HUGE set of changes, so I'm opening it up early so y'all can pick away at it in more manageable chunks over a longer period of time. I have no expectations of merging this quickly, I've spent a lot of time trying to get this "right" and want the merge to happen that way too. Thanks!

cevich commented 10 years ago

@ldoktor As far as qemu_devices go, let me know what you think about using it to replace the BaseVM's use of VirtnetDB. It's really qemu-kvm specific and probably doesn't need to be in the base module. It's only purpose is to make sure the same vm.instance gets the same mac addresses for it's corresponding nics over a test job. I'm also open to just leaving it this way if you see it as a huge burden to incorporate into qemu_devices, of if you don't think it makes sense to do so.

autotest / virt-test

Cant boot vm using upstream autotest when nics number get larger #1058