coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

testing ipv6 and other nontrivial networking scenarios #340

Open cgwalters opened 4 years ago

cgwalters commented 4 years ago

This is related to https://github.com/coreos/fedora-coreos-config/pull/259

Basically we need to beef up our network testing - testing in the initrd specifically.

One major downside of the recent push to use unprivileged qemu for testing is that networking is...hacky. It uses slirp and I'll just summarize things with:

walters@toolbox ~/s/g/s/libslirp> rg 'TODO.*v6'
src/misc.c
27:/* TODO: IPv6 */
236:    /* TODO: IPv6 */

src/slirp.c
756:            /* TODO: IPv6 */
970:/* TODO: IPv6 */
995:/* TODO: IPv6 */
1014:/* TODO: IPv6 */
1092:    /* TODO: IPv6 */

src/socket.c
689:    /* TODO: IPv6 */

src/tcp_subr.c
607:        /* TODO: IPv6 */
939:    /* TODO: IPv6 */

src/tcp_input.c
389:        /* TODO: IPv6 */
606:        /* TODO: IPv6 */

src/udp.c
318:    /* TODO: IPv6 */
walters@toolbox ~/s/g/s/libslirp> 

Specifically, one bug I saw but didn't chase down is that the slirp stack didn't seem to respond to DHCPv6 requests.

One path we can investigate is using libvirt - specifically real privileged libvirt. That way we're using libvirt networking including dnsmasq etc. which is heavily tested in all sorts of scenarios (including IPv6). I think to add this to our pipeline we'd end up in a nested virt setup, running a pod which runs a FCOS (or other) VM which runs libvirt, and our tests talk to it over qemu+ssh:// or so.
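To make the libvirt networking idea concrete, here is a minimal sketch of generating a libvirt network definition that hands out addresses over DHCPv6 (libvirt runs dnsmasq for this). The network name and the ULA subnet are made-up placeholders, and the network is left isolated (no `<forward>` element) to keep the example small; this is an illustration, not what the pipeline actually uses.

```python
import ipaddress
import xml.etree.ElementTree as ET


def ipv6_network_xml(name: str, subnet: str) -> str:
    """Build libvirt <network> XML for an isolated IPv6 network with DHCPv6.

    `name` and `subnet` are hypothetical values chosen by the caller.
    """
    net = ipaddress.IPv6Network(subnet)
    root = ET.Element("network")
    ET.SubElement(root, "name").text = name
    # The first host address in the subnet becomes the bridge/gateway address.
    ip = ET.SubElement(
        root, "ip", family="ipv6", address=str(net[1]), prefix=str(net.prefixlen)
    )
    # Lease range served by the dnsmasq instance that libvirt manages.
    dhcp = ET.SubElement(ip, "dhcp")
    ET.SubElement(dhcp, "range", start=str(net[0x100]), end=str(net[0x1FF]))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(ipv6_network_xml("kola-ipv6-test", "fd00:1234:5678::/64"))
```

The resulting XML could be fed to `virsh net-create` (transient network) or `virsh net-define` (persistent one) on the privileged libvirt host.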

cgwalters commented 4 years ago

Bigger picture, probably what we want is something like:

cgwalters commented 4 years ago

Though alternatively we could try to depend on https://kubevirt.io/ instead of the nested libvirt thingy...

Thinking about that more, a huge benefit of kubevirt would be that we get IaaS-like semantics around things like networking but can still use a kube-native flow for managing the VMs. A downside versus libvirt is that a ton of desktop Linux users have libvirt readily available, while very few have kubevirt set up locally.

cgwalters commented 4 years ago

https://github.com/coreos/coreos-assembler/pull/1046

cgwalters commented 4 years ago

For now, my thoughts are to deploy a FCOS VM in a pod exposed as a Kube service in the pipeline; we'd provision libvirt there, and have an SSH key secret - the pipeline would talk to it over qemu+ssh://. To deal with the inevitable leaks of resources, we'd ensure these VMs have a lifetime of at most a day or so.

This would involve nested virt, but at least we'd own both layers.

cgwalters commented 4 years ago

> One path we can investigate is using libvirt - specifically real privileged libvirt. That way we're using libvirt networking including dnsmasq etc. which is heavily tested in all sorts of scenarios (including IPv6). I think to add this to our pipeline we'd end up in a nested virt setup, running a pod which runs a FCOS (or other) VM which runs libvirt, and our tests talk to it over qemu+ssh:// or so.

To elaborate a bit on this...the thing is libvirt is really about "pets" by default. Who hasn't had to clean up an old unused stopped VM they were using from 3 months ago on their desktop?

And trying to share a libvirt instance across different CI tests runs into strong risk of conflict around allocating networks, etc. You really end up needing something like what the OpenShift installer is doing with Terraform to tag resources and help you deallocate.

Probably the simplest is to spin up a separate libvirt-enabled VM that is isolated to each pipeline run for CI/CD. This would be somewhat annoying for local development, so we could have a path that shortcuts that, but then we'd need to ensure the test framework generates "tagged" VM names etc. and not just hardcoded ones.

berrange commented 4 years ago

> One path we can investigate is using libvirt - specifically real privileged libvirt. That way we're using libvirt networking including dnsmasq etc. which is heavily tested in all sorts of scenarios (including IPv6). I think to add this to our pipeline we'd end up in a nested virt setup, running a pod which runs a FCOS (or other) VM which runs libvirt, and our tests talk to it over qemu+ssh:// or so.

> To elaborate a bit on this...the thing is libvirt is really about "pets" by default. Who hasn't had to clean up an old unused stopped VM they were using from 3 months ago on their desktop?

FWIW, libvirt isn't intended to be only for "pets". As an alternative to the "persistent" guests used by traditional virt apps like GNOME Boxes/virt-manager, where a config is saved in /etc/libvirt or $HOME/.libvirt, it also supports a notion of "transient" guests, where no configuration file for the guest is saved. A transient VM only exists for as long as it is running, and disappears when it is shut off. You can also have it forcibly shut off when the client which created it quits. The only thing that would be left behind for a transient guest is the log file under /var/log/libvirt/qemu. If that's a problem, we could likely provide a way to have the log file purged on shutoff too.
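For illustration, a minimal sketch of starting such a transient guest from the libvirt Python bindings. `createXML` starts a guest without saving its config, and the `VIR_DOMAIN_START_AUTODESTROY` flag asks libvirt to destroy it when the creating connection closes. The guest name, disk path, and sizing are placeholders, and the domain XML is deliberately bare-bones.

```python
import xml.etree.ElementTree as ET


def transient_domain_xml(name: str, disk_path: str, memory_mib: int = 2048) -> str:
    """Build a minimal KVM domain definition; all values are placeholders."""
    dom = ET.Element("domain", type="kvm")
    ET.SubElement(dom, "name").text = name
    ET.SubElement(dom, "memory", unit="MiB").text = str(memory_mib)
    ET.SubElement(dom, "vcpu").text = "2"
    os_el = ET.SubElement(dom, "os")
    ET.SubElement(os_el, "type").text = "hvm"
    devices = ET.SubElement(dom, "devices")
    disk = ET.SubElement(devices, "disk", type="file", device="disk")
    ET.SubElement(disk, "driver", name="qemu", type="qcow2")
    ET.SubElement(disk, "source", file=disk_path)
    ET.SubElement(disk, "target", dev="vda", bus="virtio")
    return ET.tostring(dom, encoding="unicode")


def start_transient(uri: str, domain_xml: str):
    """Start a transient guest that is destroyed when the connection closes."""
    import libvirt  # libvirt-python; imported lazily so the XML helper works without it

    conn = libvirt.open(uri)
    # No config is persisted; AUTODESTROY ties the guest's lifetime to `conn`.
    dom = conn.createXML(domain_xml, libvirt.VIR_DOMAIN_START_AUTODESTROY)
    return conn, dom


if __name__ == "__main__":
    print(transient_domain_xml("kola-transient-example", "/var/tmp/fcos.qcow2"))
```

The caller has to keep `conn` open for as long as the test runs; closing it (or the test process dying) tears the guest down.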

cgwalters commented 4 years ago

> FWIW, Libvirt isn't intended to be only for "pets".

Yes, I qualified this with "by default".

> The only thing that would be left behind for a transient guest is the log file under /var/log/libvirt/qemu. If that's a problem, we could likely provide a way to have the log file purged on shutoff too.

Definitely for these types of test scenarios we would want absolutely everything cleaned up. But per above I think by far the simplest would be to regularly spin up and tear down a nested VM for this to avoid all state leakage.

jlebon commented 4 years ago

We had some discussions about this today. There was rough agreement on keeping with the trend of using virt for privileged operations; cosa already requires /dev/kvm and already has code to stand up supermin VMs for privileged operations. So we could: (1) add back a qemu platform (or alternatively add a new libvirt platform) which assumes privs, then (2) have pipelines run e.g. cosa supermin kola -p qemu .... Local devs of course could just run kola directly.

Re. qemu vs libvirt, there was concern that libvirt was higher-level than we may want. Additionally, local devs who do have privs may not want kola fiddling with their libvirt config.