coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

vmware: support skipping DHCP in initramfs #518

Closed remoe closed 4 years ago

remoe commented 4 years ago

Is it really the case that when one defines two static networks like this:

FCOS Version: 31.20200505.3.0-vmware.x86_64

ETH0:

---
variant: fcos
version: 1.0.0
storage:
  files:
    - path: /etc/NetworkManager/system-connections/${iface}.nmconnection
      mode: 0600
      overwrite: true
      contents:
        inline: |
          [connection]
          id=${iface}
          type=ethernet
          autoconnect=true
          interface-name=${iface}
          match-device=interface-name:${iface}

          [ipv4]
          method=manual
          addresses=${cidr}
          dns=127.0.0.1

          [ipv6]
          addr-gen-mode=stable-privacy
          dns-search=
          method=ignore

ETH1:

---
variant: fcos
version: 1.0.0
storage:
  files:
    - path: /etc/NetworkManager/system-connections/${iface}.nmconnection
      mode: 0600
      overwrite: true
      contents:
        inline: |
          [connection]
          type=ethernet
          id=${iface}
          autoconnect=true
          interface-name=${iface}
          match-device=interface-name:${iface}

          [ipv4]
          method=manual
          addresses=${cidr}
          dns=${dns}
          gateway=${gateway}

          [ipv6]
          addr-gen-mode=stable-privacy
          dns-search=
          method=ignore

Then NetworkManager does this at boot?

[   49.032590] NetworkManager[541]: <warn>  [1591289223.7928] dhcp4 (eth0): request timed out
[   49.033126] NetworkManager[541]: <info>  [1591289223.7929] dhcp4 (eth0): state changed unknown -> timeout
[   49.038016] NetworkManager[541]: <info>  [1591289223.7986] dhcp4 (eth0): canceled DHCP transaction
[   49.038377] NetworkManager[541]: <info>  [1591289223.7986] dhcp4 (eth0): state changed timeout -> done
[   49.038741] NetworkManager[541]: <warn>  [1591289223.7987] dhcp4 (eth1): request timed out
[   49.039096] NetworkManager[541]: <info>  [1591289223.7987] dhcp4 (eth1): state changed unknown -> timeout
[   49.048341] NetworkManager[541]: <info>  [1591289223.8066] dhcp4 (eth1): canceled DHCP transaction
[   49.048726] NetworkManager[541]: <info>  [1591289223.8067] dhcp4 (eth1): state changed timeout -> done
[   49.049122] NetworkManager[541]: <info>  [1591289223.8067] device (eth0): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
[   49.049688] NetworkManager[541]: <warn>  [1591289223.8071] device (eth0): Activation: failed for connection 'Wired Connection'
[   49.050134] NetworkManager[541]: <info>  [1591289223.8072] device (eth0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
[   49.050626] NetworkManager[541]: <info>  [1591289223.8080] policy: set 'Wired Connection' (eth1) as default for IPv6 routing and DNS
[   49.051081] NetworkManager[541]: <info>  [1591289223.8082] policy: auto-activating connection 'Wired Connection' (a863e20e-02bb-411a-8596-0330cee8cae1)
[   49.051573] NetworkManager[541]: <info>  [1591289223.8085] device (eth0): Activation: starting connection 'Wired Connection' (a863e20e-02bb-411a-8596-0330cee8cae1)
[   49.052122] NetworkManager[541]: <info>  [1591289223.8086] device (eth0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
[   49.052624] NetworkManager[541]: <info>  [1591289223.8086] device (eth0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
[   49.053123] NetworkManager[541]: <info>  [1591289223.8088] device (eth0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
[   49.053612] NetworkManager[541]: <info>  [1591289223.8089] dhcp4 (eth0): activation: beginning transaction (timeout in 45 seconds)
[   50.022423] NetworkManager[541]: <warn>  [1591289224.7830] dhcp6 (eth1): request timed out
[   50.022866] NetworkManager[541]: <info>  [1591289224.7830] dhcp6 (eth1): state changed unknown -> timeout
[   50.023259] NetworkManager[541]: <info>  [1591289224.7831] dhcp6 (eth1): canceled DHCP transaction
[   50.023625] NetworkManager[541]: <info>  [1591289224.7831] dhcp6 (eth1): state changed timeout -> done
[   50.024014] NetworkManager[541]: <info>  [1591289224.7832] device (eth1): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
[   50.024586] NetworkManager[541]: <warn>  [1591289224.7836] device (eth1): Activation: failed for connection 'Wired Connection'
[   50.025103] NetworkManager[541]: <info>  [1591289224.7837] device (eth1): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
[   50.025652] NetworkManager[541]: <info>  [1591289224.7844] policy: set-hostname: set hostname to 'localhost.localdomain' (no default device)
[   50.026134] NetworkManager[541]: <info>  [1591289224.7845] policy: auto-activating connection 'Wired Connection' (a863e20e-02bb-411a-8596-0330cee8cae1)
[   50.026623] NetworkManager[541]: <info>  [1591289224.7848] device (eth1): Activation: starting connection 'Wired Connection' (a863e20e-02bb-411a-8596-0330cee8cae1)
[   50.027449] NetworkManager[541]: <info>  [1591289224.7848] device (eth1): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
[   50.027960] NetworkManager[541]: <info>  [1591289224.7849] device (eth1): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
[   50.028438] NetworkManager[541]: <info>  [1591289224.7851] device (eth1): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
[   50.028949] NetworkManager[541]: <info>  [1591289224.7851] dhcp4 (eth1): activation: beginning transaction (timeout in 45 seconds)
[   52.023437] NetworkManager[541]: <info>  [1591289226.7837] dhcp6 (eth1): activation: beginning transaction (timeout in 45 seconds)
[   94.027287] NetworkManager[541]: <warn>  [1591289268.7874] dhcp4 (eth0): request timed out
[   94.027753] NetworkManager[541]: <info>  [1591289268.7875] dhcp4 (eth0): state changed unknown -> timeout
[   94.030035] NetworkManager[541]: <info>  [1591289268.7906] dhcp4 (eth0): canceled DHCP transaction
[   94.030415] NetworkManager[541]: <info>  [1591289268.7906] dhcp4 (eth0): state changed timeout -> done
[   94.030780] NetworkManager[541]: <info>  [1591289268.7907] device (eth0): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
[   94.031402] NetworkManager[541]: <warn>  [1591289268.7912] device (eth0): Activation: failed for connection 'Wired Connection'
[   94.031900] NetworkManager[541]: <info>  [1591289268.7913] device (eth0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
[   94.032450] NetworkManager[541]: <info>  [1591289268.7921] policy: set 'Wired Connection' (eth1) as default for IPv6 routing and DNS
[   94.032887] NetworkManager[541]: <info>  [1591289268.7923] policy: auto-activating connection 'Wired Connection' (a863e20e-02bb-411a-8596-0330cee8cae1)
[   94.033399] NetworkManager[541]: <info>  [1591289268.7925] device (eth0): Activation: starting connection 'Wired Connection' (a863e20e-02bb-411a-8596-0330cee8cae1)
[   94.033963] NetworkManager[541]: <info>  [1591289268.7926] device (eth0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
[   94.034459] NetworkManager[541]: <info>  [1591289268.7926] device (eth0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
[   94.034944] NetworkManager[541]: <info>  [1591289268.7928] device (eth0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
[   94.035462] NetworkManager[541]: <info>  [1591289268.7929] dhcp4 (eth0): activation: beginning transaction (timeout in 45 seconds)
[   95.022718] NetworkManager[541]: <warn>  [1591289269.7832] dhcp4 (eth1): request timed out
[   95.023165] NetworkManager[541]: <info>  [1591289269.7833] dhcp4 (eth1): state changed unknown -> timeout
[   95.025007] NetworkManager[541]: <info>  [1591289269.7856] dhcp4 (eth1): canceled DHCP transaction
[   95.025418] NetworkManager[541]: <info>  [1591289269.7856] dhcp4 (eth1): state changed timeout -> done
[   97.023037] NetworkManager[541]: <warn>  [1591289271.7832] dhcp6 (eth1): request timed out
[   97.023439] NetworkManager[541]: <info>  [1591289271.7833] dhcp6 (eth1): state changed unknown -> timeout
[   97.023824] NetworkManager[541]: <info>  [1591289271.7834] dhcp6 (eth1): canceled DHCP transaction
[   97.024207] NetworkManager[541]: <info>  [1591289271.7834] dhcp6 (eth1): state changed timeout -> done
[   97.024570] NetworkManager[541]: <info>  [1591289271.7834] device (eth1): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
[   97.025134] NetworkManager[541]: <warn>  [1591289271.7838] device (eth1): Activation: failed for connection 'Wired Connection'
[   97.025569] NetworkManager[541]: <info>  [1591289271.7839] device (eth1): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
[  139.032964] NetworkManager[541]: <warn>  [1591289313.7930] dhcp4 (eth0): request timed out
[  139.033467] NetworkManager[541]: <info>  [1591289313.7931] dhcp4 (eth0): state changed unknown -> timeout
[  139.035145] NetworkManager[541]: <info>  [1591289313.7956] dhcp4 (eth0): canceled DHCP transaction
[  139.035525] NetworkManager[541]: <info>  [1591289313.7957] dhcp4 (eth0): state changed timeout -> done
[  139.035883] NetworkManager[541]: <info>  [1591289313.7958] device (eth0): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
[  139.036464] NetworkManager[541]: <info>  [1591289313.7958] manager: NetworkManager state is now DISCONNECTED
[  139.036931] NetworkManager[541]: <info>  [1591289313.7958] manager: startup complete
[  139.037279] NetworkManager[541]: <info>  [1591289313.7958] quitting now that startup is complete
[  139.038746] NetworkManager[541]: <warn>  [1591289313.7962] device (eth0): Activation: failed for connection 'Wired Connection'
[  139.039211] NetworkManager[541]: <info>  [1591289313.7971] exiting (success)

I know about the current issue with static networks. But I'm not sure if my configuration is correct. Is it currently possible to apply a workaround to avoid these DHCP requests?

dustymabe commented 4 years ago

I know about the current issue with static networks.

Which issue are you referring to?

But I'm not sure if my configuration is correct. Is it currently possible to apply a workaround to avoid these DHCP requests?

I think you are asking "is it possible to not request DHCP during the initramfs on firstboot?". Is that correct?

Assuming that is the question you are asking, part of the answer depends on what platform you are running on, but in general the answer is "yes" because you can remove `ip=dhcp rd.neednet=1` from the kernel command line when you boot the machine. Also, soon we will be doing some more intelligent checking and will automatically not bring up the network in the initramfs on first boot if you don't need it.

remoe commented 4 years ago

Which issue are you referring to?

https://github.com/coreos/fedora-coreos-tracker/issues/460

I think you are asking "is it possible to not request DHCP during the initramfs on firstboot?". Is that correct?

yes

Assuming that is the question you are asking, part of the answer depends on what platform you are running on, but in general the answer is "yes" because you can remove `ip=dhcp rd.neednet=1` from the kernel command line when you boot the machine. Also, soon we will be doing some more intelligent checking and will automatically not bring up the network in the initramfs on first boot if you don't need it.

Thanks for this! Yes, the kernel boot parameters include this:

BOOT_IMAGE=(hd0,gpt1)/ostree/fedora-coreos-3f0f8290c4dbaf9270d7eae910c07b016a61206136941fa97915fda63b11c32c/vmlinuz-5.6.8-200.fc31.x86_64 mitigations=auto,nosmt systemd.unified_cgroup_hierarchy=0 console=tty0 console=ttyS0,115200n8 ignition.firstboot rd.neednet=1 ip=dhcp,dhcp6 ostree=/ostree/boot.1/fedora-coreos/3f0f8290c4dbaf9270d7eae910c07b016a61206136941fa97915fda63b11c32c/0 ignition.platform.id=vmware

But I haven't found an example of how to remove ip=dhcp,dhcp6 with Ignition.

dustymabe commented 4 years ago

The rd.neednet=1 ip=dhcp,dhcp6 options get dynamically added there only on first boot. If you stop grub and manually remove them on that first boot and your system is still happy you don't have to worry about it on subsequent boots at all.

As I mentioned before all of this is going to change soon (with #460) so you won't need to do this, assuming your Ignition configuration doesn't need the network.

remoe commented 4 years ago

Thanks, but I need it on first boot without manual editing, because I set up a bunch of VMs with Terraform and one of these VMs provides DNS and DHCP. This needs to boot fast without hanging on a non-existent DHCP server. Can I test the latest changes with an official test release?

dustymabe commented 4 years ago

Thanks, but I need it on first boot without manual editing, because I set up a bunch of VMs with Terraform and one of these VMs provides DNS and DHCP.

A bunch of VMs - that's great!

This needs to boot fast without hanging on a non-existent DHCP server. Can I test the latest changes with an official test release?

Unfortunately the code is under review right now and hasn't landed in our development builds yet, but it is expected to soon. From the original title of this issue, it looks like you're using the OVA file for VMware? I know how to crack open and edit a qcow2, but not the OVA. Here's how you would do it for the qcow2: run `virt-edit -a ./foo.qcow2 -m /dev/sda1 /ignition.firstboot` and delete the `set ignition_network_kcmdline='rd.neednet=1 ip=dhcp,dhcp6'` line (making the file empty). If you figure out how to do that for the OVA, you could then take that image and use it for your VMs.
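For anyone scripting this edit rather than doing it by hand, here is a minimal Python sketch of the text change itself. It assumes the contents of `/ignition.firstboot` have already been read into a string; the function name `strip_network_kcmdline` is hypothetical, and the sketch does not touch the disk image.

```python
def strip_network_kcmdline(firstboot_text: str) -> str:
    """Return the file contents without the ignition_network_kcmdline line.

    Hypothetical helper: it only filters text and does not open or
    modify a qcow2/OVA image.
    """
    kept = [
        line
        for line in firstboot_text.splitlines(keepends=True)
        # Drop the line that injects the first-boot network kargs.
        if not line.lstrip().startswith("set ignition_network_kcmdline=")
    ]
    return "".join(kept)


original = "set ignition_network_kcmdline='rd.neednet=1 ip=dhcp,dhcp6'\n"
edited = strip_network_kcmdline(original)
print(repr(edited))
```

With the single-line file described above, the result is an empty string, which matches "making the file empty".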

remoe commented 4 years ago

Thanks, I found an interesting blog article about something similar: https://www.openshift.com/blog/openshift-upi-using-static-ips. I'll try to modify the OVA.

lucab commented 4 years ago

@remoe this is a work in progress at the moment. The logic is implemented in https://github.com/coreos/afterburn/pull/404 but the OS plumbing part is still ongoing. I'll add some docs when this is readily available in FCOS, keep watching this space.

remoe commented 4 years ago

@lucab, thanks! Here is a hint for others to solve this issue temporarily:

wget https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/31.20200517.3.0/x86_64/fedora-coreos-31.20200517.3.0-vmware.x86_64.ova
sudo apt-get install qemu-utils
tar xfv fedora-coreos-31.20200517.3.0-vmware.x86_64.ova pkg
qemu-img convert -f vmdk -O raw pkg/disk.vmdk disk.raw
mkdir disk
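sudo mount -o loop,offset=1048576 disk.raw disk
# Manual step: edit disk/ignition.firstboot so that ignition_network_kcmdline is set to an empty value
# Unmount so the change is flushed to disk.raw before converting back
sudo umount disk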
qemu-img convert -f raw -O vmdk disk.raw pkg/disk.vmdk -o adapter_type=lsilogic,subformat=streamOptimized,compat6
ovftool --lax --sourceType=OVF --targetType=OVA pkg/coreos.ovf fedora-coreos-31.20200517.3.0-vmware-nodhcp.x86_64.ova

dustymabe commented 4 years ago

and you're unblocked?

cgwalters commented 4 years ago

Right, in https://bugzilla.redhat.com/show_bug.cgi?id=1836248 we agreed to support this workflow.

But to emphasize Luca's response first: we have a much better fix for this incoming for vSphere, and the Live ISO is a fully scriptable tool for complex networking that applies across bare metal, vSphere, etc.

remoe commented 4 years ago

@dustymabe, yes, it boots without DHCP requests. Great!

remoe commented 4 years ago

Is it correct that 32.20200601.3.0 should support "guestinfo.afterburn.initrd.network-kargs" (https://github.com/coreos/afterburn/pull/404/commits/d4aa045419fa4d3008c6bd19c0a87e136435feac), because it uses afterburn-4.4.0? So this should fix this issue?

lucab commented 4 years ago

@remoe no, because https://github.com/coreos/fedora-coreos-config/pull/458 has not reached the stable stream yet. It should be available on current tip of testing stream though, i.e. 32.20200615.2.0.

remoe commented 4 years ago

@lucab thanks, but I don't understand the "Release Date: 16 Jun, 2020". Does this date have nothing to do with the image version name?

lucab commented 4 years ago

I don't understand the "Release Date: 16 Jun, 2020". Does this date have nothing to do with the image version name?

Correct (if you read some documentation that states otherwise, please let us know as it may need to be fixed).

The second field in a release version is only to track when we snapshot the set of packages from Fedora. For stable, there is an almost guaranteed mismatch with the release date, due to the stabilization time and testing->stable promotion. This is covered with more details in https://github.com/coreos/fedora-coreos-tracker/blob/master/Design.md#version-numbers.
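As a rough illustration of how the fields split, here is a small Python sketch. It assumes the four-field layout implied in this thread: Fedora major version, package snapshot date, stream identifier, then a fourth field that the sketch calls a revision (that name is an assumption). The stream mapping only includes the two values confirmed here (2 for testing, 3 for stable); the names `FcosVersion` and `parse_fcos_version` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, datetime

# Only the identifiers confirmed in this thread; other streams are left unmapped.
STREAMS = {2: "testing", 3: "stable"}


@dataclass(frozen=True)
class FcosVersion:
    fedora_major: int
    snapshot_date: date  # when the package set was snapshotted, not the release date
    stream_id: int
    revision: int  # assumed meaning of the fourth field

    @property
    def stream(self) -> str:
        return STREAMS.get(self.stream_id, f"unknown({self.stream_id})")


def parse_fcos_version(version: str) -> FcosVersion:
    """Split a version such as '32.20200615.2.0' into its four fields."""
    parts = version.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dot-separated fields, got {version!r}")
    major, snapshot, stream_id, revision = parts
    return FcosVersion(
        fedora_major=int(major),
        snapshot_date=datetime.strptime(snapshot, "%Y%m%d").date(),
        stream_id=int(stream_id),
        revision=int(revision),
    )


v = parse_fcos_version("32.20200615.2.0")
print(v.stream, v.snapshot_date.isoformat())
```

The snapshot date parsed here (2020-06-15) is the package snapshot, which is why it can differ from the release date shown on the download page.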

jlebon commented 4 years ago

Hmm, maybe we should add a Snapshot Date: or similar under Release Date: to make this clearer.

dustymabe commented 4 years ago

@remoe - were you able to test out the latest testing stream release?

remoe commented 4 years ago

@dustymabe I've tried it today with the following "vsphere_virtual_machine" parameters:

fedora-coreos-32.20200615.2.1-vmware.x86_64.ova

resource "vsphere_virtual_machine" "test-vm" {
...
  vapp {
    properties = {
      "guestinfo.ignition.config.data.encoding" = "base64"
      "guestinfo.ignition.config.data"          = base64encode(data.ct_config.simple-vm.rendered)
      "guestinfo.afterburn.initrd.network-kargs" = ""
    }
  }
}

But this throws the following error:

unsupported vApp properties in vapp.properties: [guestinfo.afterburn.initrd.network-kargs]

@lucab this should work without any govc call, right? I don't know whether this needs to be changed: https://github.com/coreos/coreos-assembler/blob/master/src/vmware-template.xml#L84

lucab commented 4 years ago

@remoe yes, it will work with anything which can set guestinfo properties.

I think you are hitting a known limitation in Terraform, see https://www.terraform.io/docs/providers/vsphere/r/virtual_machine.html#using-vapp-properties-to-supply-ovf-ova-configuration.

The template does not need to (and should not) be changed because that property is optional: it already has a distribution default, and an empty value is a valid override (like in your case).

You should check this with some Terraform expert (which I'm not), but I think some viable approaches to try are:

If any of the above works, it would be nice to note that down here.

remoe commented 4 years ago

@lucab, extra_config doesn't seem to do anything, so it doesn't work? Then I've created the vapp property "guestinfo.afterburn.initrd.network-kargs" in the template. After that I can create the VM with:

    properties = {
      "guestinfo.ignition.config.data.encoding" = "base64"
      "guestinfo.ignition.config.data"          = base64encode(data.ct_config.simple-vm.rendered)
      "guestinfo.afterburn.initrd.network-kargs" = ""
    }

After boot it still has "rd.neednet=1" in the boot cmdline (/proc/cmdline):

BOOT_IMAGE=(hd0,gpt1)/ostree/fedora-coreos-d00b3bc7e6d2b2fecf92b37cbddf9d45284cea92b477234e37d32dd277083ab9/vmlinuz-5.6.18-300.fc32.x86_64 mitigations=auto,nosmt systemd.unified_cgroup_hierarchy=0 console=tty0 console=ttyS0,115200n8 ignition.firstboot rd.neednet=1 ostree=/ostree/boot.1/fedora-coreos/d00b3bc7e6d2b2fecf92b37cbddf9d45284cea92b477234e37d32dd277083ab9/0 ignition.platform.id=vmware

And it hangs here:

[image not preserved]

lucab commented 4 years ago

Then I've created the vapp property "guestinfo.afterburn.initrd.network-kargs" in the template.

For reference, how did you do that? Is the importing and setting flow automated as well?

After boot it still has "rd.neednet=1" in the boot cmdline

Yes, that's expected. It is going to be tackled with https://github.com/coreos/fedora-coreos-tracker/issues/443 but Ignition needs to be enhanced first.

For the moment you can only set up the static address directly via ip= config to avoid DHCP, but you cannot opt out of networking as a whole.

remoe commented 4 years ago

Another try:

[image not preserved]

BOOT_IMAGE=(hd0,gpt1)/ostree/fedora-coreos-d00b3bc7e6d2b2fecf92b37cbddf9d45284cea92b477234e37d32dd277083ab9/vmlinuz-5.6.18-300.fc32.x86_64 mitigations=auto,nosmt systemd.unified_cgroup_hierarchy=0 console=tty0 console=ttyS0,115200n8 ignition.firstboot rd.neednet=1 ostree=/ostree/boot.1/fedora-coreos/d00b3bc7e6d2b2fecf92b37cbddf9d45284cea92b477234e37d32dd277083ab9/0 ignition.platform.id=vmware

no ip= added.

lucab commented 4 years ago

@remoe that's correct, you won't see it in the bootloader command-line, as it is applied later on. If you check your logs, there is a further log line with a BOOT_IMAGE entry produced by Dracut which should have that (it should also contain some btrfs parameter).

remoe commented 4 years ago

@lucab journalctl -u dracut-cmdline --no-pager

dracut-cmdline[317]: Using kernel command line parameters: rd.driver.pre=btrfs ip=dhcp,dhcp6 BOOT_IMAGE=(hd0,gpt1)/ostree/fedora-coreos-d00b3bc7e6d2b2fecf92b37cbddf9d45284cea92b477234e37d32dd277083ab9/vmlinuz-5.6.18-300.fc32.x86_64 mitigations=auto,nosmt systemd.unified_cgroup_hierarchy=0 console=tty0 console=ttyS0,115200n8 ignition.firstboot rd.neednet=1 ostree=/ostree/boot.1/fedora-coreos/d00b3bc7e6d2b2fecf92b37cbddf9d45284cea92b477234e37d32dd277083ab9/0 ignition.platform.id=vmware

lucab commented 4 years ago

@remoe ip=dhcp,dhcp6 is the distribution default.

That seems to hint at the fact that the guestinfo is not properly set on the VM. Can you please first try the govc way (documentation PR is in flight here), so that we can find out where the issue is by excluding components?
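The check described here (looking at which `ip=` value Dracut actually used) can be sketched in Python. This is only an illustration under assumptions: the command line is passed in as a plain string, the sample below is a shortened form of the log line above, and the helper names `karg_values` and `uses_default_dhcp` are hypothetical.

```python
from typing import List


def karg_values(cmdline: str, key: str) -> List[str]:
    """Return every value given for `key=` on a kernel command line."""
    prefix = key + "="
    return [tok[len(prefix):] for tok in cmdline.split() if tok.startswith(prefix)]


def uses_default_dhcp(cmdline: str) -> bool:
    """True if the only ip= argument is the distribution default."""
    return karg_values(cmdline, "ip") == ["dhcp,dhcp6"]


# Shortened sample based on the dracut-cmdline log line in this thread.
sample = (
    "rd.driver.pre=btrfs ip=dhcp,dhcp6 console=tty0 console=ttyS0,115200n8 "
    "ignition.firstboot rd.neednet=1 ignition.platform.id=vmware"
)
print(karg_values(sample, "ip"), uses_default_dhcp(sample))
```

If this reports the default while a guestinfo override was intended, the override most likely never reached the VM.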

remoe commented 4 years ago

@lucab, thanks. This works:

VM_NAME="test-vm"
IPCFG="ip=x.x.x.x::x.x.x.x:255.255.255.224:${VM_NAME}:eth0:off"
govc vm.clone -host=host -ds=ds -on=false -vm template-coreos-32.20200615.2.1 ${VM_NAME}
govc vm.change -vm "${VM_NAME}" -e "guestinfo.afterburn.initrd.network-kargs=${IPCFG}"
govc vm.power -on "${VM_NAME}"

remoe commented 4 years ago

@lucab it is solved ;) It works with Terraform with the following parameters:

vapp {
    properties = {
        "guestinfo.ignition.config.data.encoding" = "base64"
        "guestinfo.ignition.config.data"          = base64encode(data.ct_config.simple-vm.rendered)
    }
}

extra_config = {
    # https://www.man7.org/linux/man-pages/man7/dracut.cmdline.7.html
    "guestinfo.afterburn.initrd.network-kargs" = "ip=x.x.x.x::x.x.x.x:255.255.255.224:mynode01:eth0:off"
}

And the template does not need to be changed!
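For reference, the `ip=` value used above follows the colon-separated static form from dracut.cmdline(7): client IP, peer (left empty here), gateway, netmask, hostname, interface, and `off` to disable autoconfiguration. A minimal Python sketch for building that string is below; the function name `static_ip_karg` is hypothetical, and the addresses in the usage line are documentation-range placeholders, not values from this setup.

```python
import ipaddress


def static_ip_karg(address: str, gateway: str, netmask: str,
                   hostname: str, interface: str, peer: str = "") -> str:
    """Build a dracut-style static ip= kernel argument.

    Layout: ip=<client-IP>:[<peer>]:<gateway-IP>:<netmask>:<hostname>:<interface>:off
    """
    # Basic validation; each call raises ValueError on malformed input.
    ipaddress.IPv4Address(address)
    ipaddress.IPv4Address(gateway)
    ipaddress.IPv4Network(f"0.0.0.0/{netmask}")  # rejects invalid netmasks
    return f"ip={address}:{peer}:{gateway}:{netmask}:{hostname}:{interface}:off"


# Placeholder addresses from the 192.0.2.0/24 documentation range.
print(static_ip_karg("192.0.2.10", "192.0.2.1", "255.255.255.224", "mynode01", "eth0"))
```

The resulting string is what would go into "guestinfo.afterburn.initrd.network-kargs", whether it is set through govc or through Terraform's extra_config.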

lucab commented 4 years ago

@remoe ack, that's very good feedback. So there seems to be some friction between terraform, vapp properties and guestinfo entries. I don't have a proper terraform setup at hand to play with right now, but I'll keep this on my radar. Thanks for experimenting with us!

remoe commented 4 years ago

@lucab, thanks as well :) I'm happy to help.

dustymabe commented 4 years ago

Thanks @remoe for helping confirm this. Since the fix is in testing stream release 32.20200615.2.0 and newer I'm going to close this. I will also update the ticket when it hits stable.

cgwalters commented 4 years ago

The OpenShift installer currently uses Terraform as a backend on some platforms, so this is also useful to prove out how IPI vSphere could make use of this for static IP addressing (if desired). Thanks for trying it out!

dustymabe commented 4 years ago

The fix for this went into stable stream release 32.20200615.3.0.