[RHCOS] running kola in qemu_unpriv mode causes crio.service failure to start

miabbott commented 5 years ago

When running kola against RHCOS using qemu_unpriv is causing all the tests to fail because the crio.service is failing to start:

$ bin/kola run --debug --qemu-image rhcos-410.8.20190520.0-qemu.qcow2 -b rhcos -p qemu-unpriv --ignition-version v2

<snip>

=== RUN   coreos.selinux.enforce
=== RUN   coreos.ignition.sethostname
=== RUN   rhcos.selinux.boolean.persist
=== RUN   rpmostree.upgrade-rollback
=== RUN   coreos.selinux.boolean
=== RUN   coreos.ignition.once
=== RUN   rhcos.basic
=== RUN   podman.network
=== RUN   rhcos.selinux.manage
=== RUN   ostree.basic
=== RUN   podman.base
=== RUN   rpmostree.status
=== RUN   coreos.auth.verify
=== RUN   coreos.ignition.v2.users
=== RUN   coreos.ignition.groups
2019-05-22T15:46:50Z platform: disabling auto-read-only for QEMU drives
2019-05-22T15:46:50Z platform/machine/qemu: NewMachine: ["qemu-system-x86_64" "-machine" "accel=kvm" "-cpu" "host" "-m" "1024" "-bios" "bios-256k.bin" "-smp" "1" "-uuid" "ba98fd19-8e2a-499c-ad1e-683684c66a4e" "-di
splay" "none" "-chardev" "file,id=log,path=_kola_temp/qemu-unpriv-2019-05-22-1146-19764/coreos.selinux.enforce/ba98fd19-8e2a-499c-ad1e-683684c66a4e/console.txt" "-serial" "chardev:log" "-fw_cfg" "name=opt/com.core
os/config,file=_kola_temp/qemu-unpriv-2019-05-22-1146-19764/coreos.selinux.enforce/ba98fd19-8e2a-499c-ad1e-683684c66a4e/ignition.json" "-add-fd" "fd=3,set=1" "-drive" "if=none,id=d3,format=qcow2,file=/dev/fdset/1,
auto-read-only=off" "-device" "virtio-blk-pci,drive=d3,serial=primary-disk" "-netdev" "user,id=eth0,restrict=yes,hostfwd=tcp:127.0.0.1:0-:22" "-device" "virtio-net-pci,netdev=eth0"]
--- FAIL: coreos.selinux.enforce (26.64s)
        harness.go:480: Cluster failed starting machines: machine "ba98fd19-8e2a-499c-ad1e-683684c66a4e" failed basic checks: some systemd units failed:
crio.service loaded failed failed Open Container Initiative Daemon
2019-05-22T15:47:17Z platform: disabling auto-read-only for QEMU drives
2019-05-22T15:47:17Z platform/machine/qemu: NewMachine: ["qemu-system-x86_64" "-machine" "accel=kvm" "-cpu" "host" "-m" "1024" "-bios" "bios-256k.bin" "-smp" "1" "-uuid" "cc04c825-fb50-459b-8245-d3722d0bc229" "-di
splay" "none" "-chardev" "file,id=log,path=_kola_temp/qemu-unpriv-2019-05-22-1146-19764/rhcos.selinux.manage/cc04c825-fb50-459b-8245-d3722d0bc229/console.txt" "-serial" "chardev:log" "-fw_cfg" "name=opt/com.coreos
/config,file=_kola_temp/qemu-unpriv-2019-05-22-1146-19764/rhcos.selinux.manage/cc04c825-fb50-459b-8245-d3722d0bc229/ignition.json" "-add-fd" "fd=3,set=1" "-drive" "if=none,id=d3,format=qcow2,file=/dev/fdset/1,auto
-read-only=off" "-device" "virtio-blk-pci,drive=d3,serial=primary-disk" "-netdev" "user,id=eth0,restrict=yes,hostfwd=tcp:127.0.0.1:0-:22" "-device" "virtio-net-pci,netdev=eth0"]
--- FAIL: rhcos.selinux.manage (26.99s)
        harness.go:480: Cluster failed starting machines: machine "cc04c825-fb50-459b-8245-d3722d0bc229" failed basic checks: some systemd units failed:
crio.service loaded failed failed Open Container Initiative Daemon
2019-05-22T15:47:44Z platform: disabling auto-read-only for QEMU drives
2019-05-22T15:47:44Z platform/machine/qemu: NewMachine: ["qemu-system-x86_64" "-machine" "accel=kvm" "-cpu" "host" "-m" "1024" "-bios" "bios-256k.bin" "-smp" "1" "-uuid" "548050f2-5146-47d8-8065-7bf4f7879c7a" "-di
splay" "none" "-chardev" "file,id=log,path=_kola_temp/qemu-unpriv-2019-05-22-1146-19764/coreos.ignition.groups/548050f2-5146-47d8-8065-7bf4f7879c7a/console.txt" "-serial" "chardev:log" "-fw_cfg" "name=opt/com.core
os/config,file=_kola_temp/qemu-unpriv-2019-05-22-1146-19764/coreos.ignition.groups/548050f2-5146-47d8-8065-7bf4f7879c7a/ignition.json" "-add-fd" "fd=3,set=1" "-drive" "if=none,id=d3,format=qcow2,file=/dev/fdset/1,
auto-read-only=off" "-device" "virtio-blk-pci,drive=d3,serial=primary-disk" "-netdev" "user,id=eth0,restrict=yes,hostfwd=tcp:127.0.0.1:0-:22" "-device" "virtio-net-pci,netdev=eth0"]
--- FAIL: coreos.ignition.groups (26.09s)
        harness.go:480: Cluster failed starting machines: machine "548050f2-5146-47d8-8065-7bf4f7879c7a" failed basic checks: some systemd units failed:
crio.service loaded failed failed Open Container Initiative Daemon

Inspecting journal.txt shows what looks like the problem:

May 22 11:47:03.667239 crio[1142]: time="2019-05-22 15:47:03.666874221Z" level=fatal msg="no default routes found in "/proc/net/route" or "/proc/net/ipv6_route""
May 22 11:47:03.671370 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
May 22 11:47:03.671595 systemd[1]: crio.service: Failed with result 'exit-code'.

Not sure why crio wants a default route to start, but it would be great if we could fake out those entries to get past this.

cc: @ashcrow

arithx commented 5 years ago

Notably if you kola spawn into a qemu-unpriv RHCOS machine both /proc/net/route & /proc/net/ipv6_route do exist and restarting crio causes it to enter an active (running) state. I'd argue that's probably a race in RHCOS where crio is activating before the routes are populated.

miabbott commented 5 years ago

Good find @arithx, thanks!

A little more context in the logs shows that routing might not be fully setup in this secnario:

May 22 17:45:10 localhost.localdomain NetworkManager[1008]: <info>  [1558547110.6223] manager: startup complete
May 22 17:45:10 localhost.localdomain systemd[1]: Started Network Manager Wait Online.                                                                                                                               
May 22 17:45:10 localhost.localdomain systemd[1]: Reached target Network is Online.    
May 22 17:45:10 localhost.localdomain systemd[1]: Starting Open Container Initiative Daemon...
May 22 17:45:10 localhost.localdomain systemd[1]: Starting Generate /run/issue.d/console-login-helper-messages.issue...
May 22 17:45:10 localhost.localdomain nm-dispatcher[1088]: req:4 'connectivity-change': new request (3 scripts)
May 22 17:45:10 localhost.localdomain dhclient[1119]: bound to 10.0.2.15 -- renewal in 37152 seconds.
May 22 17:45:10 localhost.localdomain sshd[1133]: Server listening on 0.0.0.0 port 22.
May 22 17:45:10 localhost.localdomain sshd[1133]: Server listening on :: port 22.
May 22 17:45:10 localhost.localdomain systemd[1]: Started OpenSSH server daemon.
May 22 17:45:10 localhost.localdomain systemd[1]: iscsi.service: Unit cannot be reloaded because it is inactive.
May 22 17:45:10 localhost.localdomain nm-dispatcher[1088]: req:4 'connectivity-change': start running ordered scripts...
May 22 17:45:10 localhost.localdomain systemd[1]: Started Generate /run/issue.d/console-login-helper-messages.issue.
May 22 17:45:10 localhost.localdomain systemd[1]: Starting Permit User Sessions...
May 22 17:45:10 localhost.localdomain systemd[1]: Started Permit User Sessions.
May 22 17:45:10 localhost.localdomain systemd[1]: Started Serial Getty on ttyS0.
May 22 17:45:10 localhost.localdomain systemd[1]: Started Getty on tty1.
May 22 17:45:10 localhost.localdomain systemd[1]: Reached target Login Prompts.
May 22 17:45:11 localhost.localdomain crio[1139]: time="2019-05-22 17:45:11.008776919Z" level=fatal msg="no default routes found in "/proc/net/route" or "/proc/net/ipv6_route""
May 22 17:45:11 localhost.localdomain systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
May 22 17:45:11 localhost.localdomain systemd[1]: crio.service: Failed with result 'exit-code'.
May 22 17:45:11 localhost.localdomain systemd[1]: Failed to start Open Container Initiative Daemon.
May 22 17:45:11 localhost.localdomain systemd[1]: Reached target Multi-User System.
May 22 17:45:11 localhost.localdomain systemd[1]: Starting Update UTMP about System Runlevel Changes...
May 22 17:45:11 localhost.localdomain systemd[1]: Started Update UTMP about System Runlevel Changes.
May 22 17:45:11 localhost.localdomain systemd[1]: Startup finished in 1.100s (kernel) + 5.512s (initrd) + 4.018s (userspace) = 10.631s.
May 22 17:45:12 localhost.localdomain NetworkManager[1008]: <info>  [1558547112.4924] manager: NetworkManager state is now CONNECTED_SITE
May 22 17:45:12 localhost.localdomain NetworkManager[1008]: <info>  [1558547112.4935] policy: set 'Wired connection 1' (ens4) as default for IPv6 routing and DNS

Which is unfortunate since cri-o is waiting for network-online.target:

$ systemctl cat cri-o
# /usr/lib/systemd/system/crio.service
[Unit]
Description=Open Container Initiative Daemon
Documentation=https://github.com/kubernetes-sigs/cri-o
After=network-online.target

[Service]
Type=notify
EnvironmentFile=-/etc/sysconfig/crio
EnvironmentFile=-/etc/sysconfig/crio-metrics
EnvironmentFile=-/etc/sysconfig/crio-network
EnvironmentFile=-/etc/sysconfig/crio-storage
Environment=GOTRACEBACK=crash
ExecStart=/usr/bin/crio \
          $CRIO_STORAGE_OPTIONS \
          $CRIO_NETWORK_OPTIONS \
          $CRIO_METRICS_OPTIONS
ExecReload=/bin/kill -s HUP $MAINPID
TasksMax=infinity
LimitNOFILE=1048576
LimitNPROC=1048576
LimitCORE=infinity
OOMScoreAdjust=-999
TimeoutStartSec=0
Restart=on-abnormal

[Install]
WantedBy=multi-user.target

I'm going to close this and follow up with either the cri-o or RHCOS teams (or both).

arithx commented 5 years ago

The network-online.target spins indefinitely pre-systemd 241 (#999).

coreos / mantle

[RHCOS] running kola in qemu_unpriv mode causes crio.service failure to start #1004