Closed by gursewak1997 1 year ago
Since we implemented https://github.com/coreos/fedora-coreos-pipeline/pull/557 we haven't seen this in the main pipeline AFAICT, but we have seen it twice in the bump-lockfile job:
Saw this issue in: https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/blue/organizations/jenkins/build/detail/build/900/pipeline/346. In this error, the issue was not in the UEFI branch.
That one was also in the `iso-offline-install` step.

We saw this `iso-offline-install` failure again here today.
While running releases for Fedora, I ran into this error here: https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/build/21/
I tried to track this down. It appears to me that `run-media-iso.mount` just never gets started, so everything else stalls and the boot never continues. The tests then time out after 10 minutes.
I'm really at a loss for what's going on.
Background context: this issue is hard to reproduce, but it pops up often enough to bother us. I can't reproduce it locally, but I can if I run the test in a tight loop in our build infra.
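The reproduction loop is nothing fancy; a sketch of the shape of it is below. `run_test` is a hypothetical stand-in for the real test invocation (the actual kola command used in the build infra is not shown here); in this sketch it simulates a failure on the third attempt just to demonstrate the loop.

```shell
#!/bin/bash
# Hypothetical reproduction loop: re-run the flaky test until it fails.
# run_test is a placeholder for the real test command; here it simulates
# a pass on the first two attempts and a failure on the third.
attempt=0
run_test() {
    [ "${attempt}" -lt 2 ]
}

while run_test; do
    attempt=$((attempt + 1))
    echo "attempt ${attempt}: passed"
done
echo "failed on attempt $((attempt + 1))"
```

In practice you would substitute the real test command for `run_test` and let the loop run until the hang shows up.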
OK. After all my testing I think this is some kind of systemd bug. I instrumented kola AND the built FCOS images so that I could get actual journal logs from the initrd (see https://github.com/dustymabe/coreos-assembler/commit/fb381182cd359b6ffc464ce85f573dd6b40193d8, a commit that explains how to do it). I then started comparing the journals from good and bad runs.
What I found is that, of the relevant units from the live-generator, on a bad run `sysroot-xfs-ephemeral-mkfs.service` and `run-ephemeral.mount` run, but none of the other ones do.
One thing to note is that in a bad run `run-ephemeral.mount` seems to run a lot earlier than it does in a good run. I tried to force this behavior to see if I could reproduce the issue artificially and couldn't, but maybe it is relevant in some way.
In order to gain some more clarity about what was blocking things from continuing I decided to write a unit that would print out some information after a period of sleep in the initrd:
```diff
diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh
index 6a91048d..988177ad 100644
--- a/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh
@@ -52,4 +52,7 @@ install() {
     install_and_enable_unit "coreos-livepxe-persist-osmet.service" \
         "default.target"
+
+    install_and_enable_unit "sleep-emergency.service" \
+        "default.target"
 }
diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/sleep-emergency.service b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/sleep-emergency.service
new file mode 100644
index 00000000..06b50f11
--- /dev/null
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/sleep-emergency.service
@@ -0,0 +1,10 @@
+[Unit]
+OnFailure=emergency.target
+OnFailureJobMode=isolate
+[Service]
+Type=simple
+StandardOutput=journal
+StandardError=journal
+ExecStart=bash -c "sleep 30; systemctl list-jobs --after --before; false"
+[Install]
+RequiredBy=basic.target
```
So this unit will wait 30 seconds, print some information, and then exit with failure, triggering `emergency.target`. On a successful boot this unit gets killed during the sleep; on a stuck boot it kicks in.
I managed to get a failure to occur with this in place. The journal didn't capture all of the printed output, so I'm attaching console-emergency.txt.
The interesting piece is here:
```
[ 2.750638] ignition[587]: disks: disks passed
[ 2.750946] systemd[1]: Reached target initrd-root-device.target - Initrd Root Device.
[ 2.751498] ignition[587]: Ignition finished successfully
[ 2.751884] systemd[1]: ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions was skipped because of a failed condition check (ConditionPathIsDirectory=/run/ignition-ostree-transposefs).
[ 32.654964] systemd[1]: Mounting run-media-iso.mount - /run/media/iso...
Mounting run-media-iso.mount - /run/media/iso...
[ 32.659865] bash[591]: JOB UNIT TYPE STATE
[ 32.660557] bash[591]: 68 ignition-ostree-growfs.service start waiting
[ 32.661221] bash[591]: └─ waiting for job 52 (initrd-root-fs.target/start) - -
```
You can see that the boot stopped progressing at 2 seconds in, but then at 32s (when the sleep ended) `run-media-iso.mount` suddenly started to progress (at that point it was too late, though, because we were already headed to `emergency.target`).
You can also see from the cleaned-up `systemctl list-jobs` output (systemctl-list-jobs.txt) that `run-media-iso.mount` appears to be the only unit without something listed that it's blocked by:
```
[ 32.823915] bash[591]: 41 run-media-iso.mount start running
[ 32.824586] bash[591]: └─ waiting for job 40 (coreos-liveiso-persist-osmet.service/start) - -
[ 32.825234] bash[591]: └─ waiting for job 52 (initrd-root-fs.target/start) - -
[ 32.825894] bash[591]: └─ waiting for job 56 (sysroot.mount/start) - -
```
NOTE: I find the `systemctl list-jobs --after --before` output confusing. The items listed as "waiting for" are the units that are blocked by `run-media-iso.mount`.
Ultimately it looks like there is some internal state somewhere that knows `run-media-iso.mount` needs to run, but nothing actually telling it to run. Without the unit I added (or some other trigger), the boot hangs and the tests time out after 10 minutes.
I can't reproduce this with systemd debug logging enabled, which makes sense because there is probably something in there that performs the necessary kick to get the system out of this hung state.
I'm honestly not really sure how to go about peeling back the complexity here to open a systemd issue upstream. Ultimately I think I'm going to drop in a workaround and call it a day.
It appears that in order to get the system out of this state you simply need to do something (anything). So we'll just add another unit that waits 10 seconds and then prints a message. We'll also add some instrumentation in our pipeline to let us know when we hit this.
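A rough sketch of what such a "kick" unit could look like is below. The unit name, timing, and message here are assumptions for illustration; the real workaround is whatever landed in fedora-coreos-config.

```ini
# Hypothetical sketch only; the actual unit name, wording, and install
# target are whatever the real workaround in fedora-coreos-config used.
[Unit]
Description=Workaround to kick a stalled initrd job queue

[Service]
Type=oneshot
# Doing anything at all after a short delay is enough to unstick systemd.
ExecStart=bash -c "sleep 10; echo 'liveiso workaround: poking systemd'"

[Install]
RequiredBy=basic.target
```

It would be installed and enabled from the dracut module the same way as the debug unit above.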
And here's the code to notify us when this workaround pops up: https://github.com/coreos/fedora-coreos-pipeline/pull/624
If it happens often enough and we get annoyed with it we can just delete the warning in the future.
And we see the warning is working: https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/build/235/ notified us that the workaround was detected, and the test passed as it should have.
We haven't seen this warning in a good while so we don't think the workaround is needed any longer. Let's drop it and see if anything breaks: https://github.com/coreos/fedora-coreos-config/pull/2230
Closing this since https://github.com/coreos/fedora-coreos-config/pull/2230 removed the code for this. Will re-open if we re-encounter any issues.
iso-live-login times out in the UEFI branch. The test has been failing inconsistently recently:
From the logs, it seems like the console output is incomplete. The final message is:

```
[ 2.966523] ignition[594]: Ignition finished successfully
```

Console log: console.txt