Closed by gursewak1997 1 year ago
Since we implemented https://github.com/coreos/fedora-coreos-pipeline/pull/557 we haven't seen this in the main pipeline AFAICT, but we have seen it twice in the bump-lockfile job:
Saw this issue in: https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/blue/organizations/jenkins/build/detail/build/900/pipeline/346. In this error, the issue was not in the UEFI branch.
That one was also in the `iso-offline-install` step.

We saw this `iso-offline-install` failure again here today.
While running releases for Fedora, I ran into this error here: https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/build/21/
I tried to track this down. It appears to me that `run-media-iso.mount` just never gets started, so everything else stalls and the boot never continues. The tests then time out after 10 minutes.
I'm really at a loss for what's going on.
Background context: this issue is hard to reproduce, but it pops up often enough to bother us. I can't reproduce it locally, but I can if I run the test in a tight loop in our build infra.
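The reproduction loop is nothing fancy; a sketch of the shape of it is below. `run_test` is a hypothetical stand-in for the real test invocation (the actual kola command used in the build infra is not shown here); in this sketch it simulates a failure on the third attempt just to demonstrate the loop.

```shell
#!/bin/bash
# Hypothetical reproduction loop: re-run the flaky test until it fails.
# run_test is a placeholder for the real test command; here it simulates
# a pass on the first two attempts and a failure on the third.
attempt=0
run_test() {
    [ "${attempt}" -lt 2 ]
}

while run_test; do
    attempt=$((attempt + 1))
    echo "attempt ${attempt}: passed"
done
echo "failed on attempt $((attempt + 1))"
```

In practice you would substitute the real test command for `run_test` and let the loop run until the hang shows up.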
OK. After all my testing I think this is some kind of systemd bug. I instrumented kola AND the built FCOS images so that I could get actual journal logs from the initrd (see https://github.com/dustymabe/coreos-assembler/commit/fb381182cd359b6ffc464ce85f573dd6b40193d8, a commit that explains how to do it). I then started comparing the journals from good and bad runs.
What I found is that, of the relevant units from the live-generator, on a bad run `sysroot-xfs-ephemeral-mkfs.service` and `run-ephemeral.mount` run, but none of the other ones do.
One thing to note is that in a bad run `run-ephemeral.mount` seems to run a lot earlier than it does in a good run. I tried to force this behavior to see if I could reproduce the issue artificially and couldn't, but maybe it is relevant in some way.
In order to gain some more clarity about what was blocking things from continuing I decided to write a unit that would print out some information after a period of sleep in the initrd:
```diff
diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh
index 6a91048d..988177ad 100644
--- a/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/module-setup.sh
@@ -52,4 +52,7 @@ install() {
     install_and_enable_unit "coreos-livepxe-persist-osmet.service" \
         "default.target"
+
+    install_and_enable_unit "sleep-emergency.service" \
+        "default.target"
 }
diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/sleep-emergency.service b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/sleep-emergency.service
new file mode 100644
index 00000000..06b50f11
--- /dev/null
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-live/sleep-emergency.service
@@ -0,0 +1,10 @@
+[Unit]
+OnFailure=emergency.target
+OnFailureJobMode=isolate
+[Service]
+Type=simple
+StandardOutput=journal
+StandardError=journal
+ExecStart=bash -c "sleep 30; systemctl list-jobs --after --before; false"
+[Install]
+RequiredBy=basic.target
```
So this unit will wait 30 seconds, print some information, and then exit with failure, triggering `emergency.target`. On a successful boot this unit gets killed during the sleep; on a stuck boot it kicks in.
I managed to get a failure to occur with this in place. The journal didn't capture all of the printed output, so I'm attaching console-emergency.txt.
The interesting piece is here:
```
[ 2.750638] ignition[587]: disks: disks passed
[ 2.750946] systemd[1]: Reached target initrd-root-device.target - Initrd Root Device.
[ 2.751498] ignition[587]: Ignition finished successfully
[ 2.751884] systemd[1]: ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions was skipped because of a failed condition check (ConditionPathIsDirectory=/run/ignition-ostree-transposefs).
[ 32.654964] systemd[1]: Mounting run-media-iso.mount - /run/media/iso...
Mounting run-media-iso.mount - /run/media/iso...
[ 32.659865] bash[591]: JOB UNIT TYPE STATE
[ 32.660557] bash[591]: 68 ignition-ostree-growfs.service start waiting
[ 32.661221] bash[591]: └─ waiting for job 52 (initrd-root-fs.target/start) - -
```
You can see that the boot stopped progressing at 2 seconds in, but then at 32s (when the sleep ended) `run-media-iso.mount` suddenly started to progress (at that point it was too late, though, because we were already headed to `emergency.target`).
You can also see from the cleaned-up `systemctl list-jobs` output (systemctl-list-jobs.txt) that `run-media-iso.mount` appears to be the only unit without something listed that it's blocked by:
```
[ 32.823915] bash[591]: 41 run-media-iso.mount start running
[ 32.824586] bash[591]: └─ waiting for job 40 (coreos-liveiso-persist-osmet.service/start) - -
[ 32.825234] bash[591]: └─ waiting for job 52 (initrd-root-fs.target/start) - -
[ 32.825894] bash[591]: └─ waiting for job 56 (sysroot.mount/start) - -
```
NOTE: I find the `systemctl list-jobs --after --before` output confusing. The items listed as "waiting for" are the units that are blocked by `run-media-iso.mount`.
Ultimately it looks like there is some internal state somewhere that knows `run-media-iso.mount` needs to run, but nothing actually telling it to run. Without the unit I added (or some other trigger), the boot hangs and the tests time out after 10 minutes.
I can't reproduce this with systemd debug logging enabled, which makes sense because there is probably something in there that performs the necessary kick to get the system out of this hung state.
I'm honestly not really sure how to go about peeling back the complexity here to open a systemd issue upstream. Ultimately I think I'm going to drop in a workaround and call it a day.
It appears that in order to get the system out of this state you simply need to do something (anything). So we'll just add another unit that waits 10 seconds and then prints a message. We'll also add some instrumentation in our pipeline to let us know when we hit this.
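A rough sketch of what such a "kick" unit could look like is below. The unit name, timing, and message here are assumptions for illustration; the real workaround is whatever landed in fedora-coreos-config.

```ini
# Hypothetical sketch only; the actual unit name, wording, and install
# target are whatever the real workaround in fedora-coreos-config used.
[Unit]
Description=Workaround to kick a stalled initrd job queue

[Service]
Type=oneshot
# Doing anything at all after a short delay is enough to unstick systemd.
ExecStart=bash -c "sleep 10; echo 'liveiso workaround: poking systemd'"

[Install]
RequiredBy=basic.target
```

It would be installed and enabled from the dracut module the same way as the debug unit above.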
And here's the code to notify us when this workaround pops up: https://github.com/coreos/fedora-coreos-pipeline/pull/624
If it happens often enough and we get annoyed with it we can just delete the warning in the future.
And we see the warning is working: https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/build/235/ notified us that the workaround was detected, and the test passed as it should have.
We haven't seen this warning in a good while so we don't think the workaround is needed any longer. Let's drop it and see if anything breaks: https://github.com/coreos/fedora-coreos-config/pull/2230
Closing this since https://github.com/coreos/fedora-coreos-config/pull/2230 removed the code for this. Will re-open if we re-encounter any issues.
iso-live-login times out in the UEFI branch. The test has been failing inconsistently recently:
From the logs, it seems like the console output is incomplete. The final message is:

```
[ 2.966523] ignition[594]: Ignition finished successfully
```

Console log: console.txt