It seems quite consistent. I would try to reproduce this locally, first by running the test (and e.g. tailing the console), or simply by booting the bootimage from that starting version and letting Zincati do its thing (you'd need to apply the same workarounds that the upgrade test does; notably for 34+, the GPG key workaround).
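For anyone trying that, here's a minimal sketch of the reproduction, assuming a coreos-assembler environment; the qcow2 filename is a placeholder, and the GPG key workaround itself is whatever the upgrade test applies:

```
# Boot the starting-version bootimage under qemu via kola
# (filename is illustrative; use the actual 34.20210427.2.0 image)
kola qemuexec --qemu-image fedora-coreos-34.20210427.2.0-qemu.x86_64.qcow2

# Then, on the booted machine, watch Zincati drive the update
journalctl -f -u zincati
```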
We hit this issue today with the `testing` and `stable` releases:
It's interesting that we only see this failure on the upgrade from F34 -> F40, and only on `testing` and `stable`. The test actually excludes running secure boot for these same upgrade versions on the `next` stream: https://github.com/coreos/fedora-coreos-pipeline/blob/main/jobs/kola-upgrade.Jenkinsfile#L203-L205

Is this something we should implement for the other two streams, or was it specific to an issue in `next` in the past?
> It's interesting that we only see this failure on the upgrade from F34 -> F40 and only on `testing` and `stable`. The test actually excludes running secure boot for these same upgrade versions on the `next` stream: https://github.com/coreos/fedora-coreos-pipeline/blob/main/jobs/kola-upgrade.Jenkinsfile#L203-L205
Since we start building `next` on the next major version of Fedora sooner than any other production stream, the RPM versions in the first F34 `next` builds probably include the older software that's now denylisted in the DBX. That's why the earliest F34 `next` versions won't work, but the earliest F34 `stable` and `testing` versions will.
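If one wanted to check that theory on a booted machine, the DBX contents can be listed with standard Secure Boot tooling; this is a sketch, not something from the thread:

```
# List the revocation entries in the Secure Boot DBX
dbxtool --list

# Or dump the raw dbx EFI variable
mokutil --dbx
```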
> Is this something we should implement for the other two streams, or was it specific to an issue in `next` in the past?
No, we shouldn't implement it for the other two streams unless new information comes to light that tells us we need to.
I've tried to dig into this issue and I can't find a tangible reason why it's failing fairly consistently. A few observations:
- from the logs it looks almost as if the machine under test just hangs at some point in the process, but this is just a hunch
Here is some more evidence to support the hunch above 👆:
I grabbed a shell into a cosa pod running these tests and eventually ended up with several defunct `qemu` processes:

```
cosa 655 463 15 18:58 ? 00:01:27 [qemu-system-x86] <defunct>
cosa 670 532 15 18:58 ? 00:01:27 [qemu-system-x86] <defunct>
cosa 675 592 15 18:58 ? 00:01:27 [qemu-system-x86] <defunct>
```
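(Aside: a quick way to spot zombies like these from inside the pod, using only standard ps and awk:)

```
# Show the header plus any processes in the Z (zombie/defunct) state
ps -eo pid,ppid,stat,comm | awk 'NR == 1 || $3 ~ /^Z/'
```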
It turns out we weren't giving the pod enough memory to account for the overhead of the kola and qemu processes themselves.
https://github.com/coreos/fedora-coreos-pipeline/pull/1026 should handle this.
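One hedged way to confirm memory pressure was the culprit is to look for OOM killer activity from inside the pod; these commands are illustrative, and the cgroup path depends on whether the node uses cgroup v1 or v2:

```
# Kernel log: did the OOM killer reap qemu?
dmesg | grep -iE 'out of memory|oom'

# cgroup v2 memory events for the pod's cgroup
cat /sys/fs/cgroup/memory.events
```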
This should be fixed by https://github.com/coreos/fedora-coreos-pipeline/pull/1026.
All the recent kola-upgrade runs fail on secure boot. The test starts on 34.20210427.2.0, then Zincati stages the update:

But then it never reboots into the next version.

See https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/kola-upgrade/4349/display/redirect for an instance of the issue.
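When it wedges like that, the staged deployment and Zincati's logs on the machine under test are the obvious things to inspect; these are standard commands, shown here as a sketch:

```
# Did the new deployment actually get staged?
rpm-ostree status

# What did Zincati log around finalization/reboot?
systemctl status zincati
journalctl -u zincati --no-pager
```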