coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

kola-upgrade test fails on secureboot #1756

Closed: jbtrystram closed this issue 2 months ago

jbtrystram commented 4 months ago

All the recent kola-upgrade runs fail on secure boot. The test starts on 34.20210427.2.0, then Zincati stages the update:

zincati[2532]: [INFO ] target release '34.20210611.2.0' selected, proceeding to stage it

But then it never reboots into the next version.

See https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/kola-upgrade/4349/display/redirect for an instance of the issue
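
To see why a staged deployment never gets finalized, a few commands on the machine under test are worth checking; these are the usual FCOS debugging commands, not output captured from this particular run:

# Is the new deployment staged but never finalized?
rpm-ostree status

# Follow Zincati's view of the update cycle
journalctl -u zincati.service -f

# Make sure the agent itself isn't wedged
systemctl status zincati.service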

jlebon commented 4 months ago

It seems quite consistent. I would try to reproduce this locally, first by running the test (and e.g. tailing the console), or simply by booting the bootimage from that starting version and letting Zincati do its thing (you'd need to apply the same workarounds that the upgrade test does, notably the GPG key workaround for 34+).
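
A rough sketch of a local reproduction, assuming a coreos-assembler working directory with the 34.20210427.2.0 QEMU image on disk; the flags and paths are illustrative and may not match exactly what the pipeline passes:

# Run the upgrade test against the old starting image with secure boot enabled
cosa kola run-upgrade \
    --qemu-image fedora-coreos-34.20210427.2.0-qemu.x86_64.qcow2 \
    --qemu-firmware uefi-secure

# Console logs for the machine under test land under the kola output directory
find tmp/ -name console.txt | xargs tail -f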

marmijo commented 3 months ago

We hit this issue today with the testing and stable releases:

marmijo commented 3 months ago

It's interesting that we only see this failure on the upgrade from F34 -> F40, and only on testing and stable. The test actually excludes running with secure boot for these same upgrade versions on the next stream: https://github.com/coreos/fedora-coreos-pipeline/blob/main/jobs/kola-upgrade.Jenkinsfile#L203-L205

Is this something we should implement for the other two streams, or was it specific to an issue in next in the past?

dustymabe commented 2 months ago

> It's interesting that we only see this failure on the upgrade from F34 -> F40, and only on testing and stable. The test actually excludes running with secure boot for these same upgrade versions on the next stream: https://github.com/coreos/fedora-coreos-pipeline/blob/main/jobs/kola-upgrade.Jenkinsfile#L203-L205

Since we start building next on the next major version of Fedora sooner than any other production stream, the earliest F34 next builds probably shipped the older software that's now denylisted in the DBX. That's why the earliest F34 next versions won't work under secure boot, while the earliest F34 stable and testing versions will.
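
As a sanity check on a booted instance, one could confirm that secure boot is actually enforced and that the firmware exposes a DBX revocation database; these are generic EFI tools, not commands recorded in this issue:

# Is secure boot enforced on the machine under test?
mokutil --sb-state

# The DBX revocation database is exposed as an EFI variable
ls /sys/firmware/efi/efivars/ | grep -i dbx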

> Is this something we should implement for the other two streams, or was it specific to an issue in next in the past?

No, we shouldn't implement it for the other two streams unless new information comes to light that tells us we need to.

dustymabe commented 2 months ago

I've tried to dig into this issue and I can't find anything that points to a tangible reason why it's failing fairly consistently. A few observations:

dustymabe commented 2 months ago

> from the logs it's almost like the machine under test just hangs at some point in the process, but this is just a hunch

Here is some more evidence to support this 👆

I grabbed a shell into a cosa pod running these tests and eventually ended up with several defunct qemu processes:

cosa         655     463 15 18:58 ?        00:01:27 [qemu-system-x86] <defunct>
cosa         670     532 15 18:58 ?        00:01:27 [qemu-system-x86] <defunct>
cosa         675     592 15 18:58 ?        00:01:27 [qemu-system-x86] <defunct>
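
A listing like the one above can be produced with something along these lines (the exact invocation wasn't recorded here):

# Look for qemu processes that exited but were never reaped by their parent
ps -ef | grep qemu-system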

dustymabe commented 2 months ago

It turns out we weren't giving the pod enough memory to account for the overhead of the kola and qemu processes themselves.

https://github.com/coreos/fedora-coreos-pipeline/pull/1026 should handle this.
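
For what it's worth, one way to confirm this kind of memory pressure from inside the pod is to compare the cgroup memory limit with actual usage while the tests run; the paths below assume cgroup v2 and may need adjusting:

# Memory limit imposed on the pod's cgroup
cat /sys/fs/cgroup/memory.max

# Current usage; if this approaches the limit, qemu processes are likely
# being OOM-killed, leaving the defunct entries seen above
cat /sys/fs/cgroup/memory.current

# Kernel OOM-kill events, if dmesg is readable inside the pod
dmesg | grep -i oom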

dustymabe commented 2 months ago

This should be fixed with https://github.com/coreos/fedora-coreos-pipeline/pull/1026