bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

occasional build failures after extracting subpackages #3926

Closed by bcressey 4 months ago

bcressey commented 4 months ago

Since #3891 we've observed occasional failures in nightly CI runs that test the upgrade/downgrade functionality. #3913 included a fix I suggested that didn't actually help, so I spent more time thinking about the cause.

As part of a long-ago fix in #1291, the build system emits marker files for every RPM that comes out of a package build. Prior to the static-pods extraction, we would end up with marker files like this:

build/state/aarch64/packages/os/bottlerocket-static-pods-debuginfo-0.0-0.aarch64.rpm.buildsys_marker
build/state/aarch64/packages/os/bottlerocket-static-pods-0.0-0.aarch64.rpm.buildsys_marker

Those marker files act as an instruction to buildsys to remove the corresponding files from build/rpms whenever the package - in this case, os - is rebuilt.
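
As a rough illustration of that cleanup rule, here is a minimal sketch in bash. It is not buildsys's actual code: the `clean_package_outputs` helper is hypothetical, and it assumes that a marker is named after its RPM with a `.buildsys_marker` suffix (as in the listing above), that RPMs sit directly in one output directory, and that the marker is removed along with the RPM.

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical helper: remove every RPM named by a marker in one package's
# state directory. Assumes markers are named "<rpm file name>.buildsys_marker"
# and that the RPMs sit directly in the given output directory.
clean_package_outputs() {
  local state_dir="$1" rpms_dir="$2" marker rpm
  for marker in "$state_dir"/*.buildsys_marker; do
    [ -e "$marker" ] || continue      # the glob matched nothing
    rpm="$(basename "$marker" .buildsys_marker)"
    rm -f "$rpms_dir/$rpm"            # drop the output from the previous build
    rm -f "$marker"                   # assumption: the marker is consumed too
  done
}

# Usage: build a throwaway tree and clean the outputs recorded for "os".
work="$(mktemp -d)"
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/state/aarch64/packages/os" "$work/rpms"
rpm_name="bottlerocket-static-pods-0.0-0.aarch64.rpm"
touch "$work/rpms/$rpm_name" "$work/rpms/unrelated-1.0-1.aarch64.rpm"
touch "$work/state/aarch64/packages/os/$rpm_name.buildsys_marker"

clean_package_outputs "$work/state/aarch64/packages/os" "$work/rpms"
ls "$work/rpms"   # only the unrelated RPM is left
```

Note that the helper only looks at the file name recorded in the marker. It has no way to tell which package produced the RPM that currently carries that name.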

Following the refactor, we now emit these files for static-pods:

build/state/aarch64/packages/static-pods/bottlerocket-static-pods-debuginfo-0.0-0.aarch64.rpm.buildsys_marker
build/state/aarch64/packages/static-pods/bottlerocket-static-pods-0.0-0.aarch64.rpm.buildsys_marker

The bug happens as a result of this sequence of events:

  1. In a local workspace, check out a commit from before the package was factored out.
  2. Run the build, which creates markers under os for the RPMs.
  3. Check out a commit from after the package was factored out.
  4. Run the build again.
  5. If static-pods finishes building before os starts building, then when os starts, buildsys will remove the RPMs that the static-pods package generated, because os still has marker files for those RPMs.
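
The sequence can be replayed without a real build. The sketch below simulates it in a temporary directory. `emit_output` and `clean_outputs` are hypothetical stand-ins for what buildsys does before and after a package build, under the same assumptions about marker naming and directory layout as the listings above.

```bash
#!/bin/bash
set -euo pipefail

work="$(mktemp -d)"
trap 'rm -rf "$work"' EXIT
state="$work/build/state/aarch64/packages"
rpms="$work/build/rpms"
mkdir -p "$state/os" "$state/static-pods" "$rpms"
rpm_name="bottlerocket-static-pods-0.0-0.aarch64.rpm"

# Hypothetical stand-ins for buildsys behavior, not its real interface.
emit_output() {   # emit_output <package> <rpm>: write the RPM and its marker
  touch "$rpms/$2" "$state/$1/$2.buildsys_marker"
}
clean_outputs() { # clean_outputs <package>: remove RPMs named by its markers
  local marker
  for marker in "$state/$1"/*.buildsys_marker; do
    [ -e "$marker" ] || continue
    rm -f "$rpms/$(basename "$marker" .buildsys_marker)" "$marker"
  done
}

# Steps 1-2: the old commit builds static-pods as a subpackage of os.
clean_outputs os
emit_output os "$rpm_name"

# Steps 3-4: the new commit builds static-pods as its own package first.
clean_outputs static-pods
emit_output static-pods "$rpm_name"

# Step 5: os starts afterwards and honors its stale marker.
clean_outputs os

if [ -e "$rpms/$rpm_name" ]; then
  result="present"
else
  result="missing"
fi
echo "static-pods RPM is $result"
```

If the last two phases are swapped so that os is cleaned before static-pods emits its output, the RPM survives. That ordering dependence is why the failure only shows up occasionally.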

The following script can reproduce the bug fairly reliably, by limiting the max parallel builds to one, which makes it more likely to hit the problematic sequence:

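#!/bin/bash
# Start from a clean build directory.
cargo make clean

export BUILDSYS_VARIANT=aws-k8s-1.23

# Build a commit from before the extraction; this leaves markers under os.
git checkout upstream/1.19.x
cargo make

# Build a commit from after the extraction, one package at a time.
git checkout develop
cargo make -e BUILDSYS_JOBS=1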

The fix is to change the version of any package that previously existed as a subpackage. RPM file names include the version, so marker files left over from an older build of the former parent package no longer match any output of the new package, and buildsys has nothing to remove.
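
To see why a version change is enough, compare the names directly. This sketch uses a hypothetical `collides` helper, and the bumped version `0.1` is an illustrative value only, not necessarily the one chosen in the actual fix.

```bash
#!/bin/bash
set -euo pipefail

# Print "yes" if any stale marker names one of the new outputs, else "no".
# Usage: collides "<marker names, space separated>" "<rpm names, space separated>"
collides() {
  local marker rpm
  for marker in $1; do
    for rpm in $2; do
      if [ "${marker%.buildsys_marker}" = "$rpm" ]; then
        echo yes
        return 0
      fi
    done
  done
  echo no
}

# Marker left under os by a build from before the extraction.
stale="bottlerocket-static-pods-0.0-0.aarch64.rpm.buildsys_marker"
same_version="bottlerocket-static-pods-0.0-0.aarch64.rpm"
bumped_version="bottlerocket-static-pods-0.1-0.aarch64.rpm"   # hypothetical

before="$(collides "$stale" "$same_version")"
after="$(collides "$stale" "$bumped_version")"
echo "same version collides: $before"
echo "bumped version collides: $after"
```

The stale marker is still processed when os is rebuilt, but the file it names no longer exists, so the removal is a no-op.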

bcressey commented 4 months ago

Fixed in #3927, but need to make sure it doesn't come back in #3700.