fedora-infra / koschei

Continuous integration for Fedora packages
GNU General Public License v2.0

koschei seems to rebuild failed packages too quickly #310

Closed mizdebsk closed 4 years ago

mizdebsk commented 4 years ago

Originally reported by @nirik at https://pagure.io/koschei/issue/2

koschei is rebuilding failed packages very quickly, causing buildroots to fill up builders.

For example, on Jan 27th there were more than 600 builds of python-debtcollector. At ~350MB per buildroot, multiplied across all the similarly affected packages, the builders fill up.
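
To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It uses only the two figures quoted above and assumes the worst case, where every buildroot for the package is still on disk at the same time; in practice the buildroots are spread across builders and expire, so treat the result as an upper bound rather than a measurement.

```python
# Rough estimate built from the figures in the report above.
BUILDROOT_MB = 350    # approximate size of one buildroot
FAILED_BUILDS = 600   # lower bound on python-debtcollector builds on Jan 27th


def retained_gb(builds: int, mb_per_buildroot: int = BUILDROOT_MB) -> float:
    """Disk space in GB (1 GB = 1000 MB) if every buildroot were kept at once."""
    return builds * mb_per_buildroot / 1000


print(retained_gb(FAILED_BUILDS))  # 210.0
```

That is about 210 GB for a single package under this assumption, before counting any other package in the same state.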

This might be the set of packages that don't even build their src.rpm?

mizdebsk commented 4 years ago

This is caused by several factors combined together:

  1. rebuildSRPM task introduced in Koji - when doing builds from a pre-existing SRPM, Koji >= 1.18 spawns a rebuildSRPM task to rebuild the SRPM before the actual buildArch tasks. This is unnecessary in most cases and only wastes resources. I would like Koji to implement an option that allows Koschei to skip the rebuildSRPM task. https://pagure.io/koji/issue/1719
  2. Koschei treats failure of the rebuildSRPM task as a Koji fault, not as a problem in the package itself - Koschei points Koji to an SRPM that already exists, but Koji insists on rebuilding it and fails. Builds that end in a Koji fault are ignored (as if they never existed). https://github.com/fedora-infra/koschei/pull/74
  3. failed_buildroot_lifetime setting of kojid - since 2019, Fedora Koji has kept buildroots of failed tasks for 24 hours, much longer than it used to. https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=55b5ef7
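
Factor 2 comes down to how a failed scratch build is classified. The Python sketch below is purely illustrative and is not taken from Koschei's code: the `Subtask` type, the `classify_failure` function and the returned labels are all hypothetical names. It only shows the idea of counting a failed rebuildSRPM subtask as a package failure instead of a Koji fault.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subtask:
    """Hypothetical, simplified view of one Koji subtask."""
    method: str  # e.g. "rebuildSRPM" or "buildArch"
    state: str   # e.g. "CLOSED" or "FAILED"


# Assumption: failures in these subtask methods are blamed on the package.
PACKAGE_LEVEL_METHODS = ("rebuildSRPM", "buildArch")


def classify_failure(subtasks: List[Subtask]) -> str:
    """Return "ok", "package_failure" or "koji_fault" for a scratch build."""
    failed = [t for t in subtasks if t.state == "FAILED"]
    if not failed:
        return "ok"
    if any(t.method in PACKAGE_LEVEL_METHODS for t in failed):
        return "package_failure"
    # Anything else that failed is attributed to the build system itself.
    return "koji_fault"
```

With a classification like this, a build whose rebuildSRPM subtask failed would be recorded as an ordinary failure instead of being ignored.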

Fixing any one of the above issues should fix the whole problem. IMO the best long-term fix is number 1: add a rebuild_srpm option to BuildTask in Koji, then make Koschei submit scratch builds with rebuild_srpm=False.
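
A minimal sketch of what the rebuild_srpm=False proposal could look like on the submitting side, in Python. The helper name `scratch_build_opts` is hypothetical, and the `rebuild_srpm` key is simply the option name proposed above; whether a given Koji hub accepts it depends on that option actually being implemented there.

```python
def scratch_build_opts(rebuild_srpm: bool = False) -> dict:
    """Build the options dict for a scratch build (hypothetical helper)."""
    return {
        "scratch": True,
        # Proposed option: ask Koji not to rebuild an SRPM that already exists.
        "rebuild_srpm": rebuild_srpm,
    }


# Usage sketch, not executed here (needs the koji client library and a hub;
# hub_url, srpm_url and target are placeholders):
#   session = koji.ClientSession(hub_url)
#   session.build(srpm_url, target, scratch_build_opts())
```

Keeping the flag as a parameter makes it easy to fall back to the default rebuild behaviour for hubs that do not support the option.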

mizdebsk commented 4 years ago

#311 should also help mitigate this issue: builds will be submitted from SCM instead of from SRPM.

nirik commented 4 years ago

Sounds good. I wonder whether we could also remove (or not keep) buildroots for koschei jobs?

mizdebsk commented 4 years ago

#311 cannot be enabled in production yet (see https://github.com/fedora-infra/koschei/issues/276#issuecomment-587463533 for details), so I'm working on a different fix for this issue.

mizdebsk commented 4 years ago

Taking the aforementioned Jan 27th as an example: Koschei submitted 18502 scratch builds that day, including 661 scratch builds of python-debtcollector. All of those failed, probably all due to rebuildSRPM task failure. An example of such a scratch build: https://koji.fedoraproject.org/koji/taskinfo?taskID=41111817. I will try to reproduce the issue as a unit test.

mizdebsk commented 4 years ago

Verified in staging as follows:

Reproduced the issue in staging Koschei:

Then I deployed the fixed version and retested. A new build was submitted and failed: https://koji.stg.fedoraproject.org/koji/taskinfo?taskID=90009536 This time the build was not removed from the Koschei DB: https://koschei.stg.fedoraproject.org/build/21577

Consuming message from topic org.fedoraproject.stg.buildsys.task.state.change (message id 2c7dfc94-c8e0-416d-9d68-3044c7f8c494)
Setting build Build(id=21577, package=python-debtcollector, collection=f32, state=running, task_id=90009536) state to failed
Successfully consumed message from topic org.fedoraproject.stg.buildsys.task.state.change (message id 2c7dfc94-c8e0-416d-9d68-3044c7f8c494)

Therefore I consider the fix to be verified in staging.

mizdebsk commented 4 years ago

The fix was deployed to production.