NixOS / hydra

Hydra, the Nix-based continuous build system
http://nixos.org/hydra
GNU General Public License v3.0
1.1k stars 291 forks source link

Make "timed out" and "log limit exceeded" builds aborted #1369

Open Ma27 opened 3 months ago

Ma27 commented 3 months ago

In 73694087a088ed4481b4ab268a03351b1bcaac3c I gave builds that failed because of a timeout or exceeded log limit a stop sign and I stand by that reasoning: with that it's possible to distinguish between actual build failures and rather transient things such as timeouts.

Back then I considered it a feature that these are shown in a different tab, but I don't think that's a good idea anymore. When using a jobset to e.g. track the regressions from a mass rebuild (like a compiler or gcc update), "Newly failed builds" should exclusively display regressions (and flaky builds of course, not much I can do about that).

Also, when a bunch of builds fail in such a jobset because of e.g. a broken connection to a builder that results in a timeout, I want to be able to restart them all w/o rebuilding actual regressions.

To make it clear that we not only have "Aborted" builds in the tab, I renamed the label to "Aborted / Timed out".

cc @NixOS/infra who will probably be the most affected by that for opinions.

mweinelt commented 3 months ago

Deployed on hydra.nixos.org.

vcunat commented 3 months ago

I find it a little weird that "restart all aborted jobs" now won't restart this whole tab but only a subset of it (won't restart the timed out jobs).

Ma27 commented 3 months ago

OK I only tested it until the point of where a certain job was displayed in a jobset eval. You're right, that should be part of that PR as well.

We should probably factor the logic on when something is aborted out to get rid of this duplication (in the restart, the jobset eval template and the jobset diff).

vcunat commented 3 months ago

Also for hydra.nixos.org there's a practical disadvantage that timing out jobs can't be filtered out when doing diffs, i.e. instead of in "still failing" tab they'll be in "aborted" tab. I mean the timeouts that aren't transient but just a bad build, though yeah – we'd better fix those somehow.

Ma27 commented 3 months ago

@vcunat part of the motivation was that I don't want to see timed out builds in "newly failing" (and subsequently I don't want to restart actual regressions when restarting timed out builds) and aborted builds seem more sensible here.

Perhaps we want to add another tab? Doesn't seem too complicated I guess, the categorization happens entirely outside of the DB AFAIK.

vcunat commented 2 months ago

Note: "log limit exceeded" is usually not transient, in my experience. Restarting such builds will typically produce too much logs again, so I'm not sure about the move in that case. I consider it similar to "output size exceeded".

Though yes, it's somewhat impure – at least in the sense that these limits are configurable without changing the derivation hash.