NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License

ZERO Hydra Failures 22.05 #172160

Closed dasJ closed 2 years ago

dasJ commented 2 years ago

Mission

Every time we branch off a release, we stabilize the release branch. Our goal here is to have as few jobs as possible failing on the trunk/master jobsets. We call this effort "Zero Hydra Failure". I'd like to highlight that, while it's great to focus on zero as our goal, it's essential that all deliverables that worked in the previous release work here as well.

Please note the changes included in RFC 85.

Most significantly, branch off will occur on 2022 May 22; prior to that date, ZHF will be conducted on master; after that date, ZHF will be conducted on the release channel using a backport workflow similar to previous ZHFs.

Jobsets

trunk Jobset (includes linux, darwin, and aarch64-linux builds)

nixos/combined Jobset (includes many nixos tests)

How to help (textual)

  1. Select an evaluation of the trunk jobset.

  2. Find a failed job ❌️. You can use the filter field to scope packages to your platform, or search for packages that are relevant to you. Note: you can filter by architecture by putting it in the filter, e.g.: https://hydra.nixos.org/eval/1719540?filter=x86_64-linux&compare=1719463&full=#tabs-still-fail

  3. Search to see whether a PR is already open for the package. If there is one, please help review it.

  4. If there is no open PR, troubleshoot why it's failing and fix it.

  5. Create a Pull Request with the fix targeting master and wait for it to be merged. If your PR causes around 500+ rebuilds, it's preferred to target staging to avoid compute and storage churn. If your PR is fixing Haskell packages, target the haskell-updates branch instead.

  6. (after 2022 May 22) Please follow backporting steps and target the release-22.05 branch if the original PR landed in master or staging-22.05 if the PR landed in staging. Be sure to do git cherry-pick -x <rev> on the commits that landed in unstable. @jonringer created a video covering the backport process.
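The filter link from step 2 can also be assembled programmatically. This is a minimal sketch, assuming the URL shape of the example link in step 2 (the `filter`, `compare`, and `full` query parameters and the `#tabs-still-fail` fragment); the function name `hydraEvalUrl` is hypothetical and not part of Hydra.

```javascript
// Hypothetical helper: builds a Hydra eval comparison link shaped like
// the example in step 2. The parameter names are copied from that link.
function hydraEvalUrl(evalId, filter, compareId) {
  const url = new URL(`https://hydra.nixos.org/eval/${evalId}`);
  url.searchParams.set("filter", filter);              // substring to match jobs
  url.searchParams.set("compare", String(compareId));  // eval to compare against
  url.searchParams.set("full", "");                    // present but empty in the example
  url.hash = "tabs-still-fail";                        // jump to the "still failing" tab
  return url.toString();
}

// Reproduces the example link from step 2.
console.log(hydraEvalUrl(1719540, "x86_64-linux", 1719463));
```

Swapping the `filter` argument for another platform string or a package name gives the corresponding view, assuming Hydra keeps this URL layout.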

Always reference this issue in the body of your PR:

ZHF: #172160

Please ping @NixOS/nixos-release-managers on the PR and add the 0.kind: build failure label to the pull request. If you're unable to because you're not a member of the NixOS org, please ping @dasJ, @tomberek, @jonringer, or @Mic92.

How can I easily check packages that I maintain?

I have created an experimental website that automatically crawls Hydra and lists packages by maintainer and lists the most important dependencies (failing packages with the most dependants). You can reach it here: https://zh.fail

If you prefer the command-line way, you can also check failing packages that you maintain by running:

# from root of nixpkgs
nix-build maintainers/scripts/build.nix --argstr maintainer <name>

New to nixpkgs?

Packages that don't get fixed

The remaining packages will be marked as broken before the release (on the failing platforms). You can do this like:

meta = {
  # ref to issue/explanation
  # `true` is for everything
  broken = stdenv.isDarwin; 
};

Closing

This is a great way to help NixOS, and it is a great time for new contributors to start their nixpkgs adventure. :partying_face:

As with the feature freeze issue, please keep discussion here to a minimum so you don't ping all maintainers (although relevant comments can of course be added here if they are directly ZHF-related) and ping me or the release managers team in the respective issues.

cc @NixOS/nixpkgs-committers @NixOS/nixpkgs-maintainers @NixOS/release-engineers

Related Issues

nixos-discourse commented 2 years ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/zero-hydra-failures-22-05/19051/1

raboof commented 2 years ago

Perhaps we should tag PR's that fix hydra failures with 0.kind: build failure to encourage reviewing? https://github.com/NixOS/nixpkgs/pulls?q=is%3Aopen+is%3Apr+label%3A%220.kind%3A+build+failure%22

Edit by @dasJ: I added the instruction to the issue description.

GuillaumeDesforges commented 2 years ago

Some packages I maintain have on Hydra the errors

OSError: Too many open files

But they build ok on nixpkgs master locally.

Example: https://hydra.nixos.org/build/175654425/nixlog/1/tail

Not sure of what I can do on my end.

vcunat commented 2 years ago

I restarted some, but that scipy build failed many times in a row, so restarting doesn't seem to make sense there. I'd suggest trying to skip the tests that cause such problems. EDIT: https://github.com/NixOS/nixpkgs/issues/170143

sternenseemann commented 2 years ago

For Haskell, please remember to target any PRs to the haskell-updates branch! Edit by @dasJ: I added this hint to the issue description.

Since we've already marked (most) failures as broken, you need to check manually if your favorite package still works, instead of looking at failed builds on Hydra.

Additionally, here is a list of more prominent problems (of Haskell packages exposed via top level pkgs) to look into. Note that some of these are unmaintained and probably not worth fixing / should be removed in the long run.

KarlJoad commented 2 years ago

Most Octave packages are broken because of a change in the way Octave handles its packages. #168943 has further discussion of this issue.

risicle commented 2 years ago

https://zh.fail/ is cool but I think we may be needing a logarithmic y-axis before long...

neilmayhew commented 2 years ago

I've created an upstream PR for jl.

Mathnerd314 commented 2 years ago

a logarithmic y-axis

That won't work, when we get to 0 the graph will be at negative infinity. We would need a symmetric log plot.

risicle commented 2 years ago

If/when we get to 0 I'm perfectly happy for the graph to explode, in fact it would be a fitting celebration.

vcunat commented 2 years ago

staging-next merged now.

06kellyjac commented 2 years ago

looks like the pandas fix (#173177) missed that staging-next merge? does that mean it won't make release?

veprbl commented 2 years ago

looks like the pandas fix (#173177) missed that staging-next merge? does that mean it won't make release?

As a trivial fix to a few darwin non-builds with no linux rebuilds, it should have gone straight to master.

ncfavier commented 2 years ago

If I'm reading the release schedule correctly, there's still a staging-next iteration left before branch-off (assuming we're late on the schedule and not early).

neilmayhew commented 2 years ago

jl is now fixed (#168256)

Madouura commented 2 years ago

tinygo should be fixed with #157129

Ma27 commented 2 years ago

#166817 fixes privacyidea and is now ready to review :)

trofi commented 2 years ago

Filed upstream bug as https://sourceware.org/PR29162 for gnat / glibc incompatibility.

schuelermine commented 2 years ago

I don’t see a filter option on the hydra page. How can I filter for failed jobs on my system?

vcunat commented 2 years ago

@schuelermine: that's the "search jobs by name" field. (you could directly edit the URL, too)

schuelermine commented 2 years ago

What’s the syntax for filters?

vcunat commented 2 years ago

None, AFAIK. It's a contiguous substring match, or however one would call it.

schuelermine commented 2 years ago

Oh, that’s unintuitive. I would expect “search jobs by name” to search by name only
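To make the exchange above concrete, here is a minimal sketch of contiguous-substring matching over job names. It only illustrates the behaviour described above and is not Hydra's actual code; the function name `filterJobs` and the sample job names are made up, and whether Hydra's matching is case-sensitive is an assumption here.

```javascript
// Illustration only: keep the job names that contain the query as a
// contiguous substring (case-sensitive in this sketch, which is an assumption).
function filterJobs(jobNames, query) {
  return jobNames.filter((name) => name.includes(query));
}

// Made-up job names in the usual "<attribute>.<system>" form.
const jobs = [
  "hello.x86_64-linux",
  "hello.x86_64-darwin",
  "util-linux.x86_64-darwin",
];

// Matches on the system suffix, which is why filtering by architecture works.
console.log(filterJobs(jobs, "x86_64-linux"));
// A shorter query also matches package names that merely contain the string.
console.log(filterJobs(jobs, "linux"));
```

Because the system is part of the job name, a "search by name" field matches on architecture as well, which is the unintuitive part noted above.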

dasJ commented 2 years ago

release-22.05 has been branched off, so remember to also add the backport release-22.05 label to your Pull Requests :tada:

cab404 commented 2 years ago

I've kinda put down a list of packages broken with the stdenv update (#zhfff): https://gist.github.com/cab404/96259f25450d778e744108c0ea9bfaa8 It's parsed from Hydra output with something like this:

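// Run in the browser console on a Hydra eval comparison page.
[...document.querySelector("#tabs-now-fail > table:nth-child(1) > tbody:nth-child(2)").children]
  // keep rows whose status icon says "Failed"
  .filter((e) => e.getElementsByClassName("build-status")[0].getAttribute("alt") === "Failed")
  // keep x86_64-linux rows (the sixth column holds the system)
  .filter((e) => e.children[5].textContent === "x86_64-linux")
  // take the text of the third column (the job name)
  .map((r) => r.children[2].textContent)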

These only include the ones broken in eval 1756238 and still broken in eval 1763443.
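How the two evals were combined is not stated above. One plausible way, sketched here as an assumption, is to run the snippet on each eval page and intersect the two resulting lists of job names. The function name `stillBroken` and the sample job names are hypothetical.

```javascript
// Hypothetical helper: given the job names failing in an earlier eval and
// those failing in a later eval, keep the ones present in both.
function stillBroken(failingInFirst, failingInSecond) {
  const second = new Set(failingInSecond); // set gives fast membership checks
  return failingInFirst.filter((job) => second.has(job));
}

// Made-up job names for illustration.
const first = ["foo.x86_64-linux", "bar.x86_64-linux", "baz.x86_64-linux"];
const second = ["bar.x86_64-linux", "baz.x86_64-linux", "qux.x86_64-linux"];
console.log(stillBroken(first, second));
```

The result keeps the order of the first list, so it can be pasted into a gist without re-sorting.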