Hydra

lukego commented 2 years ago

This is experimental work-in-progress but I am running a Hydra (Nix CI) instance with builds of the Lisp packages: https://hydra.nuddy.co/

I started last month with one 16-core Ryzen3 CPU running Linux/x86-64 builds.

This week I added an 80-core Ampere ARM64 build machine and just this moment kicked off an experimental larger build of {sbcl,ecl,abcl} on {x86-64, i686, arm64} at https://github.com/NixOS/nixpkgs/pull/193754#issuecomment-1272877277.

The next machine I add will be a Mac M1. Then we should have quite good cross-platform test coverage for Lisp packages.

The tests are very basic right now, and only served up in the raw Hydra webUI, but the intention is to also generate some human-readable reports and e.g. to directly monitor some key projects like SBCL to test changes before they are released.

If anyone wants to collaborate on this, e.g. to have Hydra test some branches of their own for Lisp packages, just leave a comment and I will try to help. The big idea is just to efficiently find and fix problems in Lisp libraries (or their Nix packagings.)

Uthar commented 2 years ago

It's very nice of you to set up this infrastructure for free

Could you add this branch? https://github.com/Uthar/nixpkgs/tree/fixes

I fixed a bunch of packages there, but have not created a PR yet

lukego commented 2 years ago

It's very nice of you to set up this infrastructure for free

Thanks. I'm planning to reuse this infrastructure for multiple purposes, so starting with the Nix/Lisp packages is a good exercise to get it set up while also being useful to the Lisp community.

Could you add this branch? https://github.com/Uthar/nixpkgs/tree/fixes

I have added the job:

https://hydra.nuddy.co/jobset/nixpkgs-lisp/sbclPackages-Uthar-fixes#tabs-evaluations

However, you will need to clone the lispnix repo, because I pointed that at your GitHub URL: https://hydra.nuddy.co/jobset/nixpkgs-lisp/sbclPackages-Uthar-fixes#tabs-configuration

This way you can decide what Hydra will test by pushing changes to the sbclPackages.nix file on your lispnix branch. Hydra is polling the relevant Git repos every 5 minutes.

I think there is a better way to do all of this, for example using Nix Flakes to define the CI jobs, but one step at a time...

lukego commented 2 years ago

I'm having a lot of errors on my builds at the moment. I'm not sure if it's a setup issue in Hydra. I'll troubleshoot this tomorrow :)

lukego commented 2 years ago

(I'm still troubleshooting :))

Uthar commented 2 years ago

enjoy :)

(I'm still troubleshooting :))

Uthar commented 2 years ago

Clasp 2.0.0 was released yesterday. It can now save-lisp-and-die into an executable, which starts very fast! I packaged it here as github:uthar/dev#clasp, maybe it's nice to include it in the hydra builds? I'll add it to the github actions in nix-cl

lukego commented 2 years ago

Great idea re: Clasp!

Sorry I've been distracted lately: traveling to NixCon last week and then doing other work this week. Quick update:

Hydra is running at http://ex43-0.nuddy.co/. I need to troubleshoot why the hydra.nuddy.co hostname isn't working: seems to be an SSL-related issue from when I recently moved the Hydra installation from EC2 to a Hetzner machine.

Hydra jobs

Hydra has three ways to define jobs and I've now cycled through all of them :)

Manually in the UI. Simple but awkward and requires provisioning accounts.
Synchronized via Terraform. Less awkward but more centralized.
Declarative jobsets. Hopefully easier for collaboration: Hydra fetches Git repos and evaluates Nix expressions to define the jobs to run. Currently I have a test jobset set up at https://github.com/nuddyco/nuddy-hydra-jobsets but I really need to define a proper one.

@Uthar if you want to experiment with Hydra you can clone that jobsets repo ^ and I can have Hydra poll your fork too.

Lisp package state

I'd like to have a Hydra job that keeps track of the exact status of every package for every Lisp. That is loosely state = <working,test-error,load-error,build-error,dependency-error>.

This is a bit tricky because Nix doesn't support first-class errors and has a hard time representing e.g. a derivation that failed to build due to broken dependencies. But I was at NixCon last week and I took the chance to discuss this with Eelco Dolstra. He thought it sounded OK to run Nix recursively for this, i.e. an inner nix-build can fail but it is detected and tolerated by the outer nix-build that Hydra is running.

I have tried to implement that but so far recursive Nix doesn't work for me as it does for others: https://github.com/NixOS/nix/pull/3205#issuecomment-1290041171

I know that an alternative approach is to extend the Lisp code that calls out to nix-build, i.e. use nix-shell as the outer Nix layer instead of nix-build, but when I started doing that it felt a bit like I would end up reimplementing Hydra eventually so I'd like to try harder to keep it all within a Nix build.

I might be on totally the wrong track here though...

Uthar commented 2 years ago

Ahh, I wish I had known, I would have gone to NixCon too (-:

lukego commented 2 years ago

@Uthar Rumour has it that FOSDEM 2023 will have a Nix devroom (hopefully.) I'm going: maybe I'll meet you there :-) https://fosdem.org/2023/

Uthar commented 2 years ago

Where can I learn more about recursive nix builds? Maybe it will be useful for implementing checkPhase for Lisp packages, because their tests exist in separate packages. Or is there something like testInputs in mkDerivation?

lukego commented 2 years ago

I don't think recursive nix builds are really documented properly but you can see some of the machinery at e.g. https://git.alternativebit.fr/NinjaTrappeur/Nix/commit/c4d7c76b641d82b2696fef73ce0ac160043c18da?style=unified&whitespace=ignore-all

The idea is simple i.e. you call nix-build from inside a nix build. The implementation is more tricky e.g. making sure the right store paths are transported in/out of the build sandbox. But it seems to have transitioned from "something that basically does not work" into "something that basically does actually work."

Here is the derivation that I have now: https://github.com/nuddyco/lispnix/blob/main/try-build.nix

This attempts to use an inner nix build to build one Lisp package from inside an outer lisp build that is making a list of packages that do/don't build. The recursive nix here is basically try-catch: the outer build can succeed (producing log files as output) even if the inner builds fail (because some package is broken or has broken dependencies.)
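The try-catch shape described above can be sketched in plain shell. This is only an illustration of the control flow, not the contents of try-build.nix: the function name `try_build`, the log layout, and the two status words are assumptions made for the example, and in a recursive-nix setup the wrapped command would be the inner nix-build.

```shell
#!/bin/sh
# try_build NAME LOGDIR COMMAND [ARGS...]
# Hypothetical helper: run COMMAND, save its combined stdout/stderr to
# LOGDIR/NAME.log, and print one "NAME,STATUS" line.
# It always returns 0, so a failing inner build does not abort the caller.
try_build() {
    name=$1
    logdir=$2
    shift 2
    mkdir -p "$logdir"
    if "$@" >"$logdir/$name.log" 2>&1; then
        status=working
    else
        status=build-error
    fi
    printf '%s,%s\n' "$name" "$status"
    return 0
}
```

For example, `try_build alexandria ./logs nix-build '<nixpkgs>' -A sbclPackages.alexandria --no-out-link` would record one row for that package whether or not it builds; the attribute path is only an example.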

Uthar commented 2 years ago

New QL release 2022-11-07

Would be cool to compare hydra results with last release

lukego commented 2 years ago

Hydra doesn't seem to be very happy at the moment. It's a bit temperamental.

The job queue was getting stuck almost immediately and I suspect it's this issue: https://github.com/NixOS/nix/issues/6981

I tried working around that by downgrading from master to release-22.05. Now the queue is running smoothly, but seemingly without the recursive nix stuff producing the right results (?).

I suspect that I should be deploying Hydra using Flakes and getting a copy of exactly the flake.lock file that pins everything to the right versions from somebody in the know... but the admin/setup/troubleshooting side of Hydra is not much fun so I will have to come back after doing some other things!

lukego commented 2 years ago

Here's the latest on my struggles with recursive nix: https://github.com/NixOS/nix/issues/7276. It seems to work except that we don't get all the log output that we need to diagnose build failures.

lukego commented 1 year ago

Maybe I have made peace with recursive nix now: https://github.com/NixOS/nix/issues/7276#issuecomment-1311570416

This is really brain-stretchy stuff so I'll need a short break before I try applying this to building a catalogue of working/broken Lisp packages :)

Uthar commented 1 year ago

I'd like to start learning Hydra, but don't know where to start; I found the official documentation lacking. Do you recommend any resources? Does it make sense to go straight into declarative flakes, skipping the clicky click stuff?

lukego commented 1 year ago

Honestly my current opinion is that it's better to write and debug all the Nix code locally and then as a last step to migrate it to Hydra.

I will let you know when I have a good example.

lukego commented 1 year ago

Seems that some of my "Hydra problems" are really just recursive-nix problems: https://github.com/NixOS/nix/issues/7297

But for now I still think it is worth being patient and pushing ahead with recursive-nix for building software test reports e.g. keeping track of which builds fail and why. I don't really like the alternatives e.g. tracking it only in Hydra or with homebrew scripts running nix-shell.

lukego commented 1 year ago

I'm "flake-ifying" the Hydra setup today. I know I know I am late to the flakes party.

First I redeployed Hydra using the flake github:nixos/hydra instead of some random version from nixpkgs. I hope this means that I'm running all the exact same pinned software as the upstream Hydra and that this will help with reliability and troubleshooting.

I also tried defining a Hydra jobset using a flake. It seems to work well and is easy to set up. You just have to point Hydra at a repository containing a flake.nix that produces a hydraJobs attribute saying which derivations to build.

Here's an example Hydra evaluation of the sbcl packages on x86_64-linux and aarch64-linux: http://ex43-0.nuddy.co/eval/1429

This is from a fork of this nix-cl repo with two tweaks:

Added hydraJobs = packages.sbcl.pkgs;
Pruned outputs to only build Linux-based platforms, because today I wasn't smart enough to see a simple way to filter out the non-working macOS derivations from hydraJobs otherwise (my brain is still processing flakes in general.)

@Uthar I also pointed Hydra at this upstream repo so if you want to try putting some derivations on the hydraJobs attribute then they should get built automatically over at http://ex43-0.nuddy.co/jobset/nix-cl/nix-cl. The page is showing an error now only because there is no hydraJobs attribute on the flake.

Uthar commented 1 year ago

Hah, better sooner than later

Sounds really good with the Hydra stuff. I'm excited to try it out - there's a lot of work to do, 1000 failing packages, right?

lukego commented 1 year ago

1000 failing packages, right?

Sort of: that's combined for x86_64-linux and aarch64-linux. So maybe it's 500 packages, each failing on both.

I'd like to generate a table / CSV file with columns:

package
arch
success?
missing_library (e.g. libcrypto.so based on reading logs)
full_logs_path_or_url

I'd imagined building this using recursive nix. That is, the outer build runs one inner build for each package and collects the results in a table. (I don't think you can do that with normal nix code because there is no try-catch mechanism on builds.) However this might be a dead end because recursive nix has been reliably freezing my nix daemon (https://github.com/NixOS/nix/issues/7297).
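Whichever mechanism ends up collecting the results, one row of the table above could be produced by a small shell function. This is a sketch under assumptions: `report_header` and `report_row` are hypothetical names, the exit code and log file are taken as given, and the missing library is guessed as the first token in a failed build's log that looks like a shared-library name.

```shell
#!/bin/sh
# Print the CSV header for the columns listed above.
report_header() {
    printf 'package,arch,success,missing_library,full_logs_path_or_url\n'
}

# report_row PACKAGE ARCH EXIT_CODE LOGFILE
# Print one CSV row. For failed builds, missing_library is the first
# substring of the log shaped like libNAME.so or libNAME.so.1.2;
# it is left empty for successful builds or when nothing matches.
report_row() {
    package=$1
    arch=$2
    exit_code=$3
    logfile=$4
    missing=
    if [ "$exit_code" -eq 0 ]; then
        success=true
    else
        success=false
        missing=$(grep -oE 'lib[A-Za-z0-9_+-]+\.so(\.[0-9]+)*' "$logfile" 2>/dev/null | head -n 1)
    fi
    printf '%s,%s,%s,%s,%s\n' "$package" "$arch" "$success" "$missing" "$logfile"
}
```

The library guess is deliberately naive: it would need refining once real logs show which messages reliably name the missing file.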

So how do you think we should build a report like that? Seems like options include recursive nix (above), downloading results via the hydra api, running tests in nix-shell instead of hydra to collect logs, or...?

btw it would be interesting to know if you can reproduce the recursive nix problem cited above. Just in case there is something weird about my setup and it's not really a nix bug.

Uthar commented 1 year ago

Ah makes sense, yes

I think we could search for BUILD FAILED near the end of the log. Currently the builder catches and prints the build error like this: https://github.com/Uthar/nix-cl/blob/aee754f47672fe4ade6c9da1cbde2a31e3a08e0f/builder.lisp#L4-L12

We could collapse all the newlines in this message to make this easier to get - just read the last line
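A minimal sketch of the "just read the last line" idea, assuming the builder has already collapsed the error onto a single final line of the form `BUILD FAILED: <message>`. The exact line format and the function name `failure_reason` are assumptions for illustration.

```shell
#!/bin/sh
# failure_reason LOGFILE
# Print the text that follows "BUILD FAILED" on the last non-empty line
# of the log, or print nothing if the log does not end with such a line.
failure_reason() {
    grep -v '^[[:space:]]*$' "$1" | tail -n 1 | sed -n 's/^.*BUILD FAILED:* *//p'
}
```

An empty result then means the build either succeeded or failed in a way the builder did not catch.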

For the package, arch, success, we can just take from the Nix build, but I'm not sure about the link to logs

lukego commented 1 year ago

Progress!

I now have a flake that builds all Lisp packages (currently for sbcl on x86_64-linux), collects the logs (including failures), and makes a table of results (currently raw CSV with limited details.) It's hooked up to Hydra.

This should be a good starting point. Going forward we need to test on more platforms, and extract more columns for the table, and present the results in more interesting ways.

Links:

Build with link to CSV data: http://ex43-0.nuddy.co/build/237763
Flake: https://github.com/lukego/nix-cl-report/blob/master/flake.nix
Supporting code for logging builds that succeeded, failed, or were aborted due to failed dependencies: https://github.com/lukego/nix-cl-report/blob/master/withBuildLog.nix

I had to do quite some kludgery to trap failed builds and convert them into logs. Not the end of the world but I think in the future recursive nix could handle this much better.

Uthar commented 1 year ago

Congrats (-: So only 236 failing systems

We could test MacOS/Linux, GCC/Clang, JDK versions, ASDF versions, Glibc/Musl - it's quite the matrix

Maybe also test if SBCL bootstrapped from each other implementation behaves the same

Uthar commented 1 year ago

Other things:

Let's add a way to easily build static executables using https://www.timmons.dev/posts/static-executables-with-sbcl-v2.html
Let's add a way to easily build Clasp with additional C++ libraries, such as... Nix. https://clasp-developers.github.io/clbind-doc.html

It would be so cool to manipulate the store and derivation in Common Lisp.

lukego commented 1 year ago

I've extended the flake a bit: here's a recent build.

This links two artifacts:

report.csv now contains 36K rows.
report.png is a first example R/ggplot2 plot from the data.

Currently it's testing this test matrix:

        { lisp = "sbcl";  system = "x86_64-linux";  }
        { lisp = "clasp"; system = "x86_64-linux";  }
        { lisp = "ccl";   system = "x86_64-linux";  }
        { lisp = "abcl";  system = "x86_64-linux";  }
        { lisp = "ecl";   system = "x86_64-linux";  }
        { lisp = "sbcl";  system = "aarch64-linux";  }
        #{ lisp = "clasp"; system = "aarch64-linux";  }
        #{ lisp = "ccl";   system = "aarch64-linux";  }
        { lisp = "abcl";  system = "aarch64-linux";  }
        { lisp = "ecl";   system = "aarch64-linux";  }

I disabled clasp and ccl on aarch64-linux due to some "unsupported platform" errors but I didn't dig too deep.

Here's the very basic example pic:

(image: summary)

Uthar commented 1 year ago

Wow! that is super cool. I'll look into packaging Clasp for arm. I think CCL does not run on arm 64 bit. There's also CLISP we should test some day

lukego commented 1 year ago

I exposed the aggregated logs from all the builds now: http://hydra.nuddy.co/build/282896

So now we have ~5M lines of output from the builds to help understand why they don't all work :grin:

I'd like to categorize these errors and indicate them in the CSV data. Then we could e.g. detect when packages fail on different implementations for different reasons and so on. But what should the categories be?

Here's a quick regex hack look at common error messages:

 zcat lisp-build-logs.txt.gz | ~/git/nix-cl-report/scanner.awk | sed 's/^OTHER: .*/(other)/' | sort | uniq -c | sort -nr
    725 (other)
    434 unable-to-load-any-of-the-alternatives
    268 component-not-found
    223 unable-to-load-foreign-library
    157 subprocess
    127 unable-to-open
     98 filesystem-error-with-pathname
     82 load-definition-for-system
     70 error-opening
     67 cant-create-directory
     44 package-cant-be-found
     41 variable-is-unbound
     40 value-not-of-type
     38 permission-denied
     30 no-package-named
     25 slot-is-unbound
     18 package-does-not-exist
     13 wrong-number-of-arguments
     13 no-applicable-method
     12 unrecognized-character-name
     11 value-not-expected-type
     11 lisp-does-not-support-weak-hash-tables
     10 java-exception
     10 couldnt-execute
...

The top few lines make me think there is a lot of low-hanging fruit in terms of missing dependencies. Have to think about an efficient way to fix all of those and keep them fixed.
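For readers who want to try the same kind of tally, here is a rough sketch of a categorizer in the spirit of the pipeline above. It is not the actual scanner.awk: the patterns, the handful of categories, and the function names `categorize` and `tally` are assumptions chosen to mirror a few of the slugs in the listing.

```shell
#!/bin/sh
# categorize: read log lines on stdin and print one category slug per
# recognised error line. Unrecognised lines that mention "error" are
# printed as "OTHER: <line>"; every other line is ignored.
categorize() {
    awk '
        /Unable to load any of the alternatives/ { print "unable-to-load-any-of-the-alternatives"; next }
        /Unable to load foreign library/         { print "unable-to-load-foreign-library"; next }
        /Component .* not found/                 { print "component-not-found"; next }
        /[Pp]ermission denied/                   { print "permission-denied"; next }
        /[Ee]rror/                               { print "OTHER: " $0 }
    '
}

# tally: fold all OTHER lines together and count each category,
# most frequent first (the same sort | uniq -c | sort -nr idiom as above).
tally() {
    sed 's/^OTHER: .*/(other)/' | sort | uniq -c | sort -nr
}
```

Usage would look like `zcat lisp-build-logs.txt.gz | categorize | tally`. Keeping the raw `OTHER:` lines until the tally step makes it easy to inspect them and promote frequent ones to their own category.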

lukego commented 1 year ago

Good news maybe: I looked into why my "jumbo dependency" builds are failing and the main error reason is still foreign libraries:

(image: error-variant)

So it looks like I need to debug the inject-jumbo-dependencies logic to eliminate those errors before we know how serious the remaining errors are.

lukego commented 1 year ago

JFYI: I added a Mac Mini M1 build slave to the Hydra now. It's a bare metal device sitting in my home office unlike the other machines that are all hosted at Hetzner.

Uthar / nix-cl

Hydra #13
