For now, what about increasing the evaluation timeout? The killed evaluations seem like a waste of resources; better to run them less often and give them more time.
The Haskell and R package sets have gained some 7k packages per architecture since release-14.12. I'm sure that we don't need the vast majority of those builds. It's just that nobody really knows which builds we need and which ones we don't.
I see even "trunk" is getting these timeouts: http://hydra.nixos.org/jobset/nixpkgs/trunk#tabs-errors; perhaps not so often, but it's been three days without an evaluation ATM.
Let's have a look at current master 7dba3bafba, x86_64-linux only. The large sets I can see, as reported by nix-env -qa (simplified; package counts in thousands):
11.7 haskellPackages
8.8 rPackages
4.0 ^python and ^pypy
3.0 ^emacs
1.0 ^perl
0.6 ^go1[45]
0.4 ^linux
0.5 ^kde5
0.3 ^kde4
0.3 ^gnome3
The above amounts to ~60k packages in trunk-combined (due to it having two platforms). Note that Haskell and R together make up more than half of all jobs in trunk-combined.
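For anyone who wants to reproduce a tally like this, here's a rough sketch (exact flags may differ between Nix versions; haskellPackages and ^python are just two entries from the list above):
nix-env -f ~/nixpkgs -qaP -A haskellPackages | wc -l   # count one attribute set by attrpath
nix-env -f ~/nixpkgs -qa | grep -c '^python'           # count packages by name prefix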
I can think of splitting these large sets into separate jobsets and keeping only a few important packages in the default sets, but that would also have some disadvantages; e.g. one couldn't easily see build-status changes around some point in time for all of the packages Hydra builds.
We can reduce the size of the jobset now, but it will inevitably begin to grow again. Should we be investigating ways to speed up evaluation? Is the bottleneck the Nix interpreter, or I/O? I suspect the latter, because hydra-eval-jobs runs very quickly on my SSD-based laptop, but I don't have conclusive data.
I was just throwing around the idea the other day: even if the main Hydra box doesn't have an SSD, it still has lots of memory, right? If so, it could put its git clones onto a tmpfs prior to evaluation. That should make evaluation even faster than an SSD.
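A minimal sketch of what that could look like (the mount point, size, and clone path are assumptions):
mount -t tmpfs -o size=4G tmpfs /tmp/hydra-eval
git clone --depth 1 file:///var/lib/hydra/scm/nixpkgs /tmp/hydra-eval/nixpkgs
# ...then run the evaluation against /tmp/hydra-eval/nixpkgs instead of the on-disk clone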
I/O, quite certainly... writing all those *.drv files and registering them in the Nix store.
Personally, I wouldn't store *.drv files on disk at all and would instead regenerate them on demand, in memory only, as IMHO there's little benefit from caching them; but that would be a nontrivial change in Nix's core.
> there's little benefit from caching them, but that would be a nontrivial change in nix's core.
Would this essentially require unifying nix-instantiate and nix-store? I'm not saying that's a reason not to do it; I rather like this idea.
Sorry for going off-topic.
Yes... normally you always run nix-instantiate first and then nix-store --realise, with the .drv passing from one to the other through the Nix store; instead they could by default be handed over in a shared in-memory data structure (both tools are in C++ already). IIRC remote builds already support some reduction around .drv.
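For reference, the two-step flow in question, with hello standing in for an arbitrary job:
drv=$(nix-instantiate '<nixpkgs>' -A hello)  # writes a .drv file into /nix/store
nix-store --realise "$drv"                   # reads that .drv back from disk and builds it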
I suppose we could disable the builds for R since those packages tend to compile very quickly, so it shouldn't be too bad for our users if they have to build them themselves.
The Haskell package set could probably be cut down to just those packages that are part of Stackage; that would save approx. 5k builds per platform, i.e. another 15k builds less. People will notice that, however, because many important packages are not part of Stackage and would thus no longer be built.
Likewise, the Emacs packages aren't really compiled at all; we could remove them from the jobset with minimal impact.
That was done in #12365 (perhaps partially) and as a result we have ~18k nixpkgs jobs less.
@vcunat That's a little misleading. I have only removed the duplicate jobs that came about as a result of emacsPackagesNg having several names for legacy reasons. Hydra is still building all the Emacs packages.
I was aware of that, but I'm not certain what exactly slows the evaluation down.
Soon I'm also going to start generating the Emacs package expressions directly, instead of generating JSON. That will be a small improvement, at least, because the JSON undergoes some processing after it's imported. (That's not the reason I'm changing the package format; it's just a side effect.)
> If so, it could put its git clones onto a tmpfs prior to evaluation.
Let me second this opinion. My own experiments indicate that it could improve evaluation times by up to two orders of magnitude. Hydra only needs a shallow clone for the evaluation, so it wouldn't even take up that much memory.
Here are my numbers for anyone interested. This is just one sample; I doubt it's the last word on evaluation performance, but there is clear evidence Hydra would benefit from evaluating expressions from RAM.
@edolstra we could do tmpfs for the hydra checkout, right? It's worth a shot as the next optimization, as @ttuegel suggests.
@domenkozar The problem is not reading the Nix expressions, it's writing the .drv files to disk. (There is also some slowness on the PostgreSQL side, but that's probably not too bad.)
Plot of nixpkgs:trunk evaluation time:
By comparison, this takes only 230s on my home computer (with SSD).
Number of jobs in the nixpkgs:trunk jobset:
@edolstra how do you know write performance is the key factor here? I think @ttuegel thought that too, until he ran some tests and found that read performance was a major factor.
So really two things to do:
@domenkozar Yeah, that's the plan. In the interim, I'll reduce the number of jobs.
> @ttuegel thought that too until he ran some tests and found that read performance was a major factor
On my spinning-disk machine, I got 2 orders of magnitude speed-up (consistent over multiple trials) just from moving my Nixpkgs clone onto tmpfs.
Yeah, reading from a non-SSD doesn't help matters either, but given the size of /var/lib/hydra/scm, putting it on a tmpfs is probably not an option.
@edolstra: (why) do we need to generate *.drv files during Hydra's eval? I would think it enough to know the sources and the attr-path, so instantiation could be postponed until each individual job is processed in the queue. IIRC you said there is a first pass through the queue that includes a check that we don't already have the outputs, so that would seem the proper place to instantiate.
Yeah, in principle the queue runner could re-evaluate the job (basically do nix-instantiate -A jobname). However, that would require each job to be evaluated twice, and it would mean keeping the source trees around until the last job has been built.
I assume the source tree checkouts aren't handled like derivations (yet)? I imagine calling fetchgit or similar and then using the result from the Nix store. Then one wouldn't even really need to track whether anything in the queue still depends on it, because it would just be re-fetched instead of reusing the derivation output...
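A rough sketch of that idea, with builtins.fetchTarball standing in for whatever fetcher Hydra would actually use:
src=$(nix-instantiate --eval -E 'builtins.fetchTarball https://github.com/NixOS/nixpkgs/archive/master.tar.gz')
nix-instantiate "$src" -A hello   # instantiate a single job on demand from the cached checkout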
What is the size of the build? If it's too large to economically put on an SSD, a hybrid filesystem (e.g. ZFS with a SLOG and L2ARC) gets you much higher performance than an HDD on its own.
But I have a four-core machine with dual SSDs sitting mostly idle. I'd like to try running a build on that, and on a hybrid server for comparison, if there are any idiot-proof instructions anywhere.
It'd also be great if you could run Prometheus' node_exporter on the Hydra machine. I'd be happy to help set that up, if you want; doesn't need root access. (Or I could talk you through it on IRC.)
If you want to test Hydra performance without fully configuring Hydra, just try these two commands:
nix-build '<nixpkgs>' -A hydra
./result/bin/hydra-eval-jobs -I ~/nixpkgs/ ~/nixpkgs/pkgs/top-level/release.nix
It will spit out a giant blob of JSON, chew up 4 GB of RAM, and write .drv files for every single package to the store.
Edit: avoid doing it on btrfs; it appears to have overloaded the system by creating 30,000 files and outright crashing the filesystem.
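If you want numbers you can compare across machines and filesystems, it may help to time the second command and send the JSON to a file:
time ./result/bin/hydra-eval-jobs -I ~/nixpkgs/ ~/nixpkgs/pkgs/top-level/release.nix > eval.json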
I believe the situation has improved a lot during the last few weeks, probably due to results being stored directly in the binary cache instead of on the central machine. (I think that change has happened, but I don't have any definite proof.)
An evaluation now seems to typically take hundreds of seconds (even a mass-rebuild one), compared to the previous thousands.
> In the interim, I'll reduce the number of jobs.
Does it mean we can now increase the number of jobs significantly again?
I'd be curious to know what the state of hydra.nixos.org is these days. Have the optimizations discussed earlier been implemented? Does it copy closures into the cache directly after they've been built by a slave?
I believe it does, and the (real) time needed for evaluations dropped ~10x, probably as a result of reduced I/O load.
Two things were implemented by Eelco and Rob:
I've had a quick discussion with Eelco on the #nixos channel about nixpkgs-complete, which would build all the language packages every few days. This way it's not that intensive for Hydra, but the packages still don't bitrot.
Closing this issue due to inactivity, and the seemingly fine performance of Hydra since April.
The job counts were cut down, too, IIRC. (I can't verify now as Hydra seems down ATM.)
The Nixpkgs/NixOS jobsets are getting really huge. For example:
Due to this, Nixpkgs evaluation is timing out (e.g. the staging jobset cannot be evaluated anymore: http://hydra.nixos.org/jobset/nixpkgs/staging#tabs-errors). It also delays channel updates; e.g. the nixos:trunk-combined channel cannot be updated until ~70k packages have been built.
It would be good to know where this growth is coming from, and do some trimming if appropriate.