Open domenkozar opened 9 years ago
Maybe restarting the build should also delete previous result?
I don't think this is an issue, it's expected behavior and how should work. Can we close?
It's not an issue because it follows the design, but it's an usability problem.
I think 2 issues are being confused here. A works-like-designed evaluation issue, which has nothing to do with not being able to restart a build that already is built.
Please create a new issue for the latter.
Well, "works as designed" can still be an issue if the design is wrong :-)
In this case, the design did not take fixed-output derivations into account.
Feel free to reopen it if you think this should be fixed.
Btw, I still don't see why evaluator would create a new build if the result is already there. The restart issue is genuine, that's why a new issue should be created for that.
It's not about restarting: you need a new derivation to be instantiated if for instance the URL in a fetchurl call changed.
Ah, then we can just add an extra check on drvpath right, in stead of just checking the out path?
That's the easy fix, but the reason why we don't do that is to prevent the creation of tens of thousands of new builds every time somebody changes the fetchurl mirrors list (for example).
Fixed by ixpkgs/commit/13dcb3523309c318e8dce59fae168ea39f98065c?
No, that's about the difficulty in restarting builds that failed with output. This issue is about builds whose derivation changed but not the output path.
This really confused me a while ago – seems like a wrong fix to this issue was used.
Hydra.nixos.org re-queued ~30k builds in nixpkgs/trunk even though it has successfully built the same results already a few days ago in nixpkgs/staging. They have equal output store paths, though the derivations differ due to a change in mirror list https://github.com/NixOS/nixpkgs/commit/b80a497b85 (I pushed that one, incidentally). Example pair:
After that it deals with the duplicate builds relatively quickly, on the order of a hundred per minute, but I don't think it should queue them at all.
Now it even re-queues builds of the same derivations (equal *.drv paths) from a different jobset. Example corresponding to the above: http://hydra.nixos.org/build/23844548#tabs-details Note e.g. that the re-queuing in this case happened ~6 minutes after a build of the same derivation finished successfully.
That seems rather wasteful in the long term, given by us having many duplicities among jobsets (nixpkgs, nixos, etc.).
Why is it wasteful?
Previously the evaluator had a check for cached builds, but now the queue runner does that.
Why is it wasteful?
I see, so evaluation is faster now. It was one mirror change and Hydra was clearing this duplicate ~30k builds for about an hour, doing nothing else, and then the some work again for the other jobset (not sure if another hour or faster, derivations being equal now) – that didn't seem efficient for such a powerful set of machines, but maybe the work is just more visible now.
So is the core of work in this case be done by nix-store --realize
in the queue-runner, i.e. mainly sqlite queries on the validity of derivation outputs? I expect there's a difference that evaluation-time work did in no way block the real building work, assuming there was something else in the queue and evaluation was done on a separate machine. There might be overhead due to adding into queue also results into DB insertion (not just reading), at least I suppose so from the front-end presentation, but I know too little of hydra workings to be a good judge.
The new queue runner is multithreaded. There is one thread that processes changes to the queue and another that starts build steps. The first thread is the one that checks whether builds have already been done and if so marks them as cached builds in the database. This does not block the dispatcher thread from starting build steps.
The one downside is that the cached builds in a derivation are no longer marked as finished in a single transaction, as was previously the case. OTOH, it means that the queue runner doesn't have to wait a long time for uncached builds to become visible/buildable.
Hmm, yes, with this superficial information I know now, it doesn't look wasteful anymore.
Restarting a build with changes like https://github.com/NixOS/nixpkgs/commit/6e726e1348a4ab130b7c531b784b62b6dfa3dbb6 is a noop.