Open FPtje opened 7 years ago
Update:
I took a look at these two lines and added the following debug prints:
printMsg(lvlError, format("Can this happen!?"));
if (!build->finishedInDB) { // FIXME: can this happen?
    printMsg(lvlError, format("YES, THIS CAN HAPPEN"));
    (*builds_)[build->id] = build;
}
I'm glad to be of service and answer the question in the comment, but I'm not yet sure what it means.
These are the contents of the builds table in PostgreSQL, queried after the above message was observed:
-[ RECORD 1 ]--+-------------------------------------------------------------------------
id | 1
finished | 0
timestamp | 1481897112
project | lumi
jobset | staging
job | <job name>
nixname | <our job's nix name>
description |
drvpath | /nix/store/9kf5k2gi719ysbqx7bnz6fnsvkhkaqnw-<our-derivation>.drv
system | x86_64-linux
license |
homepage |
maintainers |
maxsilent | 7200
timeout | 36000
ischannel | 0
iscurrent | 1
nixexprinput | lumi
nixexprpath | release.nix
priority | 100
globalpriority | 0
starttime |
stoptime |
iscachedbuild |
buildstatus |
size |
closuresize |
releasename |
keep | 0
@edolstra
I see the exact same problem on our Hydra instance. Hydra runs on and builds for a normal 64-bit Intel machine; no cross-compilation is involved. We are using Hydra version de55303197d997c4fc5503b52b1321ae9528583d.
I'm in a similar situation. Currently I suspect that the problem is the nix-daemon from the system's nix (1.11.4) trying to evaluate the builds. @FPtje, which version of nix are you using in the system?
1.11.7
@FPtje the problem that I had has a much simpler explanation. The hydra-queue-runner wasn't able to find the ssh executable, so the jobs were never evaluated. The problem was introduced because I was migrating the service to systemd, and the PATH is not inherited by default into the service's environment. This is on an Ubuntu 16.04 server.
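A quick way to confirm this kind of failure is to check whether an executable can be resolved from the PATH the service actually sees. The following is only a minimal sketch: findOnPath is a hypothetical helper written for illustration, not part of Hydra, and it simplifies the real lookup rules (empty PATH entries are skipped, and directories are not distinguished from files).

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <unistd.h>

// Hypothetical helper: return the first "<dir>/<name>" on the colon-separated
// list `pathVar` that the current user may execute, or "" if there is none.
std::string findOnPath(const std::string & name, const std::string & pathVar)
{
    std::istringstream dirs(pathVar);
    std::string dir;
    while (std::getline(dirs, dir, ':')) {
        if (dir.empty()) continue; // simplification: ignore empty entries
        std::string candidate = dir + "/" + name;
        if (access(candidate.c_str(), X_OK) == 0) return candidate;
    }
    return "";
}

// Same lookup, using the PATH of the current process. Under systemd this is
// the PATH of the unit, not the PATH of an interactive login shell.
std::string findOnPath(const std::string & name)
{
    const char * path = std::getenv("PATH");
    return findOnPath(name, path ? path : "");
}
```

Calling findOnPath("ssh") from inside the service's environment and getting an empty string back would point to the situation described above.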
Thanks to a tip from @expipiplus1 I managed to fix our long-standing issue with hydra not dequeueing certain jobs.
Note that I have always configured hydra as follows:
{ services.hydra.buildMachinesFiles = mkIf (config.nix.buildMachines == []) []; }
If you don't explicitly set services.hydra.buildMachinesFiles, the hydra NixOS module will default to /etc/nix/machines, which doesn't exist if you don't specify any nix.buildMachines. (Also see https://github.com/NixOS/hydra/pull/432).
In our case we don't set nix.buildMachines, meaning that services.hydra.buildMachinesFiles = []; I discovered today that in this case hydra will specify localhost as a default build machine. However, by default no supported system features like "kvm" or "nixos-test" are specified.
I suspect that the jobs that fail to dequeue have dependencies with certain requiredSystemFeatures which are not supported by the default localhost build machine.
The solution is to explicitly specify localhost as a build machine with the required system features:
{
  nix.buildMachines = [
    { hostName = "localhost";
      system = "x86_64-linux";
      supportedFeatures = ["kvm" "nixos-test" "big-parallel" "benchmark"];
      maxJobs = 8;
    }
  ];
}
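To illustrate why a localhost machine without any supported features would leave such steps waiting, here is a simplified sketch of the matching rule suspected above. It is not Hydra's actual code: Machine, Step and supportsStep are hypothetical stand-ins, and the only assumption is that a step can run on a machine when the system type matches and every required feature is among the machine's supported features.

```cpp
#include <algorithm>
#include <set>
#include <string>

// Hypothetical, simplified stand-in for a build machine.
struct Machine {
    std::set<std::string> systemTypes;
    std::set<std::string> supportedFeatures;
};

// Hypothetical, simplified stand-in for a build step (one derivation).
struct Step {
    std::string system;
    std::set<std::string> requiredSystemFeatures;
};

// A machine can take a step only if it has the step's system type and
// supports every feature the step requires (a subset check).
bool supportsStep(const Machine & machine, const Step & step)
{
    if (!machine.systemTypes.count(step.system)) return false;
    // std::includes needs sorted ranges; std::set iterates in sorted order.
    return std::includes(
        machine.supportedFeatures.begin(), machine.supportedFeatures.end(),
        step.requiredSystemFeatures.begin(), step.requiredSystemFeatures.end());
}
```

Under this rule, a step that requires "kvm" never matches a machine whose supported feature set is empty, while a step with no required features matches it. That is consistent with only certain jobs failing to dequeue.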
We should probably document this behaviour in the description of the services.hydra.buildMachinesFiles option so other users don't have to go to the trouble we had to go through.
Problem description
Our private hydra build server (with @basvandijk) has had its database cleaned and currently holds one project with one jobset containing one (fresh) job. This job is put in the queue (as seen in Status > Queue), but it is never actually built.
When looking at the job's Summary page, the Status shows "Scheduled to be built", part of evaluation 1. The build steps tab is empty. The build dependencies tab, though, shows a whole bunch of dependencies.
The Status > Latest steps page is empty, as is Status > Latest builds. Status > Running builds says there are no running builds. The machines in Machine status are shown as Idle.
Background
The issue started when hydra was updated a month or so ago. In a debugging session (together with @basvandijk), the version was upgraded to revision de55303197d997c4fc5, where the issue still occurs. Jobs just wouldn't build. Only when actually running nix-store --realize <drv path> does hydra mark the job as finished.
Somewhat related to this issue is #431, in which @basvandijk describes some error with a Raspberry Pi 3 build server. He thought the Raspberry Pi stuff might cause Hydra to act this weird, but the one job currently active has nothing to do with the Raspberry Pi, which should exclude it as a cause.
Debug information
With all lvlChat print messages changed to lvlInfo (because we're too lazy to figure out how to set verbose mode), the following is printed in the journal when the configuration is switched (and hydra is started):
Subsequently, this is what hydra tells us every now and then afterwards:
With some debug messages I figured out that hydra-queue-runner blocks on this line after mentioning that those 77 steps are "now runnable". From that point it never unblocks until the process is manually restarted.
So somehow the steps are marked as "runnable", but nothing is actually built. Restarting hydra doesn't change anything and just reproduces the exact same output as posted above.
Further debugging
I'm actually kind of lost on ideas as to what could cause this. I'll update the issue with more info as I continue, but where should I even look?
In my current setup it's easy to put debug prints in arbitrary places in the code of hydra, if necessary.