FPtje commented 7 years ago

Problem description

Our private hydra build server (with @basvandijk) has had its database cleaned and currently holds one project with one jobset containing one (fresh) job. This job is put inside the Queue (as seen in Status > Queue), but it is never actually built.

When looking at the job's Summary page, the Status shows Scheduled to be built, part of evaluation 1. The build steps tab is empty. The build dependencies tab, though, shows a whole bunch of dependencies.

The Status > Latest steps page is empty, as is the Status > Latest builds. The Status > Running builds says there are no running builds. The machines in Machine status are shown as Idle.

Background

The issue has started since hydra was updated a month or so ago. In a debugging session (together with @basvandijk), the version was upgraded to revision de55303197d997c4fc5, where the issue still occurs. Jobs just wouldn't build. Only when actually running nix-store --realize <drv path> does hydra mark the job as finished.

Somewhat related to this issue is #431, in which @basvandijk describes some error with a Raspberry Pi 3 build server. He thought the Raspberry Pi stuff might cause Hydra to act this weird, but the one job currently active has nothing to do with the Raspberry Pi, which should exclude it as a cause.

Debug information

With all lvlChat print messages changed to lvlInfo (because we're too lazy to figure out how to set verbose mode), the following is printed in the journal when the configuration is switched (and hydra is started):

hydra-queue-runner[9884]: adding new machine ‘root@rpibuild3’
hydra-queue-runner[9884]: checking the queue for builds > 0...
hydra-queue-runner[9884]: loading build 1 (lumi:staging:test-stalling-net)
nix-daemon[1629]: accepted connection from pid 9884, user hydra-queue-runner (trusted)
hydra-queue-runner[9884]: status: {"status":"up","time":1481904748,"uptime":0,"pid":9884,"nrQueuedBuilds":0,"nrUnfinishedSteps":49,"nrRunnableSteps":0,"nrActiveSteps":0,"nrStepsBuilding":0,"nrStepsCopyingTo":0,"nrStepsCopyingFrom":0,"nrStepsWaiting":0,"bytesSent":0,"bytesReceived":0,"nrBuildsRead":0,"buildReadTimeMs":0,"buildReadTimeAvgMs":0,"nrBuildsDone":0,"nrStepsStarted":0,"nrStepsDone":0,"nrRetries":0,"maxNrRetries":0,"nrQueueWakeups":0,"nrDispatcherWakeups":3,"dispatchTimeMs":0,"dispatchTimeAvgMs":0,"nrDbConnections":2,"nrActiveDbUpdates":0,"memoryTokensInUse":0,"machines":{"root@rpibuild3":{"enabled":true,"currentJobs":0,"idleSince":0,"nrStepsDone":0,"disabledUntil":0,"lastFailure":0,"consecutiveFailures":0}},"jobsets":{"lumi:staging":{"shareUsed":0,"seconds":0}},"machineTypes":{},"store":{"narInfoRead":0,"narInfoReadAverted":0,"narInfoMissing":0,"narInfoWrite":0,"narInfoCacheSize":0,"narRead":0,"narReadBytes":0,"narReadCompressedBytes":0,"narWrite":0,"narWriteAverted":0,"narWriteBytes":0,"narWriteCompressedBytes":0,"narWriteCompressionTimeMs":0,"narCompressionSavings":0,"narCompressionSpeed":0}}
lumi-hydra-setup-start[9877]: updating existing user `xxxxxxx'
systemd[1]: Started lumi-hydra-setup.service.
nixos[9650]: finished switching to system configuration /nix/store/5b8d70z8x8kqkbdp55hn747kz2yp7paz-nixos-system-hera-nixos-16.09pre-git_lumi-60950208ee3b2250f46839f8f69027b3234f17a7
hydra-server[9882]: [warn] Unicode::Encoding plugin is auto-applied, please remove this from your appclass and make sure to define "encoding" config
hydra-queue-runner[9884]: added build 1 (top-level step /nix/store/9kf5k2gi719ysbqx7bnz6fnsvkhkawnw-<our-derivation>.drv, 2869 new steps)
hydra-queue-runner[9884]: got 77 new runnable steps from 1 new builds
hydra-queue-runner[9884]: step ‘/nix/store/05dqd3kryhlsg0pw8c3xxmwnwvc8d7fc-bash43-029.drv’ is now runnable

... (similar message for the other 76 new runnable steps)

hydra-queue-runner[9884]: Finished getting queued builds
hydra-server[9882]: DEPRECATION WARNING: The Regex dispatch type is deprecated.
hydra-server[9882]:   It is recommended that you convert Regex and LocalRegex
hydra-server[9882]:   methods to Chained methods. at /nix/store/0npj59p4gdzssrpf6p03j1pjgfi5xi3w-hydra-perl-deps/lib/perl5/site_perl/5.22.2/Catalyst/DispatchType/Regex.pm line 210.
hydra-server[9882]: 2016/12/16-17:12:29 Starman::Server (type Net::Server::PreFork) starting! pid(9882)
hydra-server[9882]: Resolved [*]:3000 to [0.0.0.0]:3000, IPv4
hydra-server[9882]: Binding to TCP port 3000 on host 0.0.0.0 with IPv4
hydra-server[9882]: Setting gid to "122 122 122"
hydra-server[9882]: Starman: Accepting connections at http://*:3000/
hydra-evaluator[9879]: starting evaluation of jobset ‘lumi:staging’
hydra-evaluator[9879]: considering jobset lumi:staging (last checked 1s ago)
nix-daemon[1629]: accepted connection from pid 9925, user hydra
hydra-evaluator[9879]:   jobset is unchanged, skipping
hydra-evaluator[9879]: evaluation of jobset ‘lumi:staging’ succeeded

Subsequently, this is what hydra tells us every now and then afterwards:

hydra-evaluator[9879]: starting evaluation of jobset ‘lumi:staging’
hydra-evaluator[9879]: considering jobset lumi:staging (last checked 1s ago)
nix-daemon[1629]: accepted connection from pid 9955, user hydra
hydra-evaluator[9879]:   jobset is unchanged, skipping
hydra-evaluator[9879]: evaluation of jobset ‘lumi:staging’ succeeded

Dec 16 17:50:12 hera hydra-queue-runner[11345]: status: {"status":"up","time":1481907012,"uptime":300,"pid":11345,"nrQueuedBuilds":1,"nrUnfinishedSteps":2869,"nrRunnableSteps":77,"nrActiveSteps":0,"nrStepsBuilding":0,"nrStepsCopyingTo":0,"nrStepsCopyingFrom":0,"nrStepsWaiting":0,"bytesSent":0,"bytesReceived":0,"nrBuildsRead":1,"buildReadTimeMs":622,"buildReadTimeAvgMs":622,"nrBuildsDone":0,"nrStepsStarted":0,"nrStepsDone":0,"nrRetries":0,"maxNrRetries":0,"nrQueueWakeups":0,"nrDispatcherWakeups":74,"dispatchTimeMs":0,"dispatchTimeAvgMs":0,"nrDbConnections":2,"nrActiveDbUpdates":0,"memoryTokensInUse":0,"machines":{"root@rpibuild3":{"enabled":true,"currentJobs":0,"idleSince":0,"nrStepsDone":0,"disabledUntil":0,"lastFailure":0,"consecutiveFailures":0}},"jobsets":{"lumi:staging":{"shareUsed":0,"seconds":0}},"machineTypes":{"x86_64-linux:local":{"runnable":77,"running":0,"waitTime":23100,"lastActive":0}},"store":{"narInfoRead":0,"narInfoReadAverted":0,"narInfoMissing":0,"narInfoWrite":0,"narInfoCacheSize":0,"narRead":0,"narReadBytes":0,"narReadCompressedBytes":0,"narWrite":0,"narWriteAverted":0,"narWriteBytes":0,"narWriteCompressedBytes":0,"narWriteCompressionTimeMs":0,"narCompressionSavings":0,"narCompressionSpeed":0}}

With some debug messages I figured out that hydra-queue-runner blocks on this line after mentioning that those 77 steps are "now runnable". From that point it never unblocks until the process is manually restarted.

So somehow the steps are marked as "runnable", but nothing is actually built. Restarting hydra doesn't change anything and just reproduces the exact same output as posted above.

Further debugging

I'm actually kind of lost on ideas as to what could cause this. I'll update the issue with more info as I continue, but where should I even look?

In my current setup it's easy to put debug prints in arbitrary places in the code of hydra, if necessary.

FPtje commented 7 years ago

Update:

I took a look at these two lines and added the following debug prints:

printMsg(lvlError, format("Can this happen!?"));
if (!build->finishedInDB) { // FIXME: can this happen?
     printMsg(lvlError, format("YES, THIS CAN HAPPEN"));
     (*builds_)[build->id] = build;
}

2016-12-16-210408_673x65_scrot

I'm glad to be of service and answer the question in the comment, but I'm not yet sure what it means.

This is the contents of the builds table in in PostgreSQL, queried after the above message was observed:

-[ RECORD 1 ]--+-------------------------------------------------------------------------
id             | 1
finished       | 0
timestamp      | 1481897112
project        | lumi
jobset         | staging
job            | <job name>
nixname        | <our job's nix name>
description    | 
drvpath        | /nix/store/9kf5k2gi719ysbqx7bnz6fnsvkhkaqnw-<our-derivation>.drv
system         | x86_64-linux
license        | 
homepage       | 
maintainers    | 
maxsilent      | 7200
timeout        | 36000
ischannel      | 0
iscurrent      | 1
nixexprinput   | lumi
nixexprpath    | release.nix
priority       | 100
globalpriority | 0
starttime      | 
stoptime       | 
iscachedbuild  | 
buildstatus    | 
size           | 
closuresize    | 
releasename    | 
keep           | 0

domenkozar commented 7 years ago

@edolstra

phile314-fh commented 7 years ago

I see the exact same problem on our hydra instance. Hydra runs on and builds for a normal intel 64bit machine, no cross-compilation is involved. We are using Hydra version de55303197d997c4fc5503b52b1321ae9528583d .

cyraxjoe commented 7 years ago

I'm in a similar situation and currently I suspect that the problem is the nix-daemon that I'm using from the nix in the system (1.11.4) trying to evaluate the builds, @FPtje which version of nix are you using in the system?

FPtje commented 7 years ago

1.11.7

cyraxjoe commented 7 years ago

@FPtje the problem that I had, has a much simpler explanation. The hydra-queue-runner wasn't able to find the ssh executable and the jobs were never evaluated. The problem was introduced because I was migrating the service to systemd and the PATH is not inherited by default into the environment. This is in an Ubuntu 16.04 server.

basvandijk commented 7 years ago

Due to a tip from @expipiplus1 I managed to fix our long standing issue with hydra not dequeueing certain jobs.

Note that I have always configured hydra as follows:

{ services.hydra.buildMachinesFiles = mkIf (config.nix.buildMachines == []) []; }

If you don't explicitly set services.hydra.buildMachinesFiles the hydra NixOS module will default to /etc/nix/machines which doesn't exist if you don specify any nix.buildMachines. (Also see https://github.com/NixOS/hydra/pull/432).

In our case we don't set nix.buildMachines meaning that services.hydra.buildMachinesFiles = []; I discovered today that in this case hydra will specify localhost as a default build machine. However by default no supported system features like "kvm" or "nixos-test" are specified.

I suspect that the jobs that fail to dequeue have dependencies with certain requiredSystemFeatures which are not supported by the default localhost build machine.

The solution is to explicitly specify localhost as a build machine with the required system features:

{
  nix.buildMachines = [
    { hostName = "localhost";
      system = "x86_64-linux";
      supportedFeatures = ["kvm" "nixos-test" "big-parallel" "benchmark"];
      maxJobs = 8;
    }
  ];
}

We should probably document this behaviour in the description of the services.hydra.buildMachinesFiles option so other users don't have to go to the trouble we had to go through.

NixOS / hydra

Failure to actually run the builds in the queue #433

Problem description

Background

Debug information

Further debugging