No step created for build

Yesterday many builds on our Hydra got queued but didn't get build. The builds from later evaluations of the same jobsets did get build successfully.

I'm now trying to understand what went wrong in Hydra to cause this to happen. I haven't restarted the hydra-queue-runner yet to preserve as much state as possible.

Let's look at one of the builds that are still queued: build 558937. Note this is a top-level build; no other builds depend on this one.

Let's first look at the log:

...
Jan 21 22:39:34 hydra hydra-evaluator[9874]: added build 558937 (...)
...
Jan 21 22:39:35 hydra hydra-queue-runner[9894]: got notification: new builds added to the queue
Jan 21 22:39:35 hydra hydra-queue-runner[9894]: checking the queue for builds > 558932...
...
Jan 21 22:39:35 hydra hydra-queue-runner[9894]: loading build 558933 (...)
...
Jan 21 22:39:35 hydra hydra-queue-runner[9894]: loading build 558934 (...)
...
Jan 21 22:39:35 hydra hydra-queue-runner[9894]: loading build 559010 (...)
...
Jan 21 22:39:37 hydra hydra-queue-runner[9894]: got notification: new builds added to the queue
Jan 21 22:39:37 hydra hydra-queue-runner[9894]: checking the queue for builds > 559010...
...

So the evaluator correctly adds build 558937. Indeed it's also present in the DB:

hydra=> SELECT * FROM builds WHERE id = 558937;
-[ RECORD 1 ]------------+---------------------------------------------------------
id                       | 558937
finished                 | 0
timestamp                | 1579646374
project                  | ...
jobset                   | ...
job                      | ...
nixname                  | ...
description              |
drvpath                  | /nix/store/gvbqrw9jygizc7ixpc02fk8gc0zi30qj-....drv
system                   | x86_64-linux
license                  |
homepage                 |
maintainers              |
maxsilent                | 7200
timeout                  | 36000
ischannel                | 0
iscurrent                | 0
nixexprinput             | src
nixexprpath              | ci/ci.nix
priority                 | 100
globalpriority           | 0
starttime                |
stoptime                 |
iscachedbuild            |
buildstatus              |
size                     |
closuresize              |
releasename              |
keep                     | 0
notificationpendingsince |

The queue-runner is woken up and starts adding unfinished builds with an id > 558932. If I manually execute the query in getQueuedBuilds I get the following:

hydra=> select id from Builds 
        where id > 558932 
        and finished = 0 
        order by globalPriority desc, id;
   id
--------
 558935
 558936
 558937
 558938
 558939
 558940
...
 565890
 565891
 565892
(111 rows)

Note that build 558937 is included.

Now I would expect this build to be added to newIDs and newBuildsByID. The only reason this won't happen is if the queue-runner was started with --build-one (which it isn't) or if the build was already added to builds (which it isn't since this is the first run of getQueuedBuilds after the build got added by the evaluator).

Since I expect build to be in newIDs I expect this loop to iterate over it and apply the createBuild function to it.

createBuild in turn should log the message: loading build 558937 (...). Although I see other builds being loaded in the log, I strangely don't see this build being loaded.

So it seems my expectation is incorrect and createBuild(558937) is never called. Indeed, no steps have been created for this build:

hydra=> SELECT * FROM buildsteps WHERE build = 558937;
(0 rows)

The last part of the log is also interesting. The queue-runner starts loading builds > 558937 (i.e. loading build 559010) then it receives a notification that new builds have been added and calls getQueuedBuilds(lastBuildId = 559010)

After restarting the queu-runner all queued jobs are building again.

Any idea where the bug might be?

I've been trying to track down the same problem on our hydra. I've added some additional instrumentation to the queue-monitor and evaluator and what I've discovered is that there are two evaluations that complete at essentially the same time, so their entries into the Builds table are interleaved

hydra=> select id, project, jobset from Builds where id > 1139445 order by globalPriority desc, id limit 4;
   id   |  project   | jobset
--------+------------+----------
1139446 | proj1      | jobsetA
1139447 | proj1      | jobsetB
1139448 | proj1      | jobsetA
1139449 | proj1      | jobsetB

Then the hydra-queue-runner runs:

got notification of new builds
checking the queue for builds > 1139445
considering build 1139446
considering build 1139448
...

The considering message is one I've added, and it's in the loop at the top of getQueuedBuilds (portions replicated here):

{
    pqxx::work txn(conn);

    auto res = txn.exec_params
        ("select id, project, jobset, job ... from Builds "
          "where id > $1 and finished = 0 order by globalPriority desc, id",
          lastBuildId);

    for (auto const & row : res) {
        auto builds_(builds.lock());
        BuildID id = row["id"].as<BuildID>();
        if (buildOne && id != buildOne {
            printMsg(lvlTalkative, format("build %1% not one requested") % id);
            continue;
        }
        if (id > newLastBuildId) newLastBuildId = id;
        if (builds_->count(id)) {
            printMsg(lvlTalkative, format("build %1% count is nonzero") % id);
            continue;
        }
        ...
        printMsg(lvlTalkative, format("considering build %1%") % id);
        ...
    }

Neither of the continue-related prints is being emitted to the logs, so neither of these should be causing the builds to be skipped.

One theory is that one of the evaluators fails its transaction (all builds for a jobset are added within a transaction in the evaluator Perl code in hydra/src/script/hydra-eval-jobset) and has to re-run the transaction, and that the queue runner performs the above select before the transaction is successfully retried and completed (the queue runner runs in the following second after the evaluator adds are logged).

I don't actually think there is a relation to issue 366 (at least in its original form). Restarting the queue runner fixes this because on restart the queue runner's initial lastBuildId is 0 and so it collects these older builds. I've also had success with finding the oldest job in the web gui and performing "Cancel scheduled builds" and then "Retry cancelled builds" for that job... that usually seems to reset the lastBuildId used in the queue runner above.

I'm currently running a version of the hydra-queue-runner where I'm passing 0 instead of lastBuildId as the paameter value for the select clause above, under the assumption that the lastBuildId is mostly just an optimization for the DB lookup and not strictly necessary. I don't have an explicit reproduction method or frequency for the issue, so it's hard to know for sure if this fixes the problem, but I'll report back here in a week or so if I'm still seeing these unperformed builds.

An alternative way you could test this would be to set the following hydra service configuration:

    extraConfig = ''
        max_concurrent_evals = 1
    '';

However, if your hydra has as many jobsets to evaluate as ours does, that's not really viable.

If this theory regarding the source of this issue is correct, it's not clear to me that there is any better solution for managing the lastBuildId, or that it's even necessary to use this limiter, but I'd be interested in hearing other folks thoughts on possible alternatives or issues.

Update for this issue: since using 0 instead of lastBuildId in the select clause, there have been no "forgotten" jobs and no observable issues or noticeable performance degradations.

After the Feb21 update, the local hydra instance ran through the end of May without suffering abandoned builds in the Pending state. Upgrading to the most recent hydra sources and not adding this fix saw abandoned Pending builds within a day or two of operation.

Created https://github.com/NixOS/hydra/pull/776 to resolve this.

NixOS / hydra

No step created for build #706