Open basvandijk opened 4 years ago
Seems related to: https://github.com/NixOS/hydra/issues/366.
I've been trying to track down the same problem on our Hydra. I added some instrumentation to the queue monitor and the evaluator, and found that two evaluations complete at essentially the same time, so their entries in the Builds table are interleaved:
hydra=> select id, project, jobset from Builds where id > 1139445 order by globalPriority desc, id limit 4;
   id    | project | jobset
---------+---------+---------
 1139446 | proj1   | jobsetA
 1139447 | proj1   | jobsetB
 1139448 | proj1   | jobsetA
 1139449 | proj1   | jobsetB
Then the hydra-queue-runner runs:
got notification of new builds
checking the queue for builds > 1139445
considering build 1139446
considering build 1139448
...
The "considering build" message is one I added; it sits in the loop at the top of getQueuedBuilds (portions replicated here):
{
    pqxx::work txn(conn);
    // Only rows above the remembered high-water mark are fetched.
    auto res = txn.exec_params
        ("select id, project, jobset, job ... from Builds "
         "where id > $1 and finished = 0 order by globalPriority desc, id",
         lastBuildId);
    for (auto const & row : res) {
        auto builds_(builds.lock());
        BuildID id = row["id"].as<BuildID>();
        if (buildOne && id != buildOne) {
            printMsg(lvlTalkative, format("build %1% not one requested") % id);
            continue;
        }
        // Advance the high-water mark used by the next query.
        if (id > newLastBuildId) newLastBuildId = id;
        if (builds_->count(id)) {
            printMsg(lvlTalkative, format("build %1% count is nonzero") % id);
            continue;
        }
        ...
        printMsg(lvlTalkative, format("considering build %1%") % id);
        ...
    }
}
Neither of the continue-related messages is emitted to the logs, so neither of those branches should be causing the builds to be skipped.
One theory is that one of the evaluators fails its transaction (all builds for a jobset are added within a single transaction in the evaluator Perl code in hydra/src/script/hydra-eval-jobset) and has to re-run it, and that the queue runner performs the above select before the transaction is successfully retried and completed (the queue runner runs in the second following the one in which the evaluator adds are logged).
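The suspected race can be sketched in isolation. This is a minimal model, not Hydra code: `Row`, its `committed` flag, `pollAndAdvance`, and `replayInterleaving` are hypothetical names standing in for the Builds table, transaction visibility, and the queue runner's poll. It assumes that ids are allocated before the owning transaction commits, which is what the theory requires but which I have not confirmed.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a row in the Builds table.
struct Row {
    int id;
    bool committed;  // false while the inserting transaction is still open
};

// Models one poll of "select ... where id > $1": returns the visible
// (committed) ids above lastBuildId and advances the high-water mark.
std::vector<int> pollAndAdvance(const std::vector<Row>& table, int& lastBuildId) {
    std::vector<int> seen;
    for (const Row& r : table)
        if (r.committed && r.id > lastBuildId) seen.push_back(r.id);
    for (int id : seen) lastBuildId = std::max(lastBuildId, id);
    return seen;
}

// Replays the interleaving from the query result above and returns the
// ids that no poll ever returned.
std::vector<int> replayInterleaving() {
    // jobsetA's transaction has committed; jobsetB's is still pending.
    std::vector<Row> table = {
        {1139446, true}, {1139447, false}, {1139448, true}, {1139449, false}};
    int lastBuildId = 1139445;

    std::vector<int> seen = pollAndAdvance(table, lastBuildId);
    assert(lastBuildId == 1139448);  // mark has moved past the pending 1139447

    // jobsetB's transaction commits after the first poll.
    for (Row& r : table) r.committed = true;
    std::vector<int> later = pollAndAdvance(table, lastBuildId);
    seen.insert(seen.end(), later.begin(), later.end());

    std::vector<int> missed;
    for (const Row& r : table)
        if (std::find(seen.begin(), seen.end(), r.id) == seen.end())
            missed.push_back(r.id);
    return missed;
}
```

Under these assumptions, `replayInterleaving()` reports build 1139447 as never picked up: it became visible only after the high-water mark had already moved to 1139448.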
I don't actually think there is a relation to issue 366 (at least in its original form). Restarting the queue runner fixes this because on restart the queue runner's initial lastBuildId is 0, so it collects these older builds. I've also had success with finding the oldest job in the web GUI and performing "Cancel scheduled builds" and then "Retry cancelled builds" for that job; that usually seems to reset the lastBuildId used in the queue runner above.
I'm currently running a version of the hydra-queue-runner where I pass 0 instead of lastBuildId as the parameter value for the select above, under the assumption that lastBuildId is mostly just an optimization for the DB lookup and not strictly necessary. I don't have an explicit reproduction method or a frequency for the issue, so it's hard to know for sure whether this fixes the problem, but I'll report back here in a week or so if I'm still seeing these unperformed builds.
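Why passing 0 should be safe can be sketched as follows. This is an illustrative model only: `collectNew`, `visibleIds`, and `known` are hypothetical names, with `known` playing the role of the in-memory builds map whose `builds_->count(id)` check appears in the loop above. The sketch assumes that check is the only thing needed to avoid loading a build twice.

```cpp
#include <set>
#include <vector>

// visibleIds: what the select returns when the lower bound is 0, i.e. every
// unfinished build that is currently visible.
// known: ids already loaded by earlier polls (stand-in for the builds map).
std::vector<int> collectNew(const std::vector<int>& visibleIds, std::set<int>& known) {
    std::vector<int> fresh;
    for (int id : visibleIds) {
        // insert().second is false when the id was already present,
        // mirroring the "count is nonzero" skip in the loop.
        if (known.insert(id).second) fresh.push_back(id);
    }
    return fresh;
}
```

With an empty `known` set, a first call with ids 1139446 and 1139448 returns both; a second call with all four ids returns only 1139447 and 1139449, so late-committing builds are picked up without reloading the earlier ones. The cost is that each poll scans all unfinished builds rather than only the newest.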
An alternative way you could test this would be to set the following hydra service configuration:
extraConfig = ''
  max_concurrent_evals = 1
'';
However, if your hydra has as many jobsets to evaluate as ours does, that's not really viable.
If this theory about the source of the issue is correct, it's not clear to me that there is a better way to manage lastBuildId, or that this limiter is even necessary, but I'd be interested in hearing other folks' thoughts on possible alternatives or issues.
Update for this issue: since using 0 instead of lastBuildId in the select, there have been no "forgotten" jobs, no observable issues, and no noticeable performance degradation.
After the Feb21 update, the local Hydra instance ran through the end of May without any builds abandoned in the Pending state. After upgrading to the most recent Hydra sources without this fix, abandoned Pending builds appeared within a day or two of operation.
Created https://github.com/NixOS/hydra/pull/776 to resolve this.
Yesterday many builds on our Hydra got queued but didn't get built. The builds from later evaluations of the same jobsets did get built successfully.
I'm now trying to understand what went wrong in Hydra to cause this. I haven't restarted the hydra-queue-runner yet, to preserve as much state as possible.
Let's look at one of the builds that are still queued: build 558937. Note this is a top-level build; no other builds depend on this one.
Let's first look at the log:
So the evaluator correctly adds build 558937. Indeed it's also present in the DB:
The queue-runner is woken up and starts adding unfinished builds with an id > 558932. If I manually execute the query in getQueuedBuilds I get the following:
Note that build 558937 is included.
Now I would expect this build to be added to newIDs and newBuildsByID. The only reason this won't happen is if the queue-runner was started with --build-one (which it isn't) or if the build was already added to builds (which it isn't, since this is the first run of getQueuedBuilds after the build got added by the evaluator).
Since I expect the build to be in newIDs, I expect this loop to iterate over it and apply the createBuild function to it. createBuild in turn should log the message: loading build 558937 (...). Although I see other builds being loaded in the log, I strangely don't see this build being loaded.
So it seems my expectation is incorrect and createBuild(558937) is never called. Indeed, no steps have been created for this build:
The last part of the log is also interesting. The queue-runner starts loading builds > 558937 (i.e. loading build 559010), then it receives a notification that new builds have been added and calls getQueuedBuilds(lastBuildId = 559010).
After restarting the queue-runner, all queued jobs are building again.
Any idea where the bug might be?