Hydra can take 5-10 mins to start working on queued jobs

mikepurvis commented 1 year ago

Describe the bug

I have a single-node Hydra instance, deployed from the flake as of https://github.com/NixOS/hydra/commit/e2756042b8e4397af642ee50eff50cf581df7f7b.

I use the API from other systems (Jenkins, etc) to automatically create and evaluate jobsets.

Jobsets show up in the /queue_summary page, but often take multiple minutes to start running. I don't believe this is evaluation time, as the CPU on the machine is 99% idle. Also, it evaluates locally in sub-60s.

The output on /queue-runner-status doesn't seem to indicate what's going on either, though admittedly I don't really know what I'm looking for.

Other tickets I found that seemed related spoke of infinite stall, which this is not— the builds do run eventually. And once it's going, it seems to chew through whatever is queued up, the issue mostly manifests if a job comes in and Hydra was previously idle.

The hydra config stanza is completely vanilla; the nix one does have some stuff in it that may be relevant:

  nix = {
    package = nix;
    extraOptions = ''
      experimental-features = nix-command flakes ca-derivations
      netrc-file = /etc/nix/netrc
      allowed-uris = ...
      flake-registry = ...
      pre-build-hook = ${./pre-build-hook.sh}
      post-build-hook = ${post-build-hook}
    '';
    settings = {
      substituters = [
        "https://cache.nixos.org"
        "https://ros.cachix.org"
      ];
      trusted-public-keys = ...

      # Should lead to 44 CPUs in use for builds, under maximum load.
      max-jobs = 11;
      cores = 4;
    };

    # Default is 0 (high). 7 is the lowest value.
    daemonIOSchedPriority = 7;
  };

The IO priority setting was because we were having nix-serve/harmonia time out trying to serve NARs when the machine was slammed. It's a 48-core machine, so this was intended to always "reserve" 4 cores for non-build purposes (web server, queue runner, serving NARs, etc).

Expected behavior

As soon as a build is queued, it runs.

Hydra Server:

Please fill out this data as well as you can, but don't worry if you can't -- just do your best.

OS and version: github:NixOS/nixpkgs/f677051b8dc0b5e2a9348941c99eea8c4b0ff28f
Version of Hydra: e2756042b8e4397af642ee50eff50cf581df7f7b:
Version of Nix Hydra is built against: nix-2.9.1
Version of the Nix daemon: nix-2.9.1

FYI @iwanders @zimbatm

leo60228 commented 1 year ago

I'm encountering similar behavior. Based on journalctl -u hydra-queue-runner, I think the extra time might come from substitutes.

SuperSandro2000 commented 1 year ago

@leo60228 I noticed the same. Always confuses me because it is not indicated anywhere in the UI.

NixOS / hydra

Hydra can take 5-10 mins to start working on queued jobs #1266