NixOS / hydra

Hydra, the Nix-based continuous build system
http://nixos.org/hydra
GNU General Public License v3.0
1.1k stars 291 forks source link

Hydra can take 5-10 mins to start working on queued jobs #1266

Open mikepurvis opened 1 year ago

mikepurvis commented 1 year ago

Describe the bug

I have a single-node Hydra instance, deployed from the flake as of https://github.com/NixOS/hydra/commit/e2756042b8e4397af642ee50eff50cf581df7f7b.

I use the API from other systems (Jenkins, etc) to automatically create and evaluate jobsets.

Jobsets show up in the /queue_summary page, but often take multiple minutes to start running. I don't believe this is evaluation time, as the CPU on the machine is 99% idle. Also, it evaluates locally in sub-60s.

The output on /queue-runner-status doesn't seem to indicate what's going on either, though admittedly I don't really know what I'm looking for.

Other tickets I found that seemed related spoke of infinite stall, which this is not— the builds do run eventually. And once it's going, it seems to chew through whatever is queued up, the issue mostly manifests if a job comes in and Hydra was previously idle.

The hydra config stanza is completely vanilla; the nix one does have some stuff in it that may be relevant:

  nix = {
    package = nix;
    extraOptions = ''
      experimental-features = nix-command flakes ca-derivations
      netrc-file = /etc/nix/netrc
      allowed-uris = ...
      flake-registry = ...
      pre-build-hook = ${./pre-build-hook.sh}
      post-build-hook = ${post-build-hook}
    '';
    settings = {
      substituters = [
        "https://cache.nixos.org"
        "https://ros.cachix.org"
      ];
      trusted-public-keys = ...

      # Should lead to 44 CPUs in use for builds, under maximum load.
      max-jobs = 11;
      cores = 4;
    };

    # Default is 0 (high). 7 is the lowest value.
    daemonIOSchedPriority = 7;
  };

The IO priority setting was because we were having nix-serve/harmonia time out trying to serve NARs when the machine was slammed. It's a 48-core machine, so this was intended to always "reserve" 4 cores for non-build purposes (web server, queue runner, serving NARs, etc).

Expected behavior

As soon as a build is queued, it runs.

Hydra Server:

Please fill out this data as well as you can, but don't worry if you can't -- just do your best.

FYI @iwanders @zimbatm

leo60228 commented 1 year ago

I'm encountering similar behavior. Based on journalctl -u hydra-queue-runner, I think the extra time might come from substitutes.

SuperSandro2000 commented 1 year ago

@leo60228 I noticed the same. Always confuses me because it is not indicated anywhere in the UI.