Open sternenseemann opened 1 year ago
You can admire the workaround we had to resort to due to this issue here: https://github.com/NixOS/nixpkgs/pull/253170.
It seems like /builds currently runs O(N)
SQL queries because of a missing prefetch for jobset. That probably worked ok by accident when psql was hosted on the same machine as Hydra, but now that they're not even in the same datacenter that latency makes the issue obvious. I'll need to learn DBIx to fix this, but it seems trivial.
https://github.com/NixOS/hydra/pull/1335 I'll apply this patch to prod Hydra to give it a shot.
Note that it currently works again. It seems that was a temporary (O(days)) period of increased load on Hydra.
FWIW the eval you listed as example in your original message (1798805) still times out for me. With the temporarily deployed Hydra patch (now rolled back), it took 2min33 instead.
The
/eval/<id>/builds
endpoint (currently undocumented) has gotten massively slower, probably in a recent deploy ofrhea
(https://github.com/NixOS/nixos-org-configurations/commit/d745c2f5f72ad1dd2128e7b7d35e30d4039d2281). If that turns out to be true, it could either be due to the Hydra or the channel update. Hydra doesn't seem to have any interesting changes in the diff.The endpoint still works if the jobset is sufficiently small (e.g.
nix:master
):However, for a jobset with more builds per eval, this endpoint appears to time out: Requesting the endpoint for evals of the
nixpkgs:haskell-updates
(even those with ~27K builds) will result in one of the two things after exactly 15min:curl --http1.1 -v -H 'Accept: application/json' https://hydra.nixos.org/eval/1798805/builds
), an error 404 is received, but the response takes exactly 15min, so it seems like some sort of timeout.curl -v -H 'Accept: application/json' https://hydra.nixos.org/eval/1798805/builds
) , some kind of OpenSSL failure is encountered after 15min.HTTP/2 log
``` * Connected to hydra.nixos.org (2a01:4f8:162:71eb::) port 443 (#0) * ALPN: offers h2,http/1.1 } [5 bytes data] * TLSv1.3 (OUT), TLS handshake, Client hello (1): } [512 bytes data] * TLSv1.3 (IN), TLS handshake, Server hello (2): { [122 bytes data] * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): { [19 bytes data] * TLSv1.3 (IN), TLS handshake, Certificate (11): { [3769 bytes data] * TLSv1.3 (IN), TLS handshake, CERT verify (15): { [79 bytes data] * TLSv1.3 (IN), TLS handshake, Finished (20): { [52 bytes data] * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Finished (20): } [52 bytes data] * SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384 * ALPN: server accepted h2 * Server certificate: * subject: CN=hydra.nixos.org * start date: Jul 19 15:23:10 2023 GMT * expire date: Oct 17 15:23:09 2023 GMT * subjectAltName: host "hydra.nixos.org" matched cert's "hydra.nixos.org" * issuer: C=US; O=Let's Encrypt; CN=R3 * SSL certificate verify ok. } [5 bytes data] * using HTTP/2 * h2 [:method: GET] * h2 [:scheme: https] * h2 [:authority: hydra.nixos.org] * h2 [:path: /eval/1798805/builds] * h2 [user-agent: curl/8.1.2] * h2 [accept: application/json] * Using Stream ID: 1 (easy handle 0x46af80) } [5 bytes data] > GET /eval/1798805/builds HTTP/2 > Host: hydra.nixos.org > User-Agent: curl/8.1.2 > Accept: application/json > { [5 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [281 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [281 bytes data] * old SSL session ID is stale, removing { [5 bytes data] 0 0 0 0 0 0 0 0 --:--:-- 0:13:28 --:--:-- 0* received GOAWAY, error=0, last_stream=2147483647 0 0 0 0 0 0 0 0 --:--:-- 0:14:59 --:--:-- 0{ [5 bytes data] * TLSv1.3 (OUT), TLS alert, protocol version (582): } [2 bytes data] * OpenSSL SSL_read: OpenSSL/3.0.9: error:0A00010B:SSL routines::wrong version number, errno 0 * Failed receiving HTTP2 data: 56(Failure when receiving data from the peer) 0 0 0 0 0 0 0 0 --:--:-- 0:15:00 --:--:-- 0 * Closing connection 0 curl: (56) OpenSSL SSL_read: OpenSSL/3.0.9: error:0A00010B:SSL routines::wrong version number, errno 0 ```This started happening on 2023-08-21, somtime after 14:02 CEST which you can check by looking at these action runs which request the endpoint.
That script uses a 10min timeout on the request, so we know that before that time this endpoint would respond in under 10min. If I remember correctly, usually it'd be under 5min.
The endpoint is also used for automatically updating metadata for
haskellPackages
in nixpkgs, so we are blocked on that.