NixOS / infra

NixOS configurations for nixos.org and its servers
MIT License

EQM Hydra builders fail to set up their swap partition since the 24.05-pre update #356

Closed delroth closed 4 months ago

delroth commented 4 months ago

As reported and initially troubleshot by @k900 on #infra earlier today. I rebooted a builder with the serial console attached in the hope of getting more useful debugging info, and unfortunately all I got was this:

[FAILED] Failed to start add-disks-to-swap.service.
See 'systemctl status add-disks-to-swap.service' for details.

Since this is causing some builds to fail randomly, I suggest that if we don't find a quick fix (e.g. within 24h from now), we roll back to the latest working (23.05-based) version. Until then, deploying https://github.com/NixOS/equinix-metal-builders/pull/24 would help us get more useful logs and troubleshoot live.

vcunat commented 4 months ago

Those machines have /nix/store in RAM and do not GC, so once they fill it up, they will fail all builds, I believe.

mweinelt commented 4 months ago

systemd[1]: Starting add-disks-to-swap.service...
add-disks-to-swap-start[2707]: + '[' '!' -e /dev/md/spill.decrypted ']'
add-disks-to-swap-start[2707]: + /nix/store/nn0k22ka20i9r1jp7jg0rmphdcfi7v2y-kmod-31/bin/modprobe raid0
add-disks-to-swap-start[2707]: + echo 2
add-disks-to-swap-start[2707]: + /nix/store/c9rqpsd8b3l82z0fjz90n7x28nkc0240-util-linux-2.39.3-bin/bin/lsblk -d -e 1,7,11,230 -o PATH -n
add-disks-to-swap-start[2707]: + /nix/store/5idwbbv23b6vnqdicx97s3hsgrwwnj7j-coreutils-9.4/bin/cat disklist
add-disks-to-swap-start[2710]: /dev/sda
add-disks-to-swap-start[2710]: /dev/sdb
add-disks-to-swap-start[2710]: /dev/nvme0n1
add-disks-to-swap-start[2710]: /dev/nvme1n1
add-disks-to-swap-start[2711]: + /nix/store/5idwbbv23b6vnqdicx97s3hsgrwwnj7j-coreutils-9.4/bin/cat disklist
add-disks-to-swap-start[2714]: ++ /nix/store/5idwbbv23b6vnqdicx97s3hsgrwwnj7j-coreutils-9.4/bin/cat disklist
add-disks-to-swap-start[2715]: ++ /nix/store/i2g0wn70vhrplsq3k0b170cfxr2rhrbb-busybox-1.36.1/bin/wc -l
add-disks-to-swap-start[2712]: + /nix/store/4ajik70nplhkb8ndn3gqh7v0b09hmvg9-findutils-4.9.0/bin/xargs /nix/store/qviglm6izbcard2dfjanlyiv0v66zvmi-mdadm-4.2/bin/mdadm /dev/md/spill.decrypted --create --level=0 --force --raid-devices=4
add-disks-to-swap-start[2716]: mdadm: /dev/nvme0n1 appears to be part of a raid array:
add-disks-to-swap-start[2716]:        level=raid0 devices=2 ctime=Sat Feb 10 10:25:43 2024
add-disks-to-swap-start[2716]: mdadm: /dev/nvme1n1 appears to be part of a raid array:
add-disks-to-swap-start[2716]:        level=raid0 devices=2 ctime=Sat Feb 10 10:25:43 2024
add-disks-to-swap-start[2716]: Continue creating array? mdadm: create aborted.
systemd[1]: add-disks-to-swap.service: Main process exited, code=exited, status=123/n/a
systemd[1]: add-disks-to-swap.service: Failed with result 'exit-code'.
systemd[1]: Failed to start add-disks-to-swap.service.
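
Reading the trace above: mdadm finds old RAID superblocks on the NVMe disks and asks "Continue creating array?". Inside a systemd unit there is most likely nobody to answer, so the create is aborted, and xargs reports the failed command as exit status 123. The later trace in this thread shows the same command with --run added, which suppresses that confirmation. A minimal sketch of building such a non-interactive command line, assuming the disk list holds one device path per line as lsblk prints it. `build_create_cmd` is a hypothetical helper, not the builders' actual script, and it only prints the command instead of running it:

```shell
# Hypothetical helper (not from the equinix-metal-builders repo):
# print the mdadm command for a RAID0 array made of the device paths
# listed in the file "$1", one path per line.
build_create_cmd() {
  disklist=$1
  devs=""
  count=0
  # "|| [ -n "$dev" ]" keeps a last line that lacks a trailing newline.
  while IFS= read -r dev || [ -n "$dev" ]; do
    [ -n "$dev" ] || continue        # skip empty lines
    devs="$devs $dev"
    count=$((count + 1))
  done < "$disklist"
  # --run suppresses the "Continue creating array?" confirmation.
  printf 'mdadm /dev/md/spill.decrypted --create --level=0 --run --force --raid-devices=%s%s\n' \
    "$count" "$devs"
}

# Usage: lsblk -d -e 1,7,11,230 -o PATH -n > disklist; build_create_cmd disklist
```

Deriving --raid-devices from the same list that supplies the device paths keeps the two from drifting apart, which is what the `wc -l` in the trace appears to do.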

mweinelt commented 4 months ago

After enabling boot.swraid.enable in https://github.com/NixOS/equinix-metal-builders/commit/7515b04ba8a8e66f5b73eab9dccb343feb00caf7, the RAID array comes up on the second try.

systemd[1]: Starting add-disks-to-swap.service...
add-disks-to-swap-start[2302]: + '[' '!' -e /dev/md/spill.decrypted ']'
add-disks-to-swap-start[2302]: + /nix/store/nn0k22ka20i9r1jp7jg0rmphdcfi7v2y-kmod-31/bin/modprobe raid0
add-disks-to-swap-start[2302]: + echo 2
add-disks-to-swap-start[2302]: + /nix/store/c9rqpsd8b3l82z0fjz90n7x28nkc0240-util-linux-2.39.3-bin/bin/lsblk -d -e 1,7,11,230 -o PATH -n
add-disks-to-swap-start[2302]: + /nix/store/5idwbbv23b6vnqdicx97s3hsgrwwnj7j-coreutils-9.4/bin/cat disklist
add-disks-to-swap-start[2468]: /dev/nvme1n1
add-disks-to-swap-start[2468]: /dev/nvme0n1
add-disks-to-swap-start[2471]: + /nix/store/5idwbbv23b6vnqdicx97s3hsgrwwnj7j-coreutils-9.4/bin/cat disklist
add-disks-to-swap-start[2474]: ++ /nix/store/5idwbbv23b6vnqdicx97s3hsgrwwnj7j-coreutils-9.4/bin/cat disklist
add-disks-to-swap-start[2475]: ++ /nix/store/i2g0wn70vhrplsq3k0b170cfxr2rhrbb-busybox-1.36.1/bin/wc -l
add-disks-to-swap-start[2472]: + /nix/store/4ajik70nplhkb8ndn3gqh7v0b09hmvg9-findutils-4.9.0/bin/xargs /nix/store/qviglm6izbcard2dfjanlyiv0v66zvmi-mdadm-4.2/bin/mdadm /dev/md/spill.decrypted --create --level=0 --run --force --raid-devices=2
add-disks-to-swap-start[2476]: mdadm: /dev/nvme1n1 appears to be part of a raid array:
add-disks-to-swap-start[2476]:        level=raid0 devices=4 ctime=Sat Feb 10 12:14:40 2024
add-disks-to-swap-start[2476]: mdadm: super1.x cannot open /dev/nvme0n1: Device or resource busy
add-disks-to-swap-start[2476]: mdadm: /dev/nvme0n1 is not suitable for this array.
add-disks-to-swap-start[2476]: mdadm: create aborted
systemd[1]: add-disks-to-swap.service: Main process exited, code=exited, status=123/n/a
systemd[1]: add-disks-to-swap.service: Failed with result 'exit-code'.
systemd[1]: Failed to start add-disks-to-swap.service.
systemd[1]: add-disks-to-swap.service: Scheduled restart job, restart counter is at 1.
systemd[1]: Starting add-disks-to-swap.service...
add-disks-to-swap-start[2548]: + '[' '!' -e /dev/md/spill.decrypted ']'
add-disks-to-swap-start[2548]: + /nix/store/sjk5br9x1ljw85s6zjwp6jjxjkj2ky2p-cryptsetup-2.6.1-bin/bin/cryptsetup -c aes-xts-plain64 -d /dev/random create spill.encrypted /dev/md/spill.decrypted
add-disks-to-swap-start[2548]: + /nix/store/c9rqpsd8b3l82z0fjz90n7x28nkc0240-util-linux-2.39.3-bin/bin/mkswap /dev/mapper/spill.encrypted
add-disks-to-swap-start[2573]: Setting up swapspace version 1, size = 7.4 TiB (8161084829696 bytes)
add-disks-to-swap-start[2573]: no label, UUID=30ccced5-08ce-4f72-ba80-70c11c3f2936
add-disks-to-swap-start[2548]: + /nix/store/c9rqpsd8b3l82z0fjz90n7x28nkc0240-util-linux-2.39.3-bin/bin/swapon /dev/mapper/spill.encrypted
add-disks-to-swap-start[2585]: ++ /nix/store/c9rqpsd8b3l82z0fjz90n7x28nkc0240-util-linux-2.39.3-bin/bin/lsblk --noheadings --bytes --output SIZE /dev/mapper/spill.encrypted
add-disks-to-swap-start[2548]: + size=8161084833792
add-disks-to-swap-start[2586]: ++ /nix/store/dvvb6frpdnimidx1f51zjgi3af8rlny1-glibc-2.38-27-bin/bin/getconf PAGESIZE
add-disks-to-swap-start[2548]: + pagesize=4096
add-disks-to-swap-start[2548]: + inodes=1992452352
add-disks-to-swap-start[2548]: + /nix/store/c9rqpsd8b3l82z0fjz90n7x28nkc0240-util-linux-2.39.3-bin/bin/mount -o remount,size=8161084833792,nr_inodes=1992452352 /
systemd[1]: add-disks-to-swap.service: Deactivated successfully.
systemd[1]: Finished add-disks-to-swap.service.
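
The last steps of the trace grow the root filesystem to the size of the swap device. The logged numbers are consistent with nr_inodes being the device size divided by the page size, i.e. one inode per page. A small sketch reproducing that arithmetic from the logged values. The rule is inferred from the numbers in the trace, not read from the script source, and `inodes_for` is a hypothetical name:

```shell
# Hypothetical helper: inode count for a filesystem of "$1" bytes,
# assuming one inode per page of "$2" bytes (inferred from the trace).
inodes_for() {
  size=$1
  pagesize=$2
  echo $(( size / pagesize ))
}

# Values taken from the trace: lsblk size and getconf PAGESIZE.
inodes_for 8161084833792 4096   # prints 1992452352, matching inodes= in the trace
```

The size mkswap reports (8161084829696 bytes) is exactly 4096 bytes smaller than the device size from lsblk, which is consistent with one page being reserved for the swap header.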

This works, but leaves room for improvement.
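
One possible direction for such an improvement, sketched under assumptions: both traces show mdadm tripping over disks that still carry superblocks from an earlier array, and in the second trace one disk is busy, presumably because that old array had already been picked up. Since the swap is keyed from /dev/random on every boot, nothing on these disks needs to survive, so stale md metadata could be cleared before the create. `run` and `reset_md_members` are hypothetical helpers, not part of the builders' script, and the sketch only prints commands unless DRY_RUN=0 is set:

```shell
# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 executes it.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: drop stale md state from the given devices.
# DESTRUCTIVE when executed: old array metadata on those devices is erased.
reset_md_members() {
  run mdadm --stop --scan                 # stop arrays that are not in use
  for dev in "$@"; do
    run mdadm --zero-superblock "$dev"    # forget old array membership
  done
}

# Usage (dry run): reset_md_members /dev/nvme0n1 /dev/nvme1n1
```

On a disk without an md superblock, mdadm is expected to complain that it does not recognise the device, so a real script would have to tolerate that error. This is untested against the builders and is not what was deployed in this thread.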