fedora-copr / copr

RPM build system - upstream for https://copr.fedorainfracloud.org/
115 stars 61 forks source link

Builders often fail at startup because of missing swap #3053

Closed praiskup closed 7 months ago

praiskup commented 10 months ago


TASK [Make sure that we have 140G memory or SWAP for Mock caches] **************
Saturday 09 December 2023  21:11:15 +0000 (0:00:00.952)       0:01:11.132 ***** 
fatal: [50.17.25.239]: FAILED! => {
    "assertion": "ansible_memory_mb.swap.free + ansible_memory_mb.real.free >= 1024 * 140",
    "changed": false,
    "evaluated_to": false,
    "msg": "Not enough memory+swap, the builder is unusable: 0 MB SWAP and 6768 MB RAM"
}

PLAY RECAP *********************************************************************
50.17.25.239               : ok=16   changed=8    unreachable=0    failed=1    skipped=19   rescued=0    ignored=0   
FrostyX commented 10 months ago

This seems to be more complicated than we expected.

I found this in journalctl -u enable-swap:

systemd[1]: Started enable-swap.service - Enable swap on /dev/nvme0n1.
enable-swap.sh[719]: + set -e
enable-swap.sh[719]: + part_suffix=p
enable-swap.sh[719]: + swap_device=
enable-swap.sh[719]: + test -f /config/resalloc-vars.sh
enable-swap.sh[719]: + grep POWER9 /proc/cpuinfo
enable-swap.sh[719]: + test -e /dev/xvda1
enable-swap.sh[719]: + test -e /dev/nvme1n1
enable-swap.sh[719]: + swap_device=/dev/nvme1n1
enable-swap.sh[719]: + test -n /dev/nvme1n1
enable-swap.sh[719]: + systemctl unmask tmp.mount
enable-swap.sh[719]: + systemctl start tmp.mount
enable-swap.sh[827]: + echo 'n
enable-swap.sh[827]: p
enable-swap.sh[827]: +16G
enable-swap.sh[827]: n
enable-swap.sh[827]: p
enable-swap.sh[827]: 2
enable-swap.sh[827]: w
enable-swap.sh[827]: '
enable-swap.sh[828]: + fdisk /dev/nvme1n1
enable-swap.sh[828]: Welcome to fdisk (util-linux 2.38.1).
enable-swap.sh[828]: Changes will remain in memory only, until you decide to write them.
enable-swap.sh[828]: Be careful before using the write command.
enable-swap.sh[828]: Device does not contain a recognized partition table.
enable-swap.sh[828]: Created a new DOS disklabel with disk identifier 0x9afa9268.
enable-swap.sh[828]: Command (m for help): Partition type
enable-swap.sh[828]:    p   primary (0 primary, 0 extended, 4 free)
enable-swap.sh[828]:    e   extended (container for logical partitions)
enable-swap.sh[828]: Select (default p): Partition number (1-4, default 1): First sector (2048-335544319, default 2048): Last sector, +/-sectors or +/-size{K,M,G,T,P} (2048-335544319, default 3>
enable-swap.sh[828]: Created a new partition 1 of type 'Linux' and of size 16 GiB.
enable-swap.sh[828]: Command (m for help): Partition type
enable-swap.sh[828]:    p   primary (1 primary, 0 extended, 3 free)
enable-swap.sh[828]:    e   extended (container for logical partitions)
enable-swap.sh[828]: Select (default p): Partition number (2-4, default 2): First sector (33556480-335544319, default 33556480): Last sector, +/-sectors or +/-size{K,M,G,T,P} (33556480-33554431>
enable-swap.sh[828]: Created a new partition 2 of type 'Linux' and of size 144 GiB.
enable-swap.sh[828]: Command (m for help): The partition table has been altered.
enable-swap.sh[828]: Calling ioctl() to re-read partition table.
enable-swap.sh[828]: Syncing disks.
enable-swap.sh[719]: + mkfs.ext4 /dev/nvme1n1p1
enable-swap.sh[832]: mke2fs 1.46.5 (30-Dec-2021)
enable-swap.sh[832]: The file /dev/nvme1n1p1 does not exist and no size was specified.
systemd[1]: enable-swap.service: Main process exited, code=exited, status=1/FAILURE

Our guess with @praiskup is that we may need to run partprobe or while test -n /dev/nvme1n1p1; do sleep 1; done after fdisk in enable-swap.sh.

The problem is, that we copy the enable-swap.sh script, and enable-swap.service and bake them to the builder image. We also systemctl enable the service, so that it is run when booting the builder. As a consequence, there seems to be no way of fixing the enable-swap.sh without rebuilding the builder image.

This issue might need to be done as a part of #3036

FrostyX commented 10 months ago

This happened when I executed the script manually

[root@aws-aarch64-spot-dev-00063572-20231214-195213 ~]# enable-swap.sh
+ set -e
+ part_suffix=p
+ swap_device=
+ test -f /config/resalloc-vars.sh
+ grep POWER9 /proc/cpuinfo
+ test -e /dev/xvda1
+ test -e /dev/nvme1n1
+ swap_device=/dev/nvme1n1
+ test -n /dev/nvme1n1
+ systemctl unmask tmp.mount
+ systemctl start tmp.mount
+ echo 'n
p

+16G
n
p
2

w
'
+ fdisk /dev/nvme1n1

Welcome to fdisk (util-linux 2.38.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Command (m for help): All space for primary partitions is in use.

Command (m for help): Disk /dev/nvme1n1: 160 GiB, 171798691840 bytes, 335544320 sectors
Disk model: Amazon Elastic Block Store
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x9afa9268

Device         Boot    Start       End   Sectors  Size Id Type
/dev/nvme1n1p1          2048  33556479  33554432   16G 83 Linux
/dev/nvme1n1p2      33556480 335544319 301987840  144G 83 Linux

Command (m for help):

Command (m for help):

Command (m for help): +: unknown command

Command (m for help):
All space for primary partitions is in use.

Command (m for help): Disk /dev/nvme1n1: 160 GiB, 171798691840 bytes, 335544320 sectors
Disk model: Amazon Elastic Block Store
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x9afa9268

Device         Boot    Start       End   Sectors  Size Id Type
/dev/nvme1n1p1          2048  33556479  33554432   16G 83 Linux
/dev/nvme1n1p2      33556480 335544319 301987840  144G 83 Linux

Command (m for help): 2: unknown command

Command (m for help):

Command (m for help):

Command (m for help):
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.

+ mkfs.ext4 /dev/nvme1n1p1
mke2fs 1.46.5 (30-Dec-2021)
Creating filesystem with 4194304 4k blocks and 1048576 inodes
Filesystem UUID: 0df93fa3-49b1-4b4d-bf59-c4a1982f7828
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

+ mount /dev/nvme1n1p1 /var/lib/copr-rpmbuild
+ mkdir /var/lib/copr-rpmbuild/results
+ mkdir /var/lib/copr-rpmbuild/workspace
+ rpm --setperms copr-rpmbuild
+ rpm --setugids copr-rpmbuild
+ mkswap /dev/nvme1n1p2
Setting up swapspace version 1, size = 144 GiB (154617769984 bytes)
no label, UUID=d3991c50-de8e-42f0-a0bf-e5fb50c9b21a
+ swapon /dev/nvme1n1p2
[root@aws-aarch64-spot-dev-00063572-20231214-195213 ~]# echo $?
0
praiskup commented 7 months ago

Triage time: Should we check the swap as one of the first tasks?

praiskup commented 7 months ago

Related commits: https://pagure.io/fedora-infra/ansible/c/476e5454ec318ccf8ecbef204f4f4f898ec0e7e4 https://pagure.io/fedora-infra/ansible/c/f7a96ebb5b57b98ee0ae080b2319c02cfda19017

praiskup commented 7 months ago

It sometimes fails for non-aarch64 builders:

TASK [make sure the enable-swap service has already finished] ******************
Wednesday 28 February 2024  15:50:19 +0000 (0:00:01.333)       0:00:08.906 **** 
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (20 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (19 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (18 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (17 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (16 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (15 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (14 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (13 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (12 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (11 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (10 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (9 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (8 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (7 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (6 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (5 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (4 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (3 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (2 retries left).
FAILED - RETRYING: [2620:52:3:1:dead:beef:cafe:c1c0]: make sure the enable-swap service has already finished (1 retries left).
fatal: [2620:52:3:1:dead:beef:cafe:c1c0]: FAILED! => {"attempts": 20, "changed": true, "cmd": "systemctl is-active enable-swap || :", "delta": "0:00:00.014435", "end": "2024-02-28 15:50:54.939770", "msg": "", "rc": 0, "start": "2024-02-28 15:50:54.925335", "stderr": "", "stderr_lines": [], "stdout": "failed", "stdout_lines": ["failed"]}
praiskup commented 7 months ago

Debugging with: https://pagure.io/fedora-infra/ansible/c/fe9f0e6f751db2a499fe5bed515d34b1323490a4

        "Feb 28 16:09:52 localhost.localdomain systemd[1]: Started enable-swap.service - Enable swap on /dev/nvme0n1.",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[997]: + set -e",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[997]: + part_suffix=p",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[997]: + swap_device=",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[997]: + test -f /config/resalloc-vars.sh",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[997]: + generic_mount",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[997]: + for vol in /dev/vdb /dev/vdc /dev/vdd /dev/vda",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1000]: + grep /dev/vdb",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[999]: + mount",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1002]: ++ blockdev --getsize64 /dev/vdb",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[997]: + size=168000000000",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[997]: + test 168000000000 -le 150000000000",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[997]: + swap_device=/dev/vdb",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[997]: + part_suffix=",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[997]: + break",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[997]: + test -n /dev/vdb",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[997]: + systemctl unmask tmp.mount",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[997]: + systemctl start tmp.mount",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1047]: + echo 'n",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1047]: p",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1047]: +16G",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1047]: n",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1047]: p",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1047]: 2",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1047]: w",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1047]: '",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]: + fdisk /dev/vdb",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]: Welcome to fdisk (util-linux 2.39.3).",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]: Changes will remain in memory only, until you decide to write them.",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]: Be careful before using the write command.",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]: Device does not contain a recognized partition table.",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]: Created a new DOS (MBR) disklabel with disk identifier 0x0c9f8538.",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]: Command (m for help): Partition type",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]:    p   primary (0 primary, 0 extended, 4 free)",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]:    e   extended (container for logical partitions)",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]: Select (default p): Partition number (1-4, default 1): First sector (2048-328124999, default 2048): Last sector, +/-sectors or +/-size{K,M,G,T,P} (2048-328124999, default 328124999):",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]: Created a new partition 1 of type 'Linux' and of size 16 GiB.",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]: Command (m for help): Partition type",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]:    p   primary (1 primary, 0 extended, 3 free)",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]:    e   extended (container for logical partitions)",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]: Select (default p): Partition number (2-4, default 2): First sector (33556480-328124999, default 33556480): Last sector, +/-sectors or +/-size{K,M,G,T,P} (33556480-328124999, default 328124999):",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]: Created a new partition 2 of type 'Linux' and of size 140.5 GiB.",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]: Command (m for help): The partition table has been altered.",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]: Calling ioctl() to re-read partition table.",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1048]: Syncing disks.",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[997]: + mkfs.ext4 /dev/vdb1",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1051]: mke2fs 1.47.0 (5-Feb-2023)",
        "Feb 28 16:09:52 localhost.localdomain enable-swap.sh[1051]: The file /dev/vdb1 does not exist and no size was specified.",
        "Feb 28 16:09:52 localhost.localdomain systemd[1]: enable-swap.service: Main process exited, code=exited, status=1/FAILURE",
        "Feb 28 16:09:52 localhost.localdomain systemd[1]: enable-swap.service: Failed with result 'exit-code'."
praiskup commented 7 months ago

But fails very fast now at least:

Wednesday 28 February 2024  16:10:09 +0000 (0:00:00.049)       0:00:11.263 **** 
=============================================================================== 
put infra stg repos into yum.repos.d if staging ------------------------- 3.07s
setup the hostname so we can easily identify the box -------------------- 2.90s
Activate Red Hat Subscription ------------------------------------------- 1.58s
debug the swap failure -------------------------------------------------- 0.89s
make sure the enable-swap service has already finished ------------------ 0.88s
include_tasks ----------------------------------------------------------- 0.80s
define helper facts ----------------------------------------------------- 0.10s
drop extremely slow libvirt resolver ------------------------------------ 0.08s
put infra repos into yum.repos.d ---------------------------------------- 0.06s
disable zram SWAP on builders, it is too small, issue 2077 -------------- 0.06s
set lower metadata expire time to enforce download ---------------------- 0.05s
print out the output ---------------------------------------------------- 0.05s
disable hcn-init service on ppc64le which implies NetworkManager-wait-online --- 0.03s
clean dnf cache before checking for updated packages -------------------- 0.03s
install the subscription-manager script --------------------------------- 0.03s
install subscription-manager -------------------------------------------- 0.03s
fallback to the legacy crypto policies ---------------------------------- 0.03s
disable updates-testing, could be restored after update ----------------- 0.03s
disable updates-testing ------------------------------------------------- 0.02s
update the system ------------------------------------------------------- 0.02s
praiskup commented 7 months ago

I applied those two before we forget again: https://pagure.io/fedora-infra/ansible/c/711b6bb8c63797a6a1eb712e19aa7ae73beb1356 https://pagure.io/fedora-infra/ansible/c/053b703e4b68f7c8f30deb1d41995fd018f54366

I don't think we can do anything else than wait for the next image rebuild now. Closing.