Repartition disks on Fedora Copr hypervisors #2869

Closed by praiskup 1 year ago

praiskup commented 1 year ago

This is likely blocking Power9 builds.

praiskup commented 1 year ago

Now root@vmhost-p08-copr01.rdu-cc.fedoraproject.org is down (not responding to SSH).

praiskup commented 1 year ago

Requested rights to power-cycle: https://pagure.io/fedora-infrastructure/issue/11476

praiskup commented 1 year ago

Hm, the machines are UP now (uptime 4 days), but they now fail to get IPv6.

praiskup commented 1 year ago

After another reboot, it started working.

praiskup commented 1 year ago

As in #2883, a note for the team: `sudo rbac-playbook groups/copr-hypervisor.yml -l '*p08*' -t trigger_reboot` can help here.

praiskup commented 1 year ago

Both p08 machines are down again :-( This is still being resolved in the same infra ticket, but I'm reopening to "admit" that there is still some issue.

nikromen commented 1 year ago

And they were down once again.

praiskup commented 1 year ago

We want to move the / partition onto its own RAID, and stop using the RAID6 for swap and disposable ephemeral images (RAID0 or LVM striping should be used for that).
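To make the LVM striping option concrete, here is a minimal dry-run sketch of how a striped swap volume could be set up. It only prints the commands by default. The disk paths, the volume group name `vg_ephemeral`, the 64G size, and the 64 KiB stripe size are all hypothetical placeholders, not the values used on the Copr hypervisors.

```shell
#!/bin/sh
# Dry-run sketch: print the commands that would build a striped LVM volume
# for swap. Device names, VG/LV names and sizes below are hypothetical.

# run: print the command when DRY_RUN=1 (the default), execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# striped_swap VG LV SIZE DISK...
striped_swap() {
    vg=$1; lv=$2; size=$3
    shift 3
    stripes=$#    # one stripe per physical disk passed in
    run pvcreate "$@"
    run vgcreate "$vg" "$@"
    # -i: number of stripes, -I: stripe size in KiB
    run lvcreate --type striped -i "$stripes" -I 64 -L "$size" -n "$lv" "$vg"
    run mkswap "/dev/$vg/$lv"
    run swapon "/dev/$vg/$lv"
}

# Hypothetical example: stripe swap across two spare disks.
striped_swap vg_ephemeral swap 64G /dev/sdc /dev/sdd
```

Setting `DRY_RUN=0` would actually execute the commands, so the device list should be double-checked first; striping offers no redundancy, which is acceptable only because swap and disposable images can be recreated.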

Tracking progress:

[x] vmhost-x86-copr01.rdu-cc.fedoraproject.org
[x] vmhost-x86-copr02.rdu-cc.fedoraproject.org
[x] vmhost-x86-copr03.rdu-cc.fedoraproject.org
[x] vmhost-x86-copr04.rdu-cc.fedoraproject.org
[x] vmhost-p08-copr01.rdu-cc.fedoraproject.org
[x] vmhost-p08-copr02.rdu-cc.fedoraproject.org
[x] vmhost-p09-copr01.rdu-cc.fedoraproject.org
[x] blogpost
[ ] ~~file a bug against qcow2 performance degradation?~~
[x] document iDRAC / p08 / p09 access for team

praiskup commented 1 year ago

https://nagios.fedoraproject.org/nagios/cgi-bin//status.cgi?hostgroup=copr_hypervisor&style=detail

praiskup commented 1 year ago

The blog post (which ended up rather long) is here: https://pavel.raiskup.cz/blog/fedora-copr-hypervisor-disk-repartitioning.html

praiskup commented 1 year ago

I spent at least an hour writing a good report against libvirt, and then I realized that the Power9 machine on Fedora 38 actually doesn't suffer from the same qcow2 issues (it is on SSD, but still). The other machines are slow, but also very old. While a disk slowdown with qcow2 after rhel6 to rhel0 doesn't make sense to me, I don't think I want to bother the libvirt folks (it is OK on newer hardware, and OK on older hardware too after moving to RAW). Giving up on the last sub-task.
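For readers wondering what "moving to RAW" can look like in practice, one option is converting an existing qcow2 image with `qemu-img`. This is only a dry-run sketch under assumptions: the thread does not say how the switch was done on the hypervisors, and the image path below is hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: print the qemu-img commands that would convert a qcow2
# image to RAW. The image path is hypothetical.

# run: print the command when DRY_RUN=1 (the default), execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# qcow2_to_raw SRC: convert SRC (*.qcow2) into a sibling *.raw file.
qcow2_to_raw() {
    src=$1
    dst="${src%.qcow2}.raw"    # strip the .qcow2 suffix, append .raw
    run qemu-img convert -p -f qcow2 -O raw "$src" "$dst"
    run qemu-img info "$dst"   # show the format and size of the result
}

# Hypothetical example image.
qcow2_to_raw /var/lib/libvirt/images/builder.qcow2
```

A RAW image has no copy-on-write layer or snapshots, so this trades qcow2 features for simpler I/O; whether that is worthwhile depends on the hardware, as the comment above suggests.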