RedHat-EMEA-SSA-Team / hetzner-ocp4

Installing OCP 4 on a single bare metal server.
Apache License 2.0

RFE: Increase resiliance for long-running clusters / optimize resource usage #259

Open ableischwitz opened 1 year ago

ableischwitz commented 1 year ago

The default configuration of the libvirt machines is not tuned for resource usage or disk optimization.

When running an OpenShift cluster on overprovisioned disks, one will want to make sure that sparse disks release/trim freed blocks, otherwise the VMs will get paused due to no space left on the device. There is also no need to include a graphical system in the VMs, as serial output has additional benefits such as the ability to scroll back to missed output.

Steps to be done:

This issue should be considered a draft for optimizations regarding limited resource usage on a single host.

ableischwitz commented 1 year ago

vm.xml.j2 needs adjustments:

Change disk-driver options:

   <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/{{ vm_instance_name }}.qcow2'/>
      <target dev='vda' bus='virtio'/>
   </disk>

should become:

   <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' discard='unmap' />
      <source file='/var/lib/libvirt/images/{{ vm_instance_name }}.qcow2'/>
      <target dev='vda' bus='virtio'/>
   </disk>
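If the change is in place, the guest should see a disk that accepts discard requests. A quick way to check from inside a node might look like this (a sketch; the device name vda matches the target above):

# inside the node, non-zero DISC-GRAN/DISC-MAX means discard/TRIM is supported
lsblk --discard /dev/vda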

Change machine-type:

  <os>
    <type arch="x86_64">hvm</type>
    <boot dev="hd"/>
  </os>

Change from the old pc-i440fx-* machine type to q35-* and also switch to UEFI instead of BIOS (this would allow secure boot later on).

  <os firmware="efi">
    <type arch="x86_64" machine="q35" >hvm</type>
    <boot dev="hd"/>
  </os>

  <devices>
    <controller type="pci" index="0" model="pcie-root"/>
    ...
  </devices>
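To confirm that a redefined domain actually picked up the new machine type and EFI firmware, something like this should work (the domain name is just an example):

# machine= should show a q35 variant, loader/nvram indicate EFI firmware
virsh dumpxml ocp4-compute-0 | grep -E 'machine=|loader|nvram'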

Remove:

<graphics type="vnc" port="-1"/>

Add:

<video>
    <model type='none'/>
</video>
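With VNC and the video device gone, boot output is available on the serial console instead, e.g. (domain name is an example):

# attach to the serial console; detach with ^]
virsh console ocp4-compute-0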

After switching the machine type to Q35 and enabling the discard driver option, I was able to run fstrim on a node:

% oc debug node/compute-0.compute.local
...
sh-4.4# chroot /host
sh-4.4# fstrim / -v
/: 108.9 GiB (116894867456 bytes) trimmed
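Instead of trimming manually, the periodic fstrim.timer from util-linux could be enabled on each node, assuming it is shipped with RHCOS (untested sketch; a MachineConfig would be the cleaner way to persist this):

# from a debug shell on the node
chroot /host
systemctl enable --now fstrim.timer
systemctl list-timers fstrim.timer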

The qcow2 image also reported a smaller disk size:

# qemu-img info -U /var/lib/libvirt/images/ocp4-compute-3.qcow2 
image: /var/lib/libvirt/images/ocp4-compute-0.qcow2
file format: qcow2
virtual size: 120 GiB (128849018880 bytes)
disk size: 10.3 GiB
cluster_size: 65536
backing file: /var/lib/libvirt/images/rhcos-4.10.3-x86_64-qemu.x86_64.qcow2
backing file format: qcow2
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false
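To cross-check the allocation on the host, comparing allocated blocks against the apparent (sparse) file size also works (path as above):

# allocated size vs. apparent size of the sparse image
du -h /var/lib/libvirt/images/ocp4-compute-0.qcow2
du -h --apparent-size /var/lib/libvirt/images/ocp4-compute-0.qcow2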
rbo commented 1 year ago

For testing purposes, I added your suggestions to the branch libvirt-xml-improvements.

rbo commented 1 year ago

Tested on RHEL 8 and RHEL 9; the vnc/video change is really great.

rbo commented 1 year ago

Branch libvirt-xml-improvements merged into devel.

ableischwitz commented 1 year ago

The following KubeletConfig would need to be applied to reduce image caching:

# from https://cloud.redhat.com/blog/image-garbage-collection-in-openshift
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: imgc-kubeconfig
spec:
  kubeletConfig:
    imageGCHighThresholdPercent: 66
    imageGCLowThresholdPercent: 50
    imageMinimumGCAge: "5m30s"
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
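Applying it could look like this (the file name is just an example); the Machine Config Operator then rolls the change out to the worker pool:

# apply the KubeletConfig and watch the worker pool pick it up
oc apply -f imgc-kubeconfig.yaml
oc get kubeletconfig imgc-kubeconfig
oc get mcp worker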
rbo commented 1 year ago

@ableischwitz why not just use a smaller disk? What's the reason to reduce caching of images?

ableischwitz commented 1 year ago

The reason is quite simple: a) 120G is documented as the minimum disk size, and b) disk space is limited on such setups. The default image cache is sized for setups that don't suffer from disk limitations.

Reducing that threshold (66% of 120G is still quite large) keeps the VM disks slim while leaving room to use more space when needed. If we instead reduced the size of the VM disks, there would be no easy way to start larger (or rather huge) workload images.
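Rough numbers on a 120G disk (my arithmetic, assuming the thresholds apply to the filesystem holding the images):

# imageGCHighThresholdPercent: 66 -> image GC kicks in above ~79 GiB used (0.66 * 120)
# imageGCLowThresholdPercent:  50 -> GC removes unused images until usage drops to ~60 GiB (0.50 * 120)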

rbo commented 1 year ago

I don't get it. If you don't have enough space for 3x 120GB disks, then use smaller disks. You cannot grow them either...

I would suggest adding this as a post-install step; if you'd like to automate it, you can write your own post-install add-on.