crc-org / snc

Single Node Cluster creation scripts for OpenShift 4.x as used by CodeReady Containers
https://crc.dev
Apache License 2.0

Spike: topolvm CSI driver (supporting resize, limits) as replacement of kubevirt-hostpath-provisioner #854

Closed anjannath closed 3 months ago

anjannath commented 4 months ago

I was able to deploy the LVMS operator following the instructions from https://docs.openshift.com/container-platform/4.15/storage/persistent_storage/persistent_storage_local/persistent-storage-using-lvms.html

For testing this:

Currently in the OCP preset, the root partition has 12G of free space:

/dev/vda4        31G   20G   12G  64% /sysroot

We can shrink the root partition by a few GB and create another partition out of the reclaimed space, say /dev/vda4 at 26G and a new /dev/vda5 at 6G, which we can then set as the device/partition to be used by the LVMS operator.
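A hedged sketch of that re-partitioning with sgdisk (sizes and partition names illustrative; note that the root filesystem is xfs, which cannot be shrunk in place, so shrinking the filesystem itself is the hard part):

```shell
# Illustrative only: recreate vda4 at 26G and add vda5 in the freed space.
# This changes only the partition table; the xfs filesystem on vda4 would
# have to be shrunk first, which xfs does not support in place.
sgdisk --delete=4 /dev/vda
sgdisk --new=4:0:+26G --change-name=4:root /dev/vda
sgdisk --new=5:0:0 --change-name=5:pv-storage /dev/vda
partprobe /dev/vda
```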

With the partition created, we can apply the following manifests to deploy the LVMS operator:

apiVersion: lvm.topolvm.io/v1alpha1
kind: LVMCluster
metadata:
  name: my-lvmcluster
  namespace: openshift-storage
spec:
  storage:
    deviceClasses:
    - name: vg1
      fstype: xfs
      default: true
      deviceSelector:
        paths:
        - /dev/vda5
        forceWipeDevicesAndDestroyAllData: true
      thinPoolConfig:
        name: thin-pool-1
        sizePercent: 90
        overprovisionRatio: 10
---
apiVersion: v1
kind: Namespace
metadata:
  labels:
    openshift.io/cluster-monitoring: "true"
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
  name: openshift-storage
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-storage-operatorgroup
  namespace: openshift-storage
spec:
  targetNamespaces:
  - openshift-storage
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: lvms
  namespace: openshift-storage
spec:
  installPlanApproval: Automatic
  name: lvms-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
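
One possible way to apply these and gate on the operator install before creating the LVMCluster; the manifest filenames and the CSV label are assumptions on my part (the label follows the usual OLM `operators.coreos.com/<name>.<namespace>` convention), not something from this thread:

```shell
# Apply the namespace, operator group and subscription first
oc apply -f lvms-namespace.yaml -f lvms-operatorgroup.yaml -f lvms-subscription.yaml
# Wait for the operator's CSV to report Succeeded (label name is assumed)
oc wait csv -n openshift-storage \
  -l operators.coreos.com/lvms-operator.openshift-storage \
  --for=jsonpath='{.status.phase}'=Succeeded --timeout=300s
# Only then create the LVMCluster resource
oc apply -f lvmcluster.yaml
```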

And to verify it's working, we can apply:

apiVersion: v1
kind: Pod
metadata:
  name: testpod
spec:
  containers:
  - image: httpd
    name: testpod
    securityContext:
      capabilities:
        drop:
        - ALL
      runAsUser: 1001
      allowPrivilegeEscalation: false
    volumeMounts:
    - name: testtopo
      mountPath: /data
  volumes:
  - name: testtopo
    persistentVolumeClaim:
      claimName: lvm-file-1
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: "RuntimeDefault"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lvm-file-1
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: lvms-vg1
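
Once applied, the result can be checked with something like (namespace and names as in the manifests above):

```shell
oc get pvc lvm-file-1 -n default           # expect STATUS Bound, STORAGECLASS lvms-vg1
oc get pod testpod -n default              # expect Running
oc exec -n default testpod -- df -h /data  # should show the ~1Gi xfs volume mounted
```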

Resource consumption wise, deploying the LVMS operator takes ~450 MB more RAM (some of which will be recovered after removing the hostpath-provisioner).

Without the LVMS operator:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                2351m (40%)   0 (0%)
  memory             7857Mi (51%)  0 (0%)

With the LVMS operator deployed:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                2390m (41%)   0 (0%)
  memory             8337Mi (55%)  0 (0%)
  ephemeral-storage  0 (0%)        0 (0%)
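
For reference, the difference between the two memory request figures:

```shell
# 8337Mi (with LVMS) - 7857Mi (without) = 480Mi, close to the ~450 MB quoted above
awk 'BEGIN { print 8337 - 7857 }'
# prints 480
```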
praveenkumar commented 4 months ago

Another thing to keep in mind: as of now, everything (images, PVs, etc.) is part of the root partition for the OCP bundles, so when users use a bigger disk size they don't have to care whether that extra space is used for images or for PVs. But this is not the case with the microshift bundle, where we are using topolvm and users have to think, before expanding the disk, about whether that extra space is going to be used by PVs or by images. Also, the current hostpath-provisioner is not going to be deprecated, since it is extensively used on the kubevirt side. What do we want to achieve by changing it?

anjannath commented 4 months ago

Another thing to keep in mind: as of now, everything (images, PVs, etc.) is part of the root partition for the OCP bundles, so when users use a bigger disk size they don't have to care whether that extra space is used for images or for PVs. But this is not the case with the microshift bundle, where we are using topolvm and users have to think, before expanding the disk, about whether that extra space is going to be used by PVs or by images.

If we add topolvm for OpenShift as well, it will be the same for both presets, which I think is a good thing.

The current hostpath-provisioner is not going to be deprecated, since it is extensively used on the kubevirt side. What do we want to achieve by changing it?

The hostpath-provisioner doesn't support resize and limits; this is the main reason for the switch. I think it'd be good to give users the ability to experiment with these features.

praveenkumar commented 4 months ago

The hostpath-provisioner doesn't support resize and limits; this is the main reason for the switch. I think it'd be good to give users the ability to experiment with these features.

@anjannath do we have any issue where users asked for those features, or do we just think they will ask in the future?

anjannath commented 4 months ago

The hostpath-provisioner doesn't support resize and limits; this is the main reason for the switch. I think it'd be good to give users the ability to experiment with these features.

@anjannath do we have any issue where users asked for those features, or do we just think they will ask in the future?

There were questions on our Slack channel about the size of the PV and how to have smaller PVs, but no, I haven't seen GitHub issues for it.

praveenkumar commented 4 months ago

there were questions about the size of the PV and how to have smaller PVs

With the hostpath-provisioner, do we fix the size of the PVs? I thought users can define the required size and it would be created automatically?

anjannath commented 4 months ago

With the hostpath-provisioner, do we fix the size of the PVs? I thought users can define the required size and it would be created automatically?

No, it's a limitation of the hostpath-provisioner: since it just creates directories on the host, it has no mechanism to enforce the size; a PV simply takes as much free space as is available. See: https://github.com/kubevirt/hostpath-provisioner/issues/164#issuecomment-1413830124

praveenkumar commented 4 months ago

With the hostpath-provisioner, do we fix the size of the PVs? I thought users can define the required size and it would be created automatically?

No, it's a limitation of the hostpath-provisioner: since it just creates directories on the host, it has no mechanism to enforce the size; a PV simply takes as much free space as is available. See: kubevirt/hostpath-provisioner#164 (comment)

Thanks for sharing. So now we have to make a decision on the resource side, since you mentioned it takes around ~450 MB. The hostpath-provisioner doesn't set Mem/CPU requests, so even if we remove it, that won't reduce the requested resources. Since for 4.15 we are already increasing resources by around 1.5G, I am not sure we should increase by ~0.5G more:

  Namespace                                         Name                                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                                         ----                                                       ------------  ----------  ---------------  -------------  ---
  hostpath-provisioner                              csi-hostpathplugin-fs4s6                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d3h
anjannath commented 3 months ago

I've been trying to modify the partition table of the OCP bundle through Ignition to create a separate partition for use by topolvm, but it seems the disk gets re-partitioned on the second boot. Another idea that came up while talking to @gbraad was to use a second disk and not change the partitions on the existing disk.

Using the following Butane config, the disk gets partitioned during first boot (during install), but after reboot it gets overwritten:

disks:
    - device: /dev/vda
      wipe_table: true
      partitions:
      - number: 1
        label: BIOS-BOOT
        size_mib: 1
        start_mib: 0
        type_guid: 21686148-6449-6E6F-744E-656564454649
      - number: 2
        size_mib: 127
        start_mib: 0
        label: EFI-SYSTEM
        type_guid: C12A7328-F81F-11D2-BA4B-00A0C93EC93B
      - number: 3
        label: boot
        size_mib: 384
        start_mib: 0
      - number: 4
        label: root
        size_mib: 24000
        start_mib: 0
      - number: 5
        label: pv-storage
        start_mib: 0

I also tried adding a filesystems block, and then the VM doesn't even boot:

filesystems:
    - device: /dev/disk/by-partlabel/BIOS-BOOT
      wipe_filesystem: true
      format: none
    - device: /dev/disk/by-partlabel/EFI-SYSTEM
      wipe_filesystem: true
      format: vfat
    - path: /boot
      device: /dev/disk/by-partlabel/boot
      format: ext4
      wipe_filesystem: true
      with_mount_unit: true
    - path: /root
      device: /dev/disk/by-partlabel/root
      format: xfs
      wipe_filesystem: true
      with_mount_unit: true
    - device: /dev/disk/by-partlabel/pv-storage
      format: ext4
      wipe_filesystem: true
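For completeness, a Butane config like the above gets rendered to Ignition with the butane tool before being fed to the VM (the filenames here are placeholders):

```shell
# --strict turns unknown/ignored fields into errors, which helps catch typos
butane --strict --pretty config.bu > config.ign
```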
anjannath commented 3 months ago

I'm working on doing this in crc itself; I've made some progress that can be tested from this branch: https://github.com/anjannath/crc/tree/extradisk (only for macOS currently)

It currently does the following things:

  1. During VM creation (crc start), create a second disk image in the machine instance dir (named: crc-second-disk.img)
  2. During crc start, once the kube-apiserver is up, create the topolvm OperatorGroup, Subscription, and the openshift-storage namespace
  3. Once the installation of the operator succeeds, create the LVMCluster resource, which creates the LVM-based storage class that can be used in PVC definitions

After step 2 we have to wait ~2 minutes for the installation of the operator to succeed, and only after that does the LVMCluster custom resource become available for use. Therefore I think that if we install the operator during the snc phase and only create the LVMCluster resource during crc start, it would not increase the start time.
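
As an aside, the second disk image from step 1 is just a raw (sparse) file, so creating it can be sketched as follows; the 10G size is my assumption, not taken from the branch:

```shell
# Create a sparse raw disk image; blocks are only allocated as the guest writes
truncate -s 10G crc-second-disk.img
```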

cfergeau commented 3 months ago

2 minutes is very long; I agree that moving this logic to snc should help. It's not clear to me why all of it can't be done in snc, though? If it doesn't work from Ignition, it could still be done after the install, as part of all the tweaks we already apply to the cluster?

anjannath commented 3 months ago

Yes, we could do all of it in snc. What we need is a separate partition on the disk for the lvms/topolvm operator to use; when the re-partitioning attempt with Ignition failed and the idea of using a second disk came up, I focused on doing it in crc, because using a second disk requires changes to the libmachine drivers code.

But now that you mention it, if we don't use a second disk, then since we can modify the single disk image using guestfs tools, we can do this entirely in snc:

  1. extend the disk image by some amount (5 GB)
  2. shrink the root partition by some amount (from 31 GB to 26 GB, i.e. decrease by 5 GB)
  3. create a new partition after the root partition, which will be ~10 GB
  4. apply all topolvm-related manifests
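
Steps 1 and 3 can be sketched with qemu-img and guestfish (image name and sector offsets are illustrative); step 2 is the sticking point, as the later comments note, since an xfs filesystem cannot be shrunk:

```shell
# Step 1: grow the image file by 5G (image name is a placeholder)
qemu-img resize crc.qcow2 +5G
# Step 3: add a partition in the freed space via guestfish (offsets illustrative)
guestfish -a crc.qcow2 <<'EOF'
run
part-add /dev/sda p 54528000 -1
EOF
```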
cfergeau commented 3 months ago

For what it's worth, there is this enhancement open against microshift: https://github.com/openshift/enhancements/pull/1601 « MicroShift: Replacing upstream TopoLVM with a minified version of LVMS »

anjannath commented 3 months ago

From what I understand going through that enhancement doc, microshift is moving to the LVMS operator instead of the modified topolvm deployment that is there now.

But what is not clear to me is whether the minified version of LVMS (called microLVMS in the doc) is going to be a separate thing, or a new feature in the LVMS operator itself, with microLVMS then used in both OpenShift and microshift.

anjannath commented 3 months ago

shrink the root partition by some amount (from 31 GB to 26 GB so decrease by 5GB)

The root partition's filesystem is xfs, and guestfish needs the filesystem to be shrunk before the partition can be resized. I think doing everything in snc will not be possible, as I couldn't find an equivalent of resize2fs for the xfs filesystem.

cfergeau commented 3 months ago

shrink the root partition by some amount (from 31 GB to 26 GB so decrease by 5GB)

The root partition's filesystem is xfs, and guestfish needs the filesystem to be shrunk before the partition can be resized. I think doing everything in snc will not be possible, as I couldn't find an equivalent of resize2fs for the xfs filesystem.

You can grow xfs filesystems, but you cannot shrink them.
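
Indeed, xfs_growfs can only grow a filesystem; there is no shrink counterpart. The usual workaround is dump, recreate smaller, restore, sketched here with placeholder paths:

```shell
# Back up the filesystem, recreate it smaller, then restore the contents
xfsdump -l 0 -f /backup/root.dump /sysroot
# (shrink the partition here, then:)
mkfs.xfs -f /dev/vda4
mount /dev/vda4 /sysroot
xfsrestore -f /backup/root.dump /sysroot
```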

anjannath commented 3 months ago

To summarize: we are going to use the LVMS operator for dynamic PV provisioning in CRC. For this we need to:

  1. enhance the drivers code to create a second disk during VM creation (https://github.com/crc-org/crc/issues/4097)
  2. create the OperatorGroup and Subscription resources in the cluster during snc (https://github.com/crc-org/snc/issues/867)
  3. create the LVMCluster resource during crc start (https://github.com/crc-org/crc/issues/4097)