IBM / ibm-spectrum-scale-csi

The IBM Spectrum Scale Container Storage Interface (CSI) project enables container orchestrators, such as Kubernetes and OpenShift, to manage the life-cycle of persistent storage.
Apache License 2.0

Base the default inodeLimit for storage classes on fixed inodes/GiB ratio rather than file system block size #414

Closed gfschmidt closed 2 years ago

gfschmidt commented 3 years ago

Is your feature request related to a problem? Please describe.

While deploying IBM Cloud Pak for Data v3.5.2 (CP4D) with CNSA v5.1.0.3 and CSI v2.1.0 on OpenShift 4.5 and 4.6 in different environments, we observed circumstances where the CP4D control plane ("lite" assembly) failed to install because a PVC (of one of the subcomponents) ran out of inodes:

# oc logs zen-pre-requisite-job-sxssv
cp: cannot create regular file '/user-home/_global_/tmp/./cacerts': No space left on device

The PVC was backed by an independent fileset created from the following storage class:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ibm-spectrum-scale-sc
provisioner: spectrumscale.csi.ibm.com
parameters:
  volBackendFs: "fs1"
  clusterId: "215057217487177715"
reclaimPolicy: Delete

Without an explicit inodeLimit defined in the storage class, the default number of inodes for a PV created from the storage class is dynamically calculated as volume size / file system block size (see "Storage class" in the v2.1.0 documentation). This is a change from CSI v2.0.0, where the default was a static inodeLimit of 1 million.

So with the same storage class the IBM Cloud Pak for Data deployment failed in one environment but succeeded in another. The reason was related to the dynamic default inodeLimit calculation and the different block size of the underlying Spectrum Scale file system used for CSI:

# mmlsfs ess3000_1M -B
flag                value                    description
------------------- ------------------------ -----------------------------------
 -B                 1048576                  Block size

The installation of the CP4D control plane ("lite" assembly) created the following PVs of 1 GiB and 10 GiB from the storage class:

# oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM
pvc-048b209e-851f-49f7-9482-a5086daab6b0   10Gi       RWO            Delete           Bound    zen/datadir-zen-metastoredb-0                                 
pvc-4fcff6ae-94bc-46e3-b660-e43432dfb941   10Gi       RWO            Delete           Bound    zen/datadir-zen-metastoredb-2                                 
pvc-6748d517-e0de-4543-b2c5-c9c45d00a271   1Gi        RWX            Delete           Bound    zen/cpd-install-shared-pvc                                    
pvc-78ab6afa-8f29-463d-965b-79967c9ec2cf   10Gi       RWX            Delete           Bound    zen/user-home-pvc                                             
pvc-a6851d97-88ac-4eb6-a889-a92a57ce53af   10Gi       RWO            Delete           Bound    zen/datadir-zen-metastoredb-1                                 
pvc-cea71faf-0586-4744-b27d-d67d1a10cd35   1Gi        RWX            Delete           Bound    zen/cpd-install-operator-pvc                                  
pvc-f59ed123-3ced-4864-9f3d-8587e73b467b   10Gi       RWX            Delete           Bound    zen/influxdb-pvc   

The PVs were backed by the following independent filesets in the IBM Spectrum Scale file system:

# mmlsfileset ess3000_1M -L
Filesets in file system 'ess3000_1M':
Name                            Id      RootInode  ParentId Created                      InodeSpace      MaxInodes    AllocInodes
root                             0              3        -- Mon May 11 20:19:22 2020        0             15490304         500736
spectrum-scale-csi-volume-store  1         524291         0 Thu Apr 15 00:15:25 2021        1              1048576          52224
pvc-cea71faf-0586-4744-b27d-d67d1a10cd35 2 1048579        0 Mon Apr 19 10:23:02 2021        2                 1024           1024
pvc-6748d517-e0de-4543-b2c5-c9c45d00a271 3 1572867        0 Mon Apr 19 10:23:05 2021        3                 1024           1024
pvc-78ab6afa-8f29-463d-965b-79967c9ec2cf 4 2097155        0 Mon Apr 19 10:26:03 2021        4                10240          10240
pvc-f59ed123-3ced-4864-9f3d-8587e73b467b 5 2621443        0 Mon Apr 19 10:26:06 2021        5                10240          10240
pvc-048b209e-851f-49f7-9482-a5086daab6b0 6 3145731        0 Mon Apr 19 10:26:09 2021        6                10240          10240
pvc-a6851d97-88ac-4eb6-a889-a92a57ce53af 7 3670019        0 Mon Apr 19 10:26:12 2021        7                10240          10240
pvc-4fcff6ae-94bc-46e3-b660-e43432dfb941 8 4194307        0 Mon Apr 19 10:26:14 2021        8                10240          10240

The inode limits of these PVs and their related independent filesets can indeed be calculated from the above formula and confirmed with the output from the mmlsfileset -L command:

File system block size = 1 MiB
Volume sizes (control plane / "lite"):
   1 GiB  =  1024 MiB / 1 MiB = 1024 inodes
   10 GiB = 10240 MiB / 1 MiB = 10240 inodes

Here the deployment succeeded because the 10 GiB PVs received an inode limit of 10240, as the underlying Spectrum Scale file system was using a block size of only 1 MiB. In the other environment the Spectrum Scale default block size of 4 MiB was used, so the number of inodes for the same PVs was reduced to one quarter (e.g. 2560 inodes for a 10 GiB PV), which was too small to hold the number of files required by the deployment.
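
To make the effect easy to reproduce, here is a minimal sketch (Python, not part of the CSI driver) of the dynamic default calculation described above, applied to the PV sizes and the two block sizes from this issue:

# Minimal sketch of the v2.1.0 dynamic default inodeLimit
# (PV capacity / file system block size); values taken from this issue.

MIB = 1024 * 1024

def default_inode_limit(pv_size_bytes, fs_block_size_bytes):
    # default used when no inodeLimit is set in the storage class
    return pv_size_bytes // fs_block_size_bytes

for pv_gib in (1, 10):
    for block_mib in (1, 4):
        limit = default_inode_limit(pv_gib * 1024 * MIB, block_mib * MIB)
        print(f"{pv_gib:>2} GiB PV, {block_mib} MiB block size -> {limit} inodes")

# Output:
#  1 GiB PV, 1 MiB block size -> 1024 inodes
#  1 GiB PV, 4 MiB block size -> 256 inodes
# 10 GiB PV, 1 MiB block size -> 10240 inodes
# 10 GiB PV, 4 MiB block size -> 2560 inodes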

Describe the solution you'd like

Using a dynamic inode limit based on (PV size / file system block size) assumes an average file size equal to the IBM Spectrum Scale file system block size. It does not take into account that IBM Spectrum Scale can still efficiently store many smaller files using subblocks, i.e. the maximum number of files in an IBM Spectrum Scale file system is limited by the overall capacity divided by the inode size (e.g. 4k) plus the subblock size (2k...16k depending on the block size). Furthermore, the OpenShift user (or even the OpenShift admin) does not know the block size of the underlying IBM Spectrum Scale file system, so a calculation that depends on the block size is not transparent to an OpenShift user: in one case a 10 GiB PV allows 10240 files, in another case only 2560 files, even though a similar YAML for the storage class is used. This approach lacks reproducibility from an OpenShift perspective, where the block size is not known.
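
As a rough illustration of why the block size alone does not cap the number of files, the following sketch computes the upper bound mentioned above (capacity divided by inode size plus subblock size). The 4 KiB inode size and the 8 KiB subblock size (for a 4 MiB block size) are assumptions for this example, and in-inode storage of very small files is ignored:

# Rough upper bound on the number of files a 10 GiB fileset could hold,
# using the capacity / (inode size + subblock size) reasoning from above.
# Assumed values: 4 KiB inodes, 8 KiB subblocks (4 MiB block size).

KIB = 1024
GIB = 1024 ** 3

def max_files_estimate(capacity_bytes, inode_size=4 * KIB, subblock_size=8 * KIB):
    return capacity_bytes // (inode_size + subblock_size)

print(max_files_estimate(10 * GIB))  # ~873813 files, vs. the 2560-inode
                                     # default that a 4 MiB block size yields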

In the example above we used a file system block size of 1 MiB, while the default block size in IBM Spectrum Scale is 4 MiB (and can go up to 16 MiB). With the default block size of 4 MiB we could only store 256 files in the 1 GiB volumes, or as few as 64 with the maximum file system block size of 16 MiB. We may even assume that intentional PVC requests for small PVs often align with the need to work with many smaller files (<1 MiB) rather than larger files (>=1 MiB).

We see that the applied inode limit (= number of files that can be created) for a fileset backing a PV is unexpectedly low with common IBM Spectrum Scale block sizes of 4 MiB (the default) or higher, especially for small volumes. In this example, with an automated deployment of applications like CP4D, the user does not even have a chance to specify the capacity of the required PVs, as these are created automatically based on the application's storage capacity requirements.

An OpenShift admin can either go with a static inode limit (e.g. 1 million inodes), i.e. a fixed inodeLimit defined in the storage class for all PVs (created from this storage class) independent of their size, or with a dynamic inode limit based on the default calculation (PV size / file system block size).

When using the latter, I would propose not to base the number of inodes on the file system block size. The file system block size does not actually limit the number of files in the fileset, as the formula suggests, nor is a calculation based on a parameter unknown in OpenShift (i.e. the block size) comprehensible to an OpenShift user. The OpenShift user does not know the file system block size and may wonder about different outcomes regarding the inode limits for similar storage classes and PV sizes in different environments or on different Spectrum Scale file systems.

One benefit of using storage classes with independent filesets is that independent filesets provide an independent inode space which is not shared with the root file system. So there is not necessarily a need to enforce a scarce inode calculation here.

To reduce "harm" to clients who work with the defaults (i.e. running out of "space" = inodes on almost empty volumes, e.g. with no static inodeLimit defined in the storage class and a 4 MiB default block size in IBM Spectrum Scale), I would suggest taking the unknown file system block size out of the picture and basing the default inodeLimit calculation on parameters that an OpenShift user can understand and reproduce from an OpenShift perspective, e.g. a fixed number of inodes per GiB of storage capacity, by defining a dynamically calculated

default inodeLimit = 8 x [PV capacity in MiB] / (1 MiB)

for each PV of the storage class.

This would assume an average file size of 128 KiB and introduce a linear relation between inodes and PV capacity, e.g. 8192 files per GiB, which can easily be understood from an OpenShift user perspective and documented accordingly.

So a user could easily calculate the required size of the PVC by taking the anticipated amount of data (required storage capacity) and the anticipated number of files (required number of inodes) into account without having an unknown parameter like the file system block size in the picture (which is actually not even needed here as it alone does not limit the maximum number of files).

We could even introduce a variable inodeMultiplier as an additional parameter in the storage class to adapt more specifically to customer needs, if a customer wants to store smaller or larger files than assumed by the default above, e.g.

default inodeLimit = [inodeMultiplier]  x [PV capacity in MiB] / (1 MiB)

with a default of inodeMultiplier = 8.

A customer could even define multiple storage classes backed by the same file system with different multipliers to accommodate the need for smaller PVs with many files (storage class A) and larger PVs with fewer files (storage class B).
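
A minimal sketch of the proposed calculation (purely illustrative; inodeMultiplier is a proposed parameter and does not exist in the CSI driver today):

# Sketch of the proposed default: a fixed number of inodes per MiB of PV capacity,
# optionally tuned per storage class via the proposed inodeMultiplier parameter.

MIB = 1024 * 1024

def proposed_inode_limit(pv_size_bytes, inode_multiplier=8):
    # default of 8 inodes per MiB ~ 128 KiB avg file size, 8192 inodes per GiB
    return inode_multiplier * (pv_size_bytes // MIB)

print(proposed_inode_limit(1 * 1024 * MIB))                        # 1 GiB  ->  8192 inodes
print(proposed_inode_limit(10 * 1024 * MIB))                       # 10 GiB -> 81920 inodes
print(proposed_inode_limit(10 * 1024 * MIB, inode_multiplier=32))  # "many small files" class -> 327680 inodes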

Describe alternatives you've considered

As a static inodeLimit in the storage class would be applied to all PVs created from the storage class independent of the actual size of each PV, it is good to have an alternative that is configured dynamically based on the actual storage capacity and calculated from parameters that are well known to the OpenShift user.

gfschmidt commented 3 years ago

Another related issue: I can confirm that the inodeLimit in the storage class is indeed honored by CSI v2.1.0. The inodeLimit: "90000" (quotes are required) in the storage class:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: scale-test-sc
provisioner: spectrumscale.csi.ibm.com
parameters:
  volBackendFs: "fs1"
  clusterId: "215057217487177715"
  inodeLimit: "90000"
reclaimPolicy: Delete

leads to a PV

# oc get pv pvc-802d2392-8c1c-44ea-bb60-2c354444f2ba
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                    STORAGECLASS    REASON   AGE
pvc-802d2392-8c1c-44ea-bb60-2c354444f2ba   10Gi       RWX            Delete           Bound    default/scale-test-pvc   scale-test-sc            14s

backed by a fileset with around 90,000 inodes:

# mmlsfileset ess3000_1M pvc-802d2392-8c1c-44ea-bb60-2c354444f2ba -L
Filesets in file system 'ess3000_1M':
Name                            Id      RootInode  ParentId Created                      InodeSpace      MaxInodes    AllocInodes Comment
pvc-802d2392-8c1c-44ea-bb60-2c354444f2ba 15 7864323       0 Wed Apr 21 10:37:42 2021       15                90112          90112 Fileset created by IBM Container Storage Interface driver

as shown here in the last column before the comment: 90112 inodes.

However, the number of inodes is not correctly reflected from within a pod that has this PV mounted:

# oc rsh ibm-spectrum-scale-test-pod
/ # df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                 445.6G     31.3G    414.3G   7% /
fs1                      10.0G         0     10.0G   0% /data         <<< the Spectrum Scale 10GiB PV: correct PV/fileset size
[...]
/ # df -i
Filesystem              Inodes      Used Available Use% Mounted on
overlay              233615296    870177 232745119   0% /
fs1                   18740480     16426  18724054   0% /data             <<<< the Spectrum Scale 10GiB PV: incorrect no. of inodes
[...]

While the size of the mounted fileset behind the PV under mount point /data is correctly reported as 10 GiB, the free inode number is wrong. Instead of the 90112 inodes of the fileset pvc-802d2392-8c1c-44ea-bb60-2c354444f2ba in the file system fs1, it shows the total maximum number of inodes of the Spectrum Scale file system fs1, as reported by mmdf:

[root@fscc-sr650-12 CP4D]# mmdf ess3000_1M -F
Inode Information
-----------------
Total number of used inodes in all Inode spaces:              16425
Total number of free inodes in all Inode spaces:            2738135
Total number of allocated inodes in all Inode spaces:       2754560
Total of Maximum number of inodes in all Inode spaces:     18740480      <<< based on this number

With regard to the issue described above, this makes it even harder or impossible for an OpenShift user to understand why he or she is getting a "No space left on device" error when running out of inodes (i.e. with a low default inodeLimit on independent-fileset-based PVs), because the actual inode limit (here 90112) is not shown in the container.

otandka commented 3 years ago

Regarding "Scale could still efficiently save more smaller files using subblocks", another factor is that the smallest files can be stored in the inode itself. Files of less than 3900 bytes or so require zero subblocks, further distorting the relationship between PV size and inode usage. I'd suggest that the relationship is not linear, so the average file size is likely to increase as the PV size increases. For the default computation, applying a minimum value of a few thousand inodes for the smallest PV sizes might help.

gfschmidt commented 3 years ago

So if we suggest it should be dynamic but not linear with PV size, then I agree that linear might not be the best option here, as the estimated average file size would increase too heavily with the requested capacity of the PV. I initially proposed the linear approach as an example to improve on the current calculation based on the file system block size, because it would be simple for an OpenShift user to understand when requesting a PV based on the required capacity (MiB) or the required inodes (i.e. a fixed number of inodes per MiB).

If we want to base the inode limit dynamically on the PV size, but not linearly, then I think a formula like the one below might indeed be more appropriate for the dynamic calculation of the inode limit per PV/fileset:

  1. Assumption: the minimum PV size is 1 GiB with an estimated avg file size of 4 KiB (256k inode limit)

  2. Formula/calculation of the inode limit:

                         (1024 x 1024)
     inodeLimit = -------------------------------- x [SIZE in GiB]
                   4^(1 + 0.3*log2([SIZE in GiB]))

     which yields

     1 GiB PV ->     262,144 inodes /     4 KiB avg file size
     1 TiB PV ->   4,194,304 inodes /   256 KiB avg file size
     1 PiB PV ->  67,108,864 inodes / 16384 KiB avg file size

     We can adjust the multiplier "0.3" in front of the "log2()" so that the estimated avg file size increases less or more heavily with the order of magnitude of the requested capacity...
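
A small Python sketch of this formula that reproduces the three example values above:

# Sketch of the proposed non-linear inode limit calculation.
from math import log2

def inode_limit(size_gib, factor=0.3):
    # factor corresponds to the adjustable "0.3" in front of log2()
    return int((1024 * 1024) / 4 ** (1 + factor * log2(size_gib)) * size_gib)

print(inode_limit(1))        #   262144 inodes (1 GiB PV, ~4 KiB avg file size)
print(inode_limit(1024))     #  4194304 inodes (1 TiB PV, ~256 KiB avg file size)
print(inode_limit(1024**2))  # 67108864 inodes (1 PiB PV, ~16 MiB avg file size)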

deeghuge commented 3 years ago

As per the recommendation from the core (GPFS) side, the following was decided: if the volume is smaller than 10 GB, the default maxInodes setting is 100K; if the volume is larger than 10 GB, it is 200K. In the future we are working with core to remove inode management from container users and manage it internally at the Scale level.
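
For reference, a minimal sketch of this defaulting rule as stated above (whether exactly 10 GB falls into the 100K or the 200K bucket, and whether GB or GiB is meant, is not specified here):

# Sketch of the decided default: 100K maxInodes for volumes below 10 GB,
# 200K for larger volumes (threshold handling assumed, see note above).

GB = 1000 ** 3  # assuming decimal GB as written

def default_max_inodes(volume_size_bytes):
    return 100_000 if volume_size_bytes < 10 * GB else 200_000

print(default_max_inodes(1 * GB))    # 100000
print(default_max_inodes(100 * GB))  # 200000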

deeghuge commented 2 years ago

This is being fixed as part of gpfs core. No action needed from CSI side. Please reopen if required.