Another related issue:
I can confirm that the inodeLimit in the storage class is indeed honored by CSI v2.1.0. Setting inodeLimit: "90000" (quotes are required) in the storage class:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: scale-test-sc
provisioner: spectrumscale.csi.ibm.com
parameters:
  volBackendFs: "fs1"
  clusterId: "215057217487177715"
  inodeLimit: "90000"
reclaimPolicy: Delete
leads to a PV
# oc get pv pvc-802d2392-8c1c-44ea-bb60-2c354444f2ba
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-802d2392-8c1c-44ea-bb60-2c354444f2ba 10Gi RWX Delete Bound default/scale-test-pvc scale-test-sc 14s
backed by a fileset with around 90,000 inodes:
# mmlsfileset ess3000_1M pvc-802d2392-8c1c-44ea-bb60-2c354444f2ba -L
Filesets in file system 'ess3000_1M':
Name Id RootInode ParentId Created InodeSpace MaxInodes AllocInodes Comment
pvc-802d2392-8c1c-44ea-bb60-2c354444f2ba 15 7864323 0 Wed Apr 21 10:37:42 2021 15 90112 90112 Fileset created by IBM Container Storage Interface driver
as shown here in the MaxInodes and AllocInodes columns before the comment: 90112 inodes.
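The gap between the requested 90000 and the reported 90112 is presumably just Spectrum Scale rounding the requested limit up to an internal inode allocation granularity; the granularity used below is an assumption, only the arithmetic is shown as a minimal sketch:

requested = 90000
granularity = 1024  # assumed allocation granularity, not confirmed by the output
allocated = -(-requested // granularity) * granularity  # ceiling to the next multiple
print(allocated)  # 90112, matching MaxInodes/AllocInodes above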
However, the number of inodes is not correctly reflected from within a pod that has this PV mounted:
# oc rsh ibm-spectrum-scale-test-pod
/ # df -h
Filesystem Size Used Available Use% Mounted on
overlay 445.6G 31.3G 414.3G 7% /
fs1 10.0G 0 10.0G 0% /data <<< the Spectrum Scale 10GiB PV: correct PV/fileset size
[...]
/ # df -i
Filesystem Inodes Used Available Use% Mounted on
overlay 233615296 870177 232745119 0% /
fs1 18740480 16426 18724054 0% /data <<<< the Spectrum Scale 10GiB PV: incorrect no. of inodes
[...]
While the size of the mounted fileset behind the PV under mount point /data is correct at 10 GiB, we see a wrong number of free inodes. Instead of the 90112 inodes of the fileset pvc-802d2392-8c1c-44ea-bb60-2c354444f2ba in the file system fs1, the pod shows the total maximum number of inodes of the Spectrum Scale file system fs1, as reported by mmdf:
[root@fscc-sr650-12 CP4D]# mmdf ess3000_1M -F
Inode Information
-----------------
Total number of used inodes in all Inode spaces: 16425
Total number of free inodes in all Inode spaces: 2738135
Total number of allocated inodes in all Inode spaces: 2754560
Total of Maximum number of inodes in all Inode spaces: 18740480 <<< based on this number
With regard to the issue described above, this makes it even harder or impossible for an OpenShift user to understand why he or she is getting "No space left on device" when running out of inodes (i.e. with a low default inodeLimit on independent fileset-based PVs), because the actual inode limit (here 90112) is not shown in the container.
Regarding, "Scale could still efficiently save more smaller files using subblocks", another factor is that the smallest files can be stored in the inode. Files less than 3900 bytes or so, require zero subblocks, further distorting the relationship between PV size and inode usage. I'd suggest that the relationship is not linear, so that the average file size is likely to increase as the PV size increases. For the default computation, applying a minimum value of a few thousand inodes for the smallest PV sizes might help.
So if the suggestion is that it should be dynamic but not linear with the PV size, then I agree that linear might not be the best option here, as the estimated average file size is likely to increase with the requested capacity of the PV. I initially proposed the linear approach as an example to improve on the current calculation based on the file system block size, because it would be simple for an OpenShift user to understand when requesting a PV based on the required capacity (MiB) or the required number of inodes (i.e. a fixed number of inodes per MiB).
If we want to base the inode limit dynamically on the PV size but not linearly, then I think a formula like the one below might indeed be more appropriate for the dynamic calculation of the inode limit per PV/fileset:
Assumption: minimum PV size is 1 GiB with an estimated average file size of 4 KiB (256k inode limit)
Formula for calculating the inode limit:
inodeLimit = [SIZE in GiB] x (1024 x 1024) / 4^(1 + 0.3 x log2([SIZE in GiB]))
which yields
1GiB PV -> 262,144 inodes / 4kB avg file size
1TiB PV -> 4,194,304 inodes / 256kB avg file size
1PiB PV -> 67,108,864 inodes / 16384kB avg file size
We can adjust the multiplier "0.3" in front of the "log2()" to make the inode limit grow less or more steeply with the order of magnitude of the requested capacity...
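A minimal Python sketch of this calculation (function and variable names are mine, not part of CSI), reproducing the three sample values above:

import math

def inode_limit_log_formula(size_gib):
    # inodeLimit = [SIZE in GiB] x (1024 x 1024) / 4^(1 + 0.3 x log2([SIZE in GiB]))
    return round(size_gib * (1024 * 1024) / 4 ** (1 + 0.3 * math.log2(size_gib)))

for size_gib in (1, 1024, 1024 * 1024):  # 1 GiB, 1 TiB, 1 PiB
    print(size_gib, inode_limit_log_formula(size_gib))  # 262144, 4194304, 67108864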
As per the recommendation from the core (GPFS) side, the following was decided: if the volume is smaller than 10 GB, set 100K max inodes; if the volume is larger than 10 GB, set 200K max inodes. In the future we are working with core to remove inode management from container users and manage it internally at the Scale level.
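If I read that decision correctly, the default reduces to something like the following sketch (threshold and values taken from the comment above; the exact inode counts and the handling of exactly 10 GB are assumptions):

def default_max_inodes(volume_size_gb):
    # Volumes below 10 GB get 100K max inodes, larger volumes get 200K.
    return 100_000 if volume_size_gb < 10 else 200_000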
This is being fixed as part of gpfs core. No action needed from CSI side. Please reopen if required.
Is your feature request related to a problem? Please describe.
While deploying IBM Cloud Pak for Data v3.5.2 (CP4D) with CNSA v5.1.0.3 and CSI v2.1.0 on OpenShift 4.5 and 4.6 in different environments, we observed cases where the CP4D control plane ("lite" assembly) failed to install because a PVC (of one of the subcomponents) ran out of inodes:
The PVC was backed by an independent fileset created from the following storage class:
Without defining an explicit inodeLimit in the storage class, the default number of inodes for a PV created from the storage class is dynamically calculated as: volume size / block size of the file system, according to Storage class, v2.1.0. This changed from CSI v2.0.0, where the default setting was static with inodeLimit = 1 million.
So with the same storage class the IBM Cloud Pak for Data deployment failed in one environment but succeeded in another. The reason was the dynamic default inodeLimit calculation combined with the different block sizes of the underlying Spectrum Scale file systems used for CSI. The installation of the CP4D control plane ("lite" assembly) created the following PVs of 1 GiB and 10 GiB from the storage class:
The PVs were backed by the following independent filesets in the IBM Spectrum Scale file system:
The inode limits of these PVs and their related independent filesets can indeed be calculated from the above formula and confirmed with the output of the mmlsfileset -L command. Here the deployment succeeded because the 10 GiB PVs got an inode limit of 10240, as the underlying Spectrum Scale file system was using a block size of only 1 MiB. In the other environment the Spectrum Scale default block size of 4 MiB was used, so the number of inodes for the same PVs was reduced to a quarter (e.g. 2560 inodes for 10 GiB PVs), which was too small to hold the required number of files for the deployment.
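For reference, a small sketch of the v2.1.0 default calculation with the two block sizes mentioned (the function name is mine, not the actual CSI code):

GiB = 1 << 30
MiB = 1 << 20

def default_inode_limit(pv_size, fs_block_size):
    # CSI v2.1.0 default: volume size / file system block size
    return pv_size // fs_block_size

print(default_inode_limit(10 * GiB, 1 * MiB))  # 10240 (environment where the deployment succeeded)
print(default_inode_limit(10 * GiB, 4 * MiB))  # 2560  (environment where it failed)
print(default_inode_limit(1 * GiB, 4 * MiB))   # 256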
Describe the solution you'd like
Using a dynamic inode limit based on (PV size / file system block size) assumes an average file size similar to the IBM Spectrum Scale file system block size. It does not take into account that IBM Spectrum Scale can still efficiently store many smaller files using subblocks, i.e. the maximum number of files in an IBM Spectrum Scale file system is limited by the overall capacity divided by the inode size (e.g. 4k) plus the subblock size (2k...16k depending on the block size). Furthermore, the OpenShift user (or even the OpenShift admin) does not know the block size of the underlying IBM Spectrum Scale file system, so a calculation that depends on the block size is not transparent to an OpenShift user: in one case a 10 GiB PV allows 10240 files, in another case only 2560 files, even though a similar YAML for the storage class is used. This approach lacks reproducibility from an OpenShift perspective where the block size is not known.
In the example above we used 1 MiB as file system block size, while the default block size in IBM Spectrum Scale is 4 MiB (and can go up to 16 MiB). With the default block size of 4 MiB we could only save 256 files in the 1 GiB volumes, or as few as 64 with the maximum file system block size of 16 MiB. We may even assume that PVC requests for small PVs often go along with the need to work on many smaller files (<1 MiB) rather than larger files (>=1 MiB).
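As a rough illustration of that capacity-based upper bound (the 4 KiB inode size and 8 KiB subblock size below are example values; the subblock size in particular depends on the file system block size):

GiB = 1 << 30
KiB = 1 << 10

def max_small_files(capacity, inode_size=4 * KiB, subblock_size=8 * KiB):
    # Conservative ceiling: one inode plus at least one subblock per small file
    # (files small enough to be stored in the inode need even less space).
    return capacity // (inode_size + subblock_size)

print(max_small_files(10 * GiB))  # ~873813 small files fit in 10 GiB of capacity,
                                  # versus only 2560 inodes from the 4 MiB block size default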
We see that the applied inode limit (= number of files that can be created) for a fileset backing a PV is unexpectedly low with common IBM Spectrum Scale block sizes of 4 MiB (the default) or higher - especially for small volumes. In this example, with an automated deployment of applications like CP4D, the user does not even have a chance to specify the capacity of the required PVs, as these are created automatically based on storage capacity requirements.
An OpenShift admin can either go with a static inode limit (e.g. 1 million inodes) defined in the storage class, applied to all PVs created from this storage class independent of their size, or with a dynamic inode limit based on the default calculation (PV size / file system block size). When using the latter, I would propose not to base the number of inodes on the file system block size. The file system block size does not actually limit the number of files in the fileset as suggested by the formula, nor is a calculation based on a parameter unknown in OpenShift (the block size) comprehensible to an OpenShift user. The OpenShift user does not know the file system block size and may wonder about different inode limits for similar storage classes and PV sizes in different environments or on different Spectrum Scale file systems.
One benefit of using storage classes with independent filesets is that independent filesets provide an independent inode space which is not shared with the root file system. So there is not necessarily a need to enforce a scarce inode calculation here.
To reduce "harm" to clients (i.e. running out of "space" = inodes on almost empty volumes) who work with the defaults (e.g. no static
inodeLimit
defined in the storage class and using a 4 MiB default block size in IBM Spectrum Scale) I would suggest to take the unknown file system block size out of the picture and base the defaultinodeLimit
calculation on parameters that can be understood and comprehended by an OpenShift user from an OpenShift perspective, e.g. by using a fixed number of inodes per GiB of storage capacity, e.g. by defining a dynamically calculatedfor each PV of the storage class.
This would be based on an average file size of 128 KiB and introduce a linear relation of inodes to PV capacity, e.g. 8192 files per GiB that can easily be understood from an OpenShift user perspective and documented accordingly.
A user could then easily calculate the required size of the PVC by taking into account the anticipated amount of data (required storage capacity) and the anticipated number of files (required number of inodes), without an unknown parameter like the file system block size in the picture (which is not even needed here, as it alone does not limit the maximum number of files).
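With such a fixed inodes-per-GiB rule, the sizing a user would do could look like this sketch (8192 inodes per GiB as proposed above; the helper itself is purely illustrative):

import math

INODES_PER_GIB = 8192  # proposed default, ~128 KiB average file size

def required_pvc_size_gib(expected_data_gib, expected_file_count):
    # Take whichever requirement is larger: raw capacity or inode count.
    return max(math.ceil(expected_data_gib),
               math.ceil(expected_file_count / INODES_PER_GIB))

print(required_pvc_size_gib(2, 50_000))   # 7 GiB: the file count dominates
print(required_pvc_size_gib(20, 50_000))  # 20 GiB: the capacity dominates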
We could even introduce a variable inodeMultiplier as an additional parameter in the storage class to adapt more specifically to customer needs if a customer wants to store smaller or larger files than the default above assumes, e.g. with the default MULTIPLIER = 8 (i.e. the 8192 inodes per GiB proposed above). A customer could even define multiple storage classes backed by the same file system with different multipliers to accommodate smaller PVs with many files (storage class A) and larger PVs with fewer files (storage class B).
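A minimal sketch of how such an inodeMultiplier-based default could be computed (the parameter name and the exact formula are my reading of the proposal, not an existing CSI parameter):

def inode_limit_linear(pv_size_gib, inode_multiplier=8):
    # Proposal: MULTIPLIER x 1024 inodes per GiB of requested capacity,
    # i.e. 8192 inodes/GiB (~128 KiB average file size) with the default of 8.
    return inode_multiplier * 1024 * pv_size_gib

print(inode_limit_linear(1))       # 8192
print(inode_limit_linear(10))      # 81920
print(inode_limit_linear(10, 32))  # 327680 for a "many small files" storage class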
Describe alternatives you've considered
Since a static inodeLimit in the storage class would be applied to all PVs created from that storage class regardless of their actual size, it is good to have an alternative that is configured dynamically based on the actual storage capacity and calculated from parameters that are well known to the OpenShift user.