Open cfillot opened 6 months ago
Just for understanding, are any snapshots taken (either in K8s or directly in Ontap)?
In any case, the snapshotReserve parameter of the backend can be used to make the hosting volume a bit larger than the PVC itself to accommodate for snapshots (hence the name) but also any other metadata etc. You currently have it set to 0%, try setting it to 10 or 20 (unless you make heavy use of snapshots/clones along with high data change rates, then you might need more). Note that this is a percentage calculated with the Ontap volume size as the base.
Thanks a lot for your suggestion! No, I don't use snapshots at all on these PVC/volumes, so I chose to set 0%. I didn't know this snapshot reserve space could also be used for metadata, I'll try that and let you know.
We have the same setup except we use ubuntu 22.04 and I can confirm we have the same problem. For us , the 100% way to trigger it is to fill the drive, then delete the data and try to fill again.
We have a reproducer using FIO random write of a 16g file to a 20G volume.
The problem happens on overwrite. We can see same bahaviour using iSCSI when the volume is not mounted using the discard ext4 option. With discard active, overwriting is working fine for iSCSI. However discard does not work for us at all for NVMEoTCP and after reaching out to NetApp, we were told that it is not yet supported but we should be able to use thick provisionined volumes. We always had spaceAllocation: "true" on the trident backend but did not have the ext4 mount option.
NetApp told us to use thick provisioning, but setting spaceReserve: volume is not enough as it sets thick provisioning only on the volume level but our internal NetApp storage experts told us that in order to reach 100% overwrite capability we need thick provisioning also for the LUN. From a peek into the trident source the LUN is always thin-provisioned. We are changing the parameter manually.
This is currently a blocker for broader production use and we use trident and NetApp only for less important things as we never understood this issue completely.
Also, I should add, that this is all completely without snapshots.
Hi @cfillot, could you please provide us the following information to root cause this issue?
Describe the bug
When creating (and using) PVC based on NVMe/TCP backends, I get errors or alerts on a regular basis from the Netapp storage arrays like these:
I can manually resize volumes directly on the storage arrays to avoid these errors but of course it does not scale. As far as I understand it, 10% of additional space is added to the volume size for metadata, but it seems that this is not enough.
NVMe backends are configured as follows:
Storage Class:
PVC:
BTW, if I don't specify "spaceReserve: volume" in backends, filling the EXT4 filesystem in a Pod makes the volume Read-Only (whereas this should only give a "Not enough space on device").
Environment