NetApp / trident

Storage orchestrator for containers
Apache License 2.0
Cannot re-size PVCs when a manually created clone for the underlying volume exists #345

Closed jhindulak closed 4 months ago

jhindulak commented 4 years ago

Describe the bug

We ran into an issue today when attempting to re-size one of our volumes. Kubernetes accepted the PVC edit, but the volume stayed at its original size. Inspecting the volume claim showed the following events:

Events:
  Type     Reason             Age                  From               Message
  ----     ------             ----                 ----               -------
  Warning  ExternalExpanding  3m8s                 volume_expand      Ignoring the PVC: didn't find a plugin capable of expanding the volume; waiting for an external controller to process this PVC.
  Warning  ResizeFailed       39s (x2 over 3m7s)   netapp.io/trident  failed in resizing the volume or PV: unable to resize the volume: volume trident_rd_prod_default_jenkins_d7cb2 does not exist
  Warning  ExternalExpanding  39s (x2 over 2m39s)  volume_expand      Ignoring the PVC: didn't find a plugin capable of expanding the volume; waiting for an external controller to process this PVC.

The Trident logs showed the same error:

time="2020-02-18T18:21:25Z" level=info msg="GetBackend information." backend="&{0xc4201b81e0 trident-rd-prod true online map[aggr1_N1:0xc4206e6bc0] map[default-jenkins-d7cb2:0xc420328f00 ...]}" ... backendExternal.Name=trident-rd-prod backendExternal.State=online
time="2020-02-18T18:21:25Z" level=info msg="GetBackend information." backend="&{0xc42059d1e0 trident-rd-prod-bootstrap true online map[aggr5_N1:0xc42046ec00] map[]}" ... backendExternal.Name=trident-rd-prod-bootstrap backendExternal.State=online
time="2020-02-18T18:21:55Z" level=error msg="Unable to resize the volume." backend=trident-rd-prod current_size=53687091200 error="volume trident_rd_prod_default_jenkins_d7cb2 does not exist" new_size=107374182400 volume=default-jenkins-d7cb2 volume_internal=trident_rd_prod_default_jenkins_d7cb2
time="2020-02-18T18:21:55Z" level=warning msg="Unable to clean up artifacts of volume resize: unable to resize the volume: volume trident_rd_prod_default_jenkins_d7cb2 does not exist. Repeat resizing the volume or restart trident."
time="2020-02-18T18:21:55Z" level=error msg="Kubernetes frontend failed in resizing the volume or PV: unable to resize the volume: volume trident_rd_prod_default_jenkins_d7cb2 does not exist" PVC=jenkins
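The error line above already carries the requested sizes in bytes. As a rough sketch (not part of Trident; `parse_logfmt` is a hypothetical helper, and the key=value layout is assumed from the lines shown here), the fields can be split with Python's `shlex` and the byte counts converted to GiB:

```python
import shlex

def parse_logfmt(line: str) -> dict:
    """Split a key=value log line into a dict (hypothetical helper)."""
    fields = {}
    for token in shlex.split(line):  # shlex keeps quoted values together
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields

# Error line copied from the Trident log above.
line = (
    'time="2020-02-18T18:21:55Z" level=error msg="Unable to resize the volume." '
    'backend=trident-rd-prod current_size=53687091200 '
    'error="volume trident_rd_prod_default_jenkins_d7cb2 does not exist" '
    'new_size=107374182400 volume=default-jenkins-d7cb2 '
    'volume_internal=trident_rd_prod_default_jenkins_d7cb2'
)

GIB = 1024 ** 3  # bytes per GiB
fields = parse_logfmt(line)
current_gib = int(fields["current_size"]) / GIB
new_gib = int(fields["new_size"]) / GIB
print(f'{fields["volume"]}: {current_gib:g} GiB -> {new_gib:g} GiB')
print("error:", fields["error"])
```

In other words, the failed request was a resize from 50 GiB to 100 GiB on the trident-rd-prod backend.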

tridentctl get volume was able to find the volume, and it was not orphaned:

items:
- Config:
    accessInformation:
      nfsPath: /trident_qtree_pool_trident_rd_prod_SXJDJXBCFC/trident_rd_prod_default_jenkins_d7cb2
      nfsServerIp: ***
    accessMode: ReadWriteOnce
    blockSize: ""
    cloneSourceSnapshot: ""
    cloneSourceVolume: ""
    cloneSourceVolumeInternal: ""
    encryption: ""
    fileSystem: ext4
    internalName: trident_rd_prod_default_jenkins_d7cb2
    name: default-jenkins-d7cb2
    protocol: file
    securityStyle: ""
    size: "53687091200"
    spaceReserve: ""
    splitOnClone: ""
    storageClass: trident-default
    version: "1"
  backend: trident-rd-prod
  orphaned: false
  pool: aggr5_N1

We turned on debug logging for Trident and found this interesting error (the last line below):

time="2020-02-18T19:30:03Z" level=debug msg="Attempting to acquire shared lock (prune)." lock=e51b38cd-9a63-11e8-80d6-00a0988d169a-trident_rd_prod
time="2020-02-18T19:30:03Z" level=debug msg="Logged EMS message." driver=ontap-nas-economy
time="2020-02-18T19:30:03Z" level=debug msg="Started quota resize." flexvol=trident_qtree_pool_trident_rd_prod_CBUJTDOJOR
time="2020-02-18T19:30:03Z" level=debug msg="Started quota resize." flexvol=trident_qtree_pool_trident_rd_prod_SXJDJXBCFC
time="2020-02-18T19:30:04Z" level=debug msg="Error resizing quotas." error="API status: failed, Reason: No valid quota rules found in quota policy default for volume trident_qtree_pool_trident_rd_prod_SXJDJXBCFC_clone_10022020_163446_87 in Vserver cnas02-trident. , Code: 14958" flexvol=trident_qtree_pool_trident_rd_prod_SXJDJXBCFC_clone_10022020_163446_87

After taking the clone offline and restarting Trident, we were able to resize the volume successfully. I am not sure whether the conflict is due to the way the clone was named or whether creating a quota rule on the clone would have corrected the issue. If this error had been logged by default (instead of being put behind the -debug switch), it would have saved us quite a bit of troubleshooting.
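For anyone hitting the same symptom, one way to spot it is to scan debug-level logs for failed quota resizes and compare the failing flexvol with the flexvol that backs the PVC. This is a minimal sketch, not Trident code: `failing_flexvols` is a hypothetical helper, the log format is assumed from the lines in this report, and the prefix comparison is only a heuristic based on the clone name seen here.

```python
import shlex

def failing_flexvols(log_lines):
    """Collect flexvol names from 'Error resizing quotas.' entries (hypothetical helper)."""
    names = []
    for line in log_lines:
        # shlex honours the double quotes around multi-word values.
        fields = dict(t.split("=", 1) for t in shlex.split(line) if "=" in t)
        if fields.get("msg") == "Error resizing quotas.":
            names.append(fields["flexvol"])
    return names

# Debug lines copied from the log in this report.
log_lines = [
    'time="2020-02-18T19:30:03Z" level=debug msg="Started quota resize." '
    'flexvol=trident_qtree_pool_trident_rd_prod_SXJDJXBCFC',
    'time="2020-02-18T19:30:04Z" level=debug msg="Error resizing quotas." '
    'error="API status: failed, Reason: No valid quota rules found in quota policy '
    'default for volume trident_qtree_pool_trident_rd_prod_SXJDJXBCFC_clone_10022020_163446_87 '
    'in Vserver cnas02-trident. , Code: 14958" '
    'flexvol=trident_qtree_pool_trident_rd_prod_SXJDJXBCFC_clone_10022020_163446_87',
]

# nfsPath from the tridentctl output has the form /<flexvol>/<qtree>.
nfs_path = "/trident_qtree_pool_trident_rd_prod_SXJDJXBCFC/trident_rd_prod_default_jenkins_d7cb2"
pool_flexvol = nfs_path.split("/")[1]

failed = failing_flexvols(log_lines)
# Heuristic: flag failures on flexvols whose name extends the pool flexvol's name.
related = [name for name in failed if name.startswith(pool_flexvol + "_")]
print("flexvol backing the PVC:", pool_flexvol)
print("failed quota resizes on similarly named flexvols:", related)
```

Here the only failure is on the manually created clone, whose name begins with the name of the flexvol that houses the PVC's qtree. Whether that naming is the actual cause is not established by this report.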

Environment

To Reproduce

Steps to reproduce the behavior:

  1. Install Trident and configure an ontap-nas-economy backend
  2. Create a Volume on the ontap-nas-economy backend
  3. In NetApp, create a clone of the flexvol that houses the volume
  4. Try to re-size the volume in kubernetes
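Step 4 amounts to raising the PVC's storage request. The sketch below only builds a `kubectl patch` command for that. The PVC name `jenkins` comes from the log above, while the `default` namespace and the `100Gi` target are assumptions inferred from the volume name and the `new_size` value. `build_resize_command` is a hypothetical helper, and running the command assumes `kubectl` is on the PATH and the storage class allows volume expansion.

```python
import json
import shlex

def build_resize_command(pvc: str, namespace: str, new_size: str) -> list:
    """Build a kubectl merge-patch command that raises a PVC's storage request."""
    patch = {"spec": {"resources": {"requests": {"storage": new_size}}}}
    return [
        "kubectl", "patch", "pvc", pvc,
        "-n", namespace,
        "--type", "merge",
        "-p", json.dumps(patch),
    ]

# Namespace and size are assumptions; adjust them to the affected claim.
cmd = build_resize_command("jenkins", "default", "100Gi")
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

To run it from Python instead of pasting it, `subprocess.run(cmd, check=True)` would execute the same argument list.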

Expected behavior

The volume is re-sized successfully.

Additional context

The volume driver we're using doesn't appear to support creating clones directly from trident, so we created the clone on the NetApp directly.

vasum0406 commented 5 months ago

The issue is not reproducible in the 24.06 release.