NetApp / trident

Storage orchestrator for containers
Apache License 2.0

Deleting an imported PVC / PV backed by Trident leaves other Kubernetes resources and stays in cache #813

Closed: rikgig closed this issue 3 weeks ago

rikgig commented 1 year ago

Describe the bug

We are doing DR exercises that involve importing a volume multiple times into our OpenShift cluster. When we delete the PVC from OpenShift, the volume stays, which is OK. But when we then delete the PV, the volume cannot be imported again. We found that a TridentVolume resource was also left dangling, so we deleted it; the import still failed. As a last resort we recycled all of the Trident pods, and only then could the volume be imported.

Environment

To Reproduce

1. Delete a PVC and its PV.
2. Run the import from the CLI; the import fails.
3. Delete the TridentVolume resource and retry the import; it fails again.
4. Recycle the Trident pods and retry the import; it succeeds.
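In command form, the sequence looks roughly like this (a sketch; the namespace, backend, volume, and file names are hypothetical, and the controller deployment name varies by Trident version):

```
# 1. Delete the PVC and the Retain-policy PV.
kubectl delete pvc my-app-data -n my-app
kubectl delete pv my-app-data-pv

# 2. Re-import fails: Trident still tracks the volume internally.
tridentctl import volume ontap-nas my_app_data -f pvc-import.yaml -n trident

# 3. Deleting the TridentVolume CR behind Trident's back does not help either.
kubectl delete tridentvolumes.trident.netapp.io my-app-data-pv -n trident
tridentctl import volume ontap-nas my_app_data -f pvc-import.yaml -n trident   # still fails

# 4. Only after restarting the Trident pods does the import succeed.
kubectl rollout restart deployment/trident-controller -n trident
tridentctl import volume ontap-nas my_app_data -f pvc-import.yaml -n trident   # succeeds
```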

Expected behavior

At a minimum, the import should succeed once the TridentVolume resource has been deleted.

Additional context

gnarl commented 1 year ago

@rikgig, it sounds like you are using a "retain" reclaim policy so that the volume doesn't get deleted. Trident maintains a CR (tvol) to keep track of this volume in the case where the reclaim policy is later changed to "delete".

Instead of trying to import the volume again you may be able to use a new PVC to statically provision the volume. Let us know if this would work for your particular use case.
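If the old PV still exists with a Retain reclaim policy, rebinding it to a new claim might look roughly like this (a plain-Kubernetes sketch; the PV/PVC names and storage class are placeholders):

```
# A Retain-policy PV whose PVC was deleted sits in "Released" with a stale
# claimRef; clearing the claimRef makes it bindable again.
kubectl patch pv my-app-data-pv -p '{"spec":{"claimRef":null}}'

# A new PVC can then bind to that exact PV via spec.volumeName.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data-2
  namespace: my-app
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi            # must not exceed the PV's capacity
  storageClassName: ontap-nas   # must match the PV's storageClassName
  volumeName: my-app-data-pv
EOF
```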

kargobay commented 1 year ago

If you delete the tvol via kubectl, Trident does not notice it, so the volume stays in Trident's internal data structures until you restart the Trident pod; at that point it rebuilds its internal state from the actual CRs in the cluster. As an alternative, you can delete the Trident volume via "tridentctl delete volume". That way it is deleted AND Trident is aware of it, so you can then re-import it. Not sure if this is intended behavior or not.
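The two delete paths, side by side (namespace assumed to be trident; `tvol` is the short name for the TridentVolume CRD, and the volume name is hypothetical):

```
# Deleting the CR behind Trident's back -- Trident's in-memory state goes stale:
kubectl delete tvol my-app-data-pv -n trident   # Trident does not notice this

# Deleting through tridentctl -- Trident updates its own state as well:
tridentctl delete volume my-app-data-pv -n trident
tridentctl get volume -n trident                # gone from Trident's view too
```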

rikgig commented 1 year ago

Hi @gnarl, thanks for the answer. Yes, we were using the Retain policy. Our use case: we have a volume on a primary site that is linked to a remote volume on a DR site, with the two kept in sync via SnapMirror. When disaster occurs, we break the SnapMirror relationship, import the volume into the DR site cluster, and continue running on it. When the primary site comes back up, we re-establish the link, update the primary, and continue on the primary site again. Our volumes tend to become quite massive over time, so we want to keep them in sync rather than re-transfer them. The import we run always uses a new PVC. The error appears when we fail to do the manual cleanup described above. A rough sketch of the failover sequence follows.
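On the ONTAP side, that failover maps onto roughly the following commands (a sketch only; the SVM, volume, backend, and file names are placeholders, and the reverse resync assumes a common snapshot still exists between the two copies):

```
# Disaster: stop replication and make the DR copy writable.
snapmirror break -destination-path dr-svm:app_vol

# Import the now-writable DR volume into the DR cluster's Trident.
tridentctl import volume ontap-nas-dr app_vol -f pvc-import.yaml -n trident

# Primary is back: resync in the reverse direction so the primary picks up
# the changes made on the DR side, then fail back to the primary site.
snapmirror resync -source-path dr-svm:app_vol -destination-path primary-svm:app_vol
```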

rikgig commented 1 year ago

@kargobay Thanks for the answer. If we do the tridentctl delete volume, will that volume "really" be deleted from the storage? We want to keep it alive so that we don't have to redo the full transfer from the primary site to the secondary, since this volume will be quite massive and redoing the transfer would take time (and network costs). It's really just in Kubernetes that we want to re-import it. Unless importing it once would be enough? As described in my message above, we shut down the secondary (DR site) once the primary is back online and resync the primary with what has been done on the secondary.

kargobay commented 1 year ago

@rikgig That depends a bit on how you import the volume.

Import with the --no-manage option

With that option, Trident will not interfere with the ONTAP volume in any way; in particular, it will also not delete it. So your workflow is: import with --no-manage, delete through tridentctl when you fail back (this only removes Trident's record of the volume), and re-import later. A sketch follows.
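A minimal sketch of that sequence (the backend, volume, PVC, and file names are hypothetical):

```
# Import without handing lifecycle management to Trident.
tridentctl import volume ontap-nas-dr app_vol --no-manage -f pvc-import.yaml -n trident

# When failing back, remove the Kubernetes objects. Because the volume is
# unmanaged, tridentctl only removes Trident's record of it; the ONTAP
# volume itself is left untouched.
kubectl delete pvc app-data -n my-app
tridentctl delete volume my-app-data-pv -n trident

# The same ONTAP volume can later be imported again without a full retransfer.
tridentctl import volume ontap-nas-dr app_vol --no-manage -f pvc-import.yaml -n trident
```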

Import without the --no-manage option

In this case, deleting the volume via tridentctl will also delete the underlying ONTAP volume, as you've asked Trident to take over lifecycle management of that volume. You can still achieve what you want, as long as the workflow never issues a managed delete against data you still need; see the contrast sketch below.
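For contrast, the managed path with the same hypothetical names; the key difference is what a delete does (the final SnapMirror update is my assumption about how to keep the data safe, not a statement of Trident behavior):

```
# Managed import: Trident owns this volume's lifecycle from here on.
tridentctl import volume ontap-nas-dr app_vol -f pvc-import.yaml -n trident

# CAUTION: for a managed volume this deletes the backing ONTAP volume too.
tridentctl delete volume my-app-data-pv -n trident

# Assumption: replicate the data back to the primary first, so nothing is
# lost when the managed DR copy is eventually deleted.
snapmirror update -destination-path primary-svm:app_vol
```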

All of that said, may I suggest you take a look at Astra Control, which adds data management capabilities on top of Trident? It has SnapMirror support built in: it will set up the replication for you and automatically import the destination volume into the DR cluster (essentially as a warm standby), and in case of disaster you can fail over with the click of a button or a single API call. It also provides a controlled failover that shuts down the app, performs a final SnapMirror update, and then brings everything up on the other site. Let me know if you'd like more details; I don't want to "abuse" this issue too much ;-)

rikgig commented 1 year ago

Hi @kargobay, thanks for the answer. We will try this. But I'm still wondering: given that we deleted the TridentVolume, shouldn't the controller have been able to see that the volume was now available for import? Recycling the Trident pods could be troublesome in high-traffic environments. Just wondering...

sjpeeris commented 3 weeks ago

@rikgig Can you let us know if you have been able to resolve this issue on your end? If so, please close the issue.

rikgig commented 3 weeks ago

The issue was not totally resolved, but the project has been halted. Next time we'll see whether we still have issues with this function.