fluxcd / flux2

Open and extensible continuous delivery solution for Kubernetes. Powered by GitOps Toolkit.
https://fluxcd.io
Apache License 2.0

Document how Flux works with persistent volumes #2279

Closed kingdonb closed 6 months ago

kingdonb commented 2 years ago

Describe the bug

We need some documentation around how to use persistent volumes in an environment with Flux, since our documentation currently doesn't cover persistent volumes at all.

We have seen folks misunderstand the intricacies of using persistent volumes (or maybe they found a bug) on clusters where the volumes may live inside or outside of the cluster, and where clusters may be deleted and restored, so there are expectations around how persistent volumes behave outside the lifecycle of a cluster.

Some of this may vary based on the storage class or PV provider: an NFS provider that sits on top of an underlying storage provider, or one that lives off the cluster altogether, will each behave differently from, say, Portworx, OpenEBS, a local-path-provisioner, or anything else. We should provide guidance for at least a restricted subset of these modes, ideally based on what's most popular with Flux users. There are also commercial solutions that can back up and restore volumes; I'm not as familiar with those, and unsure whether they are in scope for us to document, but if they are popular with our users we should think seriously about covering the GitOps-specific issues in our own docs, perhaps under "Use Cases".

Steps to reproduce

The major challenge is that advice may vary from one provider to the next. For example, many providers will support some mechanism for snapshotting volumes, but others might not. So it will be tricky to provide good generic advice.

We can start with simple, common production use cases on the cloud providers. That should include an understanding that cluster operators generally don't create a PV explicitly; they create a PVC and let the cloud controller manager (or CSI provisioner) satisfy it by creating a persistent volume dynamically from the claim.
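
As a rough sketch of what that usually looks like (the storage class name here is an assumption and varies per provider), only the claim lives in Git:

```yaml
# Sketch: only this claim is committed to Git; the PV itself is created
# dynamically by the provisioner behind the (provider-specific) storage class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
  namespace: my-app
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3   # assumption: an AWS EBS CSI storage class
  resources:
    requests:
      storage: 10Gi
```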

A common problem we might cover: a user wants to mark a volume as important and ensure that it persists rather than being deleted along with its claim on a cluster (which could be easy to do accidentally). This comes up especially around the helm-controller, where many charts deploy a volume and you might instead want to reference an existing claim by name in something like an existingClaim value; this and many other common circumstances we could document and cover.
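
To sketch that Helm case concretely (the persistence.existingClaim key is common in charts but chart-specific, so treat these values as an assumption rather than a universal convention):

```yaml
# Sketch: point a chart at a PVC that is managed separately in Git, instead of
# letting the chart create (and potentially delete) one. Values keys vary per chart.
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: my-app
  namespace: my-app
spec:
  interval: 10m
  chart:
    spec:
      chart: my-app
      sourceRef:
        kind: HelmRepository
        name: my-charts
  values:
    persistence:
      existingClaim: app-data   # a PVC defined elsewhere in the repo
```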

It's fairly straightforward to set the reclaim policy to Retain, but it may not be obvious to everyone how to incorporate a dynamically provisioned persistent volume into a Git config repository, or how to use GitOps to patch it with a Retain policy.
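
One hedged sketch of the GitOps route (this assumes the dynamically created PV's manifest has been exported into the repo, e.g. with kubectl get pv -o yaml, since Flux only patches resources it applies itself; names and paths below are made up):

```yaml
# Sketch: a Flux Kustomization that force-sets Retain on an exported PV manifest.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: volumes
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/volumes
  prune: false                    # never garbage-collect volumes
  sourceRef:
    kind: GitRepository
    name: flux-system
  patches:
    - target:
        kind: PersistentVolume
        name: pvc-1234abcd        # hypothetical dynamically generated PV name
      patch: |
        - op: replace
          path: /spec/persistentVolumeReclaimPolicy
          value: Retain
```

The quicker, imperative alternative is a one-off kubectl patch on the PV, which is also worth mentioning in the doc.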

Expected behavior

If the person who posted here was operating under some misunderstanding about how to handle persistent volumes, how do we express that in the form of a doc? When this question comes up again we should have a document we can refer users to which clarifies the general expectations around how persistent volumes work with Flux and, broadly speaking, what there is to know about working with PVs and Flux.

If there is more to say than "set spec.volumeName so it matches your volume and it should work", depending on which persistent volume provider is in use, then let's figure out what information our users need in order to work with those different providers. I don't work with many different PV providers, so I honestly don't know how much variability there is in how these systems work from one provider to the next.
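
For reference, the "set spec.volumeName" advice boils down to something like this (names are placeholders):

```yaml
# Sketch: a claim that binds to one specific pre-existing PV instead of asking
# the provisioner for a new volume.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
  namespace: my-app
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ""        # or the PV's class; "" disables dynamic provisioning
  volumeName: pvc-1234abcd    # hypothetical name of the existing PV
  resources:
    requests:
      storage: 10Gi
```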

This advice (from the linked conversation above, some of which is described again in the body of this report) worked for me with the local-path-provisioner, and it worked for at least one other user who had some type of NFS volume claims on their cluster.

I have a feeling these are somewhat hobbyist-style configurations (well, maybe not NFS at large, but the local-path-provisioner is surely not suited for High Availability deployments, or probably production use at all). The more important use cases to document are likely cloud providers and more advanced storage providers that handle storage in a distributed way and don't balk at migrating PVs from one node to another.

I am particularly interested myself in how these concerns should be addressed in those environments too. In other words, if a volume outlives a cluster and should be transferred to another one on any cloud provider, like EBS on AWS or block storage elsewhere, is it generally the same advice? "Simply ensure the Retain policy is in use and set spec.volumeName to pull in the volume by UUID once it has been released from the other cluster"?
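
Within a single cluster, the sequence I have in mind looks roughly like this (a sketch; the PV name is a placeholder). Moving to a brand-new cluster additionally means recreating the PV object there, pointing at the same underlying disk, before binding it with spec.volumeName:

```sh
# Before the old claim goes away: make sure the PV (and its data) survives
kubectl patch pv pvc-1234abcd \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

# Once the old PVC is deleted the PV sits in "Released" with a stale claimRef;
# clearing it lets a new PVC that sets spec.volumeName bind the volume again
kubectl patch pv pvc-1234abcd --type merge \
  -p '{"spec":{"claimRef":null}}'
```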

Is this a common use case that many of our users would be interested in? If so, can we cover it broadly without going into too much cloud-provider-specific detail, or would that be much less valuable than documenting the differences from one provider to the next, in which case should we go into that level of detail for every provider?

Screenshots and recordings

No response

OS / Distro

N/A

Flux version

N/A

Flux check

N/A

Git provider

No response

Container Registry provider

No response

Additional context

No response

oxr463 commented 1 year ago

I’d love to take a crack at this, since I’m currently working on trying to reference the volumeHandle of an EBS volume for a Gitaly StatefulSet from the GitLab Helm Chart with Flux.
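
Roughly what I'm experimenting with is a statically defined PV over the existing disk, something like this sketch (volume ID, zone, and sizes are placeholders, and the chart-side wiring is omitted):

```yaml
# Sketch: a pre-existing EBS volume surfaced as a PV for the EBS CSI driver.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: gitaly-data
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: gp3
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0123456789abcdef0   # existing EBS volume ID
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.ebs.csi.aws.com/zone
              operator: In
              values:
                - us-east-1a              # zone where the volume lives
```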

kingdonb commented 1 year ago

At some point last year I actually managed to give a conference talk on this subject (but I don't think it serves the docs very well...)

If you have an idea where you'd like to start, I'll be happy to review something! Are you thinking to add a new entry in "Use Cases"?

oxr463 commented 1 year ago

If you have an idea where you'd like to start, I'll be happy to review something! Are you thinking to add a new entry in "Use Cases"?

Yeah, that sounds like a good place to start.

comminutus commented 12 months ago

What's the recommended way to restore Longhorn backups/system backups? I deploy Longhorn via Flux as part of my cluster. Ideally I'd like to restore all of my Longhorn PersistentVolumeClaims (PVCs) and PersistentVolumes before Flux deploys the applications which reference those PVCs.

I could add a SystemRestore resource from the Longhorn CRDs, but the trouble with that is that it assumes a restore is available. In the situation where the cluster is being created for the first time with no data, it would fail. I'm not aware of any way of conditionally running a Kustomization (for example, if it's a new cluster then don't run a restore, and if it's an old cluster, run the restore).

Clusters which don't use a data provisioner that's deployed with Flux probably don't run into this issue. In a cluster-rebuild scenario, after provisioning nodes the data provisioner would likely restore the data, PVs, PVCs, etc., and then one could run flux bootstrap once the PVCs and PVs are created, if they are available.

However, since Longhorn is usually deployed with Flux, I'm not sure how to accomplish the same task. Perhaps I could install Longhorn via Helm directly, restore my system backup, then once complete, bootstrap Flux? Just wondering what others do here.
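
For ordering alone, dependsOn between Kustomizations covers the install side, something like the sketch below (names are placeholders); it just doesn't help with the "only restore if a backup exists" part:

```yaml
# Sketch: apps wait for the Longhorn Kustomization, but Flux still has no way
# to express "apply the SystemRestore only when a backup actually exists".
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: longhorn
```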

kingdonb commented 6 months ago

For some time now there has been a section in the "Flux vertical scaling" installation configuration guide:

https://fluxcd.io/flux/installation/configuration/vertical-scaling/

Enabling persistent storage for internal artifacts:

https://fluxcd.io/flux/installation/configuration/vertical-scaling/#persistent-storage-for-flux-internal-artifacts

kingdonb commented 6 months ago

@comminutus I don't think this doc addresses your workflow concern specifically; maybe we could open a separate issue. I think persistent volume configuration outside of Flux's own internal resources is an expansion of the scope, and it should not fall on Flux to document all of the strange configurations that may exist across every cloud provider.

But if there is some specific workflow we could document to make the experience better, besides the new scaling docs mentioned above, maybe we can have a new discussion about it?