Argo CD: PersistentVolumeClaims for volumes

docent-net commented 4 years ago

Is your feature request related to a problem? Please describe.

I'm trying to mount remote storage to the argocd repo-server. I can provide volumeMounts list as well as volumes list, but I can't create PVC for those volumes using helm chart.

Describe the solution you'd like

volumes.yaml could contain a definition of StorageClass name for PVC. When provided, the volume mount would be created on top of this PVC.

Describe alternatives you've considered

Alternatively, documentation could be extended with information about using PVC and manual creation of PVC.

Additional context

I can see, that similar feature is already implemented in redis-ha: https://github.com/argoproj/argo-helm/blob/d7da8e863f30a6975b89a1bba34855d83bed59e3/charts/argo-cd/charts/redis-ha/templates/redis-ha-statefulset.yaml#L285

Also, this request is only about repo-server, but similar could be requested for Redis and other components.

If this looks like something useful, I can provide PR.

tlvenn commented 3 years ago

I think this is more than useful and should definitely comes outside of the box. @docent-net any chance you can get a stab at this fairly soon ?

docent-net commented 3 years ago

Sure, on it :)

docent-net commented 3 years ago

@tlvenn I need to ask a question, about this. Read through ArgoCD docs (e.g. here https://argoproj.github.io/argo-cd/operator-manual/high_availability/#argocd-repo-server) and it looked like a straightforward thing - just create volumeClaimTemplates with proper configuration for the PVC and that's all.

However, repo-server can be replicated, so here it hit me, that maybe we'd need a StatefulSet here instead of Deployment, because there is some state.. or not? Unfortunately, I'm not that familiar w/ArgoCD HA, but after going through the documentation and thinking about it I have 2 assumptions to discuss:

ArgoCD is rather a stateless app (but for the Redis, which is a cache, so its data can be lost). We don't really care about contents of the repo-server, and the only "important" directory is /tmp, where repos are cloned. "important" because the only reason we care about mounting it from the remote repo is that its size can grow and Pod disk can run out of space. In this case, when Pod is lost for some reason (node down, Pod was evicted by scheduler) underlying storage can be safely removed by k8s, as it will not be used anymore and its contents are not important. In this case, we can stay with Deployment resource as it's currently defined and basically create new PVs via volumeClaimTemplates.
We care about the state and contents of /tmp and we need to replace Deployment w/Statefulset to be sure, that replicas use the same storage as before.

After writing this whole post I think that the correct answer is nr.1, but just need to validate it w/you :) Thanks

tlvenn commented 3 years ago

Hi @docent-net , I am no expert when it comes to argo cd but my feeling is that the repo server is a cache layer and I would opt for option 1 as well.

tlvenn commented 3 years ago

For illustration purpose, not saying this a good idea, Grafana chart for example supports both approaches:

https://github.com/grafana/helm-charts/blob/main/charts/grafana/templates/deployment.yaml#L1 https://github.com/grafana/helm-charts/blob/main/charts/grafana/templates/statefulset.yaml#L1

docent-net commented 3 years ago

I'm going w/opt1 - lets face it - the data is stored in /tmp, so let's treat it as temporary :) Thanks, will get into it

github-actions[bot] commented 3 years ago

Stale issue message

tlvenn commented 3 years ago

@docent-net did you make any progress on this ?

jbartlet commented 3 years ago

I'm a little confused...I am trying to deploy ArgoCD in HA mode and use a PVC for the repo-server to avoid the disk consumption issue, described above. Of course, there is an issue with one PVC. Option 1 above mentions leave as a Deployment and use volumeClaimTemplates....but those aren't supported for Deployments, only SS. Am I missing something? Do I just convert to SS?

ramanNarasimhan77 commented 3 years ago

Can this issue be re-opened? Right now the only possible option without making changes to the helm chart is to use emptyDir

kaidobit commented 2 years ago

Please reopen this issue:

I'm using Longhorn replicated blockstorage, indeed it is my default storageclass and I would love to make use of it for argocd. This should be the default behavior at least for non-HA, @tlvenn gave a good example for a working non-HA/HA Chart:

For illustration purpose, not saying this a good idea, Grafana chart for example supports both approaches:

https://github.com/grafana/helm-charts/blob/main/charts/grafana/templates/deployment.yaml#L1 https://github.com/grafana/helm-charts/blob/main/charts/grafana/templates/statefulset.yaml#L1

Furthermore as @docent-net already pointed out, this Chart is partially making use of the ability to set a custom storageclass:

Additional context

I can see, that similar feature is already implemented in redis-ha:

https://github.com/argoproj/argo-helm/blob/d7da8e863f30a6975b89a1bba34855d83bed59e3/charts/argo-cd/charts/redis-ha/templates/redis-ha-statefulset.yaml#L285

snuggie12 commented 2 years ago

https://kubernetes.io/docs/concepts/storage/_print/#generic-ephemeral-volumes is alpha in 1.19, beta in 1.21 and stable in 1.23.

pierluigilenoci commented 2 years ago

Please reopen this issue.

pierluigilenoci commented 2 years ago

@mkilchhofer could you please take a look?

mkilchhofer commented 2 years ago

@pierluigilenoci I don't understand why someone wants to configure PV on Argo CD components.

We only implement use cases which are supported by the upstream project itself (https://github.com/argoproj/argo-cd).

One thing I am okay with is to add the ability to deploy any kind of resources (extra manifests) like the bitnami charts do.

snuggie12 commented 2 years ago

@mkilchhofer quoted from the official docs:

argocd-repo-server clones repository into /tmp ( of path specified in TMPDIR env variable ). Pod might run out of disk space if have too many repository or repositories has a lot of files. To avoid this problem mount persistent volume.

The problem is they officially tell you to do something without telling you how to.

FWIW, I'm using my suggested approach of ephemeral volumes while keeping it a Deployment but I know that's not available to everyone since it's in 1.21+ as beta:

volume for tmp for repo-server

``` spec: template: spec: volumes: - ephemeral: volumeClaimTemplate: metadata: creationTimestamp: null labels: type: argocd-repo-server-tmp spec: accessModes: - ReadWriteOnce resources: requests: storage: 10Gi storageClassName: standard-ssd volumeMode: Filesystem ```

mkilchhofer commented 2 years ago

ACK. That makes sense. There's already an issue upstream for that: https://github.com/argoproj/argo-cd/issues/7927

pierluigilenoci commented 2 years ago

@mkilchhofer I don't have a particular fetish for PVs.

I just wish that in the presence of large caches, a situation aggravated also by some bug (https://github.com/argoproj/argo-cd/issues/4002 and https://github.com/helm/helm/issues/10583), the Repo Server pods are not evicted ('Pod The node had condition: [DiskPressure]. ') or, even worse, other pods are evicted in the same node (exactly what happened in my case).

In any case the idea of allowing the deployment of extra manifests is always a good idea (and if you look at my history on GitHub it is something I particularly appreciate).

Thanks for your help and your time.

pierluigilenoci commented 2 years ago

It might make sense to link the CNCF's Slack (stranded) discussion as well: https://cloud-native.slack.com/archives/C020XM04CUW/p1653511733823829

pierluigilenoci commented 2 years ago

@docent-net any update on this?

docent-net commented 2 years ago

@pierluigilenoci sure. I've been following this discussion (as well as that mentioned on Slack and regarding StatefulSets) and I'm not sure right now.

This issue is about Deployment. My original problem was that having replica=1 for Deployment type, I wanted to have persistent storage for the git cache. Thanks to that, in case of POD replacement, my infra wouldn't be impacted with traffic/performance spike due to cache rebuilding / git repos pulling on a higher scale.

And I have implemented that, and it works on my cluster correctly. But I have only one replica.

Now - with replicas>1 there's problem of AccessModes, that might be very specific. Having it set to ReadWriteMany (and that is a requirement for this workload), could mean a huge performance drop due to underlying file system configurations.

Why do I think so? Thinking about possible scenarios, where this PVC would be used I can only think of considerable caches with many, many (possible little) files (that is how git repositories looks like, thinking .git directories, indexes etc). And FS synchronization across nodes with such huge distributed disk might be really problematic. Now think of possible issues, that could occur as a side effect of it. The user has inconsistent git directories because of some consistency problem of underlying DFS.

Having this said, I'm not sure anymore, that this PR is a good idea. StatefulSet might be a better approach here.

I can still raise a PR, but I'm just afraid, that someone someday will have a big headache due to this.

pierluigilenoci commented 2 years ago

@docent-net sorry for the delay in answering.

The answer you are looking for is Statefulset. If you have multiple repo-server replicas you must also have multiple disks, one for each replica. So the solution for assigning 1:1 disks is Statefulset.

Using NFS is a problem of overcomplication and poor performance. From personal experience I can say that sometimes a cache on NFS is slower than re-loading everything from scratch over the network.

The alternative solution is to use emptyDir but it does not solve the node disk load problem. Unless you put the disk in memory but in this case the cost increases.

The perfect solution does not exist but something needs to be done.

aroundthecode commented 2 years ago

I've one solution to propose 1) default deploy behaviour is the current one with emptyDir

2) if a different volume definition is provided in values file then it's applied as-is, without any additional implementation, leaving final user the desired approach for volume management .

repoServer:
 existingVolumes:
    tmp_dir:
      persistentVolumeClaim:
        claimName: pvc-argocd-repo-server-tmp

I've performed such changes on my forked repo for details: https://github.com/argoproj/argo-helm/compare/main...aroundthecode:argo-helm:reposerver-pvc

With such apporach user can create PV/PVC outside helm and then attach to the deploy, or leave everything as is. It could be a quick fix allowing user to avoid diskpressure issue and going on discussing Statefulset migration

arielly-parussulo commented 1 year ago

Hello there! Do you intend to open a PR for this solution so we can have it in the oficial Helm Chart? I am trying to migrate an internal chart that we use to the official one and this is one of the features that we would need in order to achieve that. For now I will create a fork but I think that this config should be in the oficial Helm Chart as it is recommended by the High Availability doc.

docent-net commented 1 year ago

@aroundthecode

I've performed such changes on my forked repo for details: https://github.com/argoproj/argo-helm/compare/main...aroundthecode:argo-helm:reposerver-pvc

So, are you able to raise a PR for this, since you already have it worked out?

arielly-parussulo commented 1 year ago

I ended up creating a PR to add support to deploy argocd-repo-server as a StatefulSet with PVCs based on what we used in my company: https://github.com/argoproj/argo-helm/pull/1648 Feel free to add suggestions or editing it.

pierluigilenoci commented 1 year ago

@arielly-parussulo, your PR is a bit stranded...

aroundthecode commented 8 months ago

@aroundthecode

I've performed such changes on my forked repo for details: main...aroundthecode:argo-helm:reposerver-pvc

So, are you able to raise a PR for this, since you already have it worked out?

HI all, I've tried to perform an alternative PR #2410 with my approach on this, not sure if everthing is fine, please check and advise if needed

snuggie12 commented 8 months ago

@aroundthecode oddly enough I was probably about a week away from making a PR since we're about to start using the helm chart. Yours looks more comprehensive though. Thanks!

aroundthecode commented 8 months ago

@snuggie12 ( or anyone) can you please help me with the PR? it seems something is missing in the docs generation but I can't realize what.

I've added the helm-docs comments to my values.yaml section but is seems something is still missing since it's telling me the documentation is outdated.

snuggie12 commented 8 months ago

@snuggie12 ( or anyone) can you please help me with the PR? it seems something is missing in the docs generation but I can't realize what.

I've added the helm-docs comments to my values.yaml section but is seems something is still missing since it's telling me the documentation is outdated.

@aroundthecode It's running helm-docs in the check and then doing a git diff and seeing there's a change to be made. What it wants you to do is run helm-docs yourself and commit the updated README.md. According to this there should be a script which cleanly executes (it mentions a container so I think you'll need some sort of container runtime like docker or containerd/cri.) If that's an issue in your env you can always install the helm plugin yourself.

Or if you just want to cheat you can look at the diff in the GH action output and paste it in the README.md (mostly kidding as you should probably follow their instructions, but wanted to leave it as a last restort.)

argoproj / argo-helm

Argo CD: PersistentVolumeClaims for volumes #438