argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Shallow clone option for reposerver? #5467

Closed askreet closed 4 weeks ago

askreet commented 3 years ago

Summary

I'm wondering if the team would be interested in having a feature that allows the reposerver to shallow clone the target repository, to an optional depth value. We have a use case that is probably abnormal in that we are provisioning entire environments with their own argocd deployment, and then deploying a fleet of applications. Some of the git repositories have quite a lot of history and exceed the existing exec timeout (which we have had to extend). Using a shallow clone (i.e. --depth=XXX) would drastically improve initialization time for us. We may be able to dedicate some resources to the implementation if the feature is desirable.

Motivation

Deploying argocd often in short-lived environments results in long wait times for initial git clones.

Proposal

Introduce a tunable option for shallow clones, either as an environment variable or an annotation (which is preferred?).

alexmt commented 3 years ago

I think it could be optionally enabled per repository. Apparently, shallow clones are expensive on the server side. See https://github.com/Homebrew/discussions/discussions/225

askreet commented 3 years ago

@alexmt It puts burden on the server side to compute the bundle that is sent over the wire, if I understand correctly. In Homebrew's case, that's probably a noticeable impact to GitHub (where Homebrew uses GitHub as a backing repository server, essentially, and has a huge commit history).

When you said "I think it can be optionally enabled per repository", you mean if it were to be implemented, right? I haven't missed some feature that already exists, I hope. This would be ideal for us, as most of our repositories are small, but some have huge history where --depth=1 would help greatly.

alexmt commented 3 years ago

Thank you for the clarification, @askreet!

I was trying to say that the shallow clone should not be the default behavior. This is not supported yet. We would have to introduce e.g. a fetchDepth field in the repository settings and use it in the git client:

https://github.com/argoproj/argo-cd/blob/fb8096a1f7f3dcca51eff3018288708ce523da7e/util/git/client.go#L264
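If it were implemented, the fetchDepth idea could look roughly like the sketch below. This is a hypothetical illustration, not Argo CD's actual git client code: the function name, the fetchDepth parameter, and the argument layout are all assumptions; only the git flag itself is real.

```go
package main

import "fmt"

// buildCloneArgs is a hypothetical sketch of how a per-repository fetchDepth
// setting could be translated into arguments for the repo-server's initial
// clone. fetchDepth <= 0 keeps today's behavior: a full-history clone.
func buildCloneArgs(repoURL, dest string, fetchDepth int) []string {
	args := []string{"clone", repoURL, dest}
	if fetchDepth > 0 {
		args = append(args, fmt.Sprintf("--depth=%d", fetchDepth))
	}
	return args
}

func main() {
	fmt.Println(buildCloneArgs("https://github.com/argoproj/argo-cd.git", "/tmp/repo", 1))
}
```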

askreet commented 3 years ago

I was doing some experimentation with this this morning and realized that if it does get implemented, it's more involved than simply passing --depth=1 to the initial clone. As soon as you call git fetch origin --tags --force on the new repository, it fetches much of the historical state you avoided with --depth=1, assuming you have tags dating back to the beginning of the repository.

It could be that with fetchDepth set we skip fetching tags, or apply some other workaround.
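The tag-skipping workaround could be expressed roughly as below. This is a sketch under stated assumptions: the depth setting and the surrounding function are hypothetical, while git's --no-tags flag is real and does prevent tag-driven history from being fetched.

```go
package main

import "fmt"

// buildFetchArgs sketches the workaround discussed above: when a depth limit
// is set, fetch with --no-tags so that tags pointing at old commits don't
// drag the avoided history back in. Illustrative only; not Argo CD code.
func buildFetchArgs(depth int) []string {
	args := []string{"fetch", "origin", "--force"}
	if depth > 0 {
		args = append(args, fmt.Sprintf("--depth=%d", depth), "--no-tags")
	} else {
		// Today's behavior: fetch all tags along with branch heads.
		args = append(args, "--tags")
	}
	return args
}

func main() {
	fmt.Println(buildFetchArgs(1))
	fmt.Println(buildFetchArgs(0))
}
```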

jannfis commented 3 years ago

It could be that with fetchDepth set we skip fetching tags, or apply some other workaround.

I think that's worth a try. By limiting the fetch depth, the user explicitly agrees to ignore the history of the repo, so we shouldn't require any of it, imo.

jannfis commented 3 years ago

@alexmt Regarding the new parallelism features for repositories, I understood that we prevent a new clone of the repository per application. Does it take the target revision into account, or will it switch between target revisions in the checked-out repository on the repo-server?

I haven't looked, but if the latter is true, I think parallelism would have to be disabled for shallow clones, with a check for them here as well:

https://github.com/argoproj/argo-cd/blob/824ff732a2a34874c24fff4af1babc382ecb765a/pkg/apis/application/v1alpha1/types.go#L164-L174

phs commented 3 years ago

Popped in to ask for this exact feature.

I was testing manual clones against my own large repo, to understand what I would like Argo to even do. I noticed that while initial checkouts with (say) --depth 1 worked well, I could not then git pull --depth 1 without running into fatal: refusing to merge unrelated histories errors. Likewise, explicitly allowing unrelated histories did not help.

Instead, I found that I was required to git fetch --depth 1 and then git reset --hard origin/the/branch to move from one shallow checkout to the next. Which given the context is completely acceptable to me. I don't know offhand how argo performs its pulls/merges, but I would hope the hard-reset option remains on the table.
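The fetch-then-reset workflow described above can be written down as a command sequence. A minimal sketch, assuming the repo-server would shell out to git; the helper function and branch handling are illustrative, the git invocations are the ones described in the comment.

```go
package main

import "fmt"

// shallowUpdateCommands returns the git invocations for updating a shallow
// clone the way described above: since `git pull --depth 1` fails with
// "refusing to merge unrelated histories", fetch shallowly and then
// hard-reset the working tree to the remote branch.
func shallowUpdateCommands(branch string) [][]string {
	return [][]string{
		{"git", "fetch", "--depth", "1", "origin", branch},
		{"git", "reset", "--hard", "origin/" + branch},
	}
}

func main() {
	for _, cmd := range shallowUpdateCommands("main") {
		fmt.Println(cmd)
	}
}
```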

dodwmd commented 2 years ago

Bumping this request, as we're running into fetch-pack: invalid index-pack output errors.

prein commented 2 years ago

Would git clone [remote-url] --branch [name] --single-branch be worth considering in addition to --depth 1?
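Combined with a depth limit, that suggestion would amount to a clone invocation like the sketch below. The helper name is illustrative, not Argo CD's actual API; the git flags themselves are real.

```go
package main

import "fmt"

// singleBranchCloneArgs sketches the combination suggested above: restrict
// the clone to one branch (--single-branch) *and* to the most recent commit
// (--depth 1), so refs and history for other branches are never fetched.
func singleBranchCloneArgs(repoURL, branch string) []string {
	return []string{"clone", repoURL, "--branch", branch, "--single-branch", "--depth", "1"}
}

func main() {
	fmt.Println(singleBranchCloneArgs("https://github.com/argoproj/argo-cd.git", "master"))
}
```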

imtiazc commented 1 year ago

As our repo started growing, we ran into issues where the argocd-repo-server is taking too long to fetch the contents and is timing out:

level=error msg="finished unary call with code Internal" error="rpc error: code = Internal desc = Failed to fetch default: git fetch origin --tags --force failed timeout after 1m30s" grpc.code=Internal grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2023-02-14T23:22:08Z" grpc.time_ms=90007.375 span.kind=server system=grpc

nuno-silva commented 1 year ago

+1 on supporting shallow clone :pray:

emirot commented 1 year ago

I would like to know how this could be implemented.

Should it be a command argument or an env variable on the repo server that just updates this function? https://github.com/argoproj/argo-cd/blob/master/util/git/client.go#L334

Or should the depth be specified when adding a repository, so there is an option for each repo?

Or does it need to be added at the application level, with a new depth field?

    ....
    spec:
      project: default
      source:
        repoURL: https://github.com/argoproj/argo-cd.git
        targetRevision: HEAD
        path: applicationset/examples/list-generator/guestbook/{{cluster}}
        depth: 5
      destination:
        ....

dodwmd commented 1 year ago

As applications function as pipelines of sorts, I'd think it would be at the application level. It could mean repos would need to be checked out multiple times, once per application, if they were on different SHAs.

QuinnBast commented 1 year ago

Would love to see support for shallow clones or some other way to speed up the clone process. Argo CD should only really ever need the latest commit (or a single SHA) of a branch, so a git clone --depth 1 should be enough to fetch the current files in the repository/branch. Argo CD just runs kustomize/kubectl apply/helm commands, so I'm not sure there is any need for it to keep or fetch git history anyway; if we need a different SHA/branch, we will just update the Argo Application to point to a different targetRevision. Not to mention, improving clone speed would also speed up the sync process, even for non-large repos :)

Something else that could be helpful (or an alternative) is the ability to filter and only clone/download specific directories. We tend to keep all of our infrastructure/k8s files in a separate folder, and ArgoCD is never going to need to be aware of any of the source code for the app. I think that currently spec.source.path just cd's into that path, but I tend to only need files in or below that folder anyway. A quick wget or curl on, for example, http://<repositoryUrl>/-/tree/<targetRevision>/<sourcePath> could be an extremely quick alternative that would bypass cloning a git repository.

For now, since it takes forever to clone our main repository, our workaround is to create a new git repository for argo/infrastructure and link it as a git submodule from our main repository. However, we would only run argo against the submodule... though ideally we would remove the submodule completely, as this now requires us to manage MRs in two repositories, worrying about submodules, etc...
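As an aside, git itself can approximate the directory-filtering idea above without resorting to wget: a blobless partial clone plus sparse checkout downloads file contents only for the selected paths. A sketch of the equivalent invocations follows; this is not something Argo CD supports today, and the helper function is illustrative, though the git flags and subcommands are real.

```go
package main

import "fmt"

// sparseCheckoutCommands sketches git's native way to materialize only one
// directory: a blobless partial clone (--filter=blob:none) defers blob
// downloads, and sparse checkout then fetches contents only under `dir`.
func sparseCheckoutCommands(repoURL, dir string) [][]string {
	return [][]string{
		{"git", "clone", "--filter=blob:none", "--no-checkout", "--depth", "1", repoURL, "repo"},
		{"git", "-C", "repo", "sparse-checkout", "set", dir},
		{"git", "-C", "repo", "checkout"},
	}
}

func main() {
	for _, cmd := range sparseCheckoutCommands("https://github.com/argoproj/argo-cd.git", "manifests") {
		fmt.Println(cmd)
	}
}
```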

blakepettersson commented 1 year ago

Something like this is being worked on in #14272

batazor commented 3 months ago

any news?

crenshaw-dev commented 1 month ago

Copying from here:

Sparse and shallow git repos are common requests for monorepos. People are accustomed to using these features in CI pipelines. But these settings are much easier to get right in CI pipelines because you throw away the clone when you're done with it. Argo CD maintains a persistent clone on the repo-server, allowing concurrent access to the same clone until the repo-server restarts. When managing a persistent clone, you have to handle cases which would be safe to ignore for a throw-away clone.

For example:

1. What happens if two applications with different depths/paths access the same repo at the same time?
2. If I change one of these settings, when does it take effect? Immediately? On the next checkout?
3. Is storage efficiency impacted? i.e., is the size of my clone going to blow up over time?
4. Is CPU use impacted? Will I see CPU spikes due to cleanup processes?
5. Is there a need for manual cleanups? When do you call them? Are there concurrency concerns when calling them?

The concurrency concerns are different depending on whether you configure depth/paths at the app level or the repo level. If at the repo level, you're less likely to encounter races, but it's still possible.

Due to Argo CD's persistent repo cache, this feature can't just be implemented; it has to be carefully designed. Specifically, the design needs to answer the questions above.

crenshaw-dev commented 4 weeks ago

Consolidating conversation here: https://github.com/argoproj/argo-cd/issues/11198