argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

repo-server duplicates git packfiles, filling up disk #8845

Closed crenshaw-dev closed 2 years ago

crenshaw-dev commented 2 years ago

Describe the bug

In version 2.3.1, each time repo-server fetches a new commit from a repo which has a packfile, the packfile is duplicated. So N commits means N packfiles, when git should only have one packfile. If the file is big, this can fill up the repo-server disk.

This doesn't happen in 2.2.7.

To Reproduce

Create an app using a repo which has a packfile. I've been using a repo which has ~70k commits. When I clone that locally, I can see that there's a packfile in .git/objects/pack.

Remote into the repo-server pod and list the files in /tmp/_argocd-repo//.git/objects/pack. You'll have to chmod +rx some directories to get access. There should be one .idx and one .pack file.

Push a new commit to the repo and do a hard refresh on the app. List pack files again, and you'll see an additional .idx and an additional .pack file.
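To make the before/after comparison less error-prone, the listing step can be reduced to a count. This is a minimal sketch, not something from Argo CD: `count_packs` and its directory argument are hypothetical names, and it assumes a `find` that supports `-maxdepth` (GNU and BSD both do).

```shell
#!/bin/sh
# count_packs DIR: print the number of *.pack files under DIR/.git/objects/pack.
# DIR is a hypothetical argument: point it at the repo checkout being inspected.
count_packs() {
    pack_dir="$1/.git/objects/pack"
    # No pack directory yet means no packfiles.
    [ -d "$pack_dir" ] || { echo 0; return 0; }
    # Count only the .pack files; each one has a matching .idx next to it.
    find "$pack_dir" -maxdepth 1 -type f -name '*.pack' | wc -l | tr -d ' '
}
```

Run it once before the hard refresh and once after. A healthy repo should report the same number both times.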

Expected behavior

I expected git to maintain one pack file.

Version

v2.3.1

Logs

I've manually added --verbose and GIT_TRACE=1 to the git calls. There's nothing interesting in the logs as far as I can tell.

I've also commented out the initializer and closer logic that sets repo directory permissions, and manually set the _argocd-repo permissions to rwx. No effect.

Finally I've tried downgrading git to 2.30.2 by building an image based on Ubuntu 21.04. Same bug.

I'm out of hunches.

crenshaw-dev commented 2 years ago

Reproduced on master in 9d4ed2847, so https://github.com/argoproj/argo-cd/pull/8517 is not to blame.

Reproduced on master in f364330de, so https://github.com/argoproj/argo-cd/commit/8139df898339c6eb497c072af4150c783ba9ba57 is not to blame.

Looks like the problem goes pretty far back in the changes that were made to 2.3.

crenshaw-dev commented 2 years ago

Failed to reproduce in 6abccea3f. Reproduced in 4aa614daf.

Those are adjacent in master. https://github.com/argoproj/argo-cd/pull/5605 appears to be the culprit!

crenshaw-dev commented 2 years ago

The code itself is a little difficult to follow. Here's the important part.

    err = gitClient.Init()

    // the old way
    //err = gitClient.Fetch("")
    //err = gitClient.Checkout(revision, false)
    // push a commit here
    //err = gitClient.Fetch("")
    //err = gitClient.Checkout(revision, false)
    // observe that the pack file has not been duplicated

    // the new way
    err = gitClient.Fetch("some-revision")
    err = gitClient.Checkout("FETCH_HEAD", false)
    // push a commit here
    err = gitClient.Fetch("some-revision")
    err = gitClient.Checkout("FETCH_HEAD", false)
    // observe that the pack file HAS been duplicated.

Basically if you comment out the new way and uncomment the old way, you won't observe the duplicated pack files.

I've created a way to reproduce the bug with just git:

/tmp $ mkdir argo-cd
/tmp $ cd argo-cd
/t/argo-cd $ git init
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint: 
hint:   git config --global init.defaultBranch <name>
hint: 
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint: 
hint:   git branch -m <name>
Initialized empty Git repository in /private/tmp/argo-cd/.git/
/t/argo-cd (master|✔) $ git remote add origin https://github.com/argoproj/argo-cd.git
/t/argo-cd (master|✔) $ git fetch origin 32be020af0f8bf6438201ee79b4d2b8037c57154 --tags --force
remote: Enumerating objects: 54420, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 54420 (delta 1), reused 6 (delta 0), pack-reused 54411
Receiving objects: 100% (54420/54420), 43.14 MiB | 5.11 MiB/s, done.
Resolving deltas: 100% (36561/36561), done.
From https://github.com/argoproj/argo-cd
 * branch              32be020af0f8bf6438201ee79b4d2b8037c57154 -> FETCH_HEAD
 * [new tag]           stable        -> stable
... removed a bunch of tags that were here ...
/t/argo-cd (master|✔) $ ls .git/objects/pack
pack-2dedc829f2924a0dba7d1b99026a1c27c0a61fd2.idx  pack-2dedc829f2924a0dba7d1b99026a1c27c0a61fd2.pack
/t/argo-cd (master|✔) $ git fetch origin 32d33dedcc70d94177384b235891b99d89497273 --tags --force
remote: Enumerating objects: 1279, done.
remote: Counting objects: 100% (686/686), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 1279 (delta 678), reused 681 (delta 676), pack-reused 593
Receiving objects: 100% (1279/1279), 3.39 MiB | 2.03 MiB/s, done.
Resolving deltas: 100% (858/858), completed with 266 local objects.
From https://github.com/argoproj/argo-cd
 * branch              32d33dedcc70d94177384b235891b99d89497273 -> FETCH_HEAD
/t/argo-cd (master|✔) $ ls .git/objects/pack
pack-2dedc829f2924a0dba7d1b99026a1c27c0a61fd2.idx  pack-2dedc829f2924a0dba7d1b99026a1c27c0a61fd2.pack pack-b7c2bb5a0166eb4d9ac04c4214b296cc82242e68.idx  pack-b7c2bb5a0166eb4d9ac04c4214b296cc82242e68.pack
/t/argo-cd (master|✔) $ 

I'm not sure what this teaches us.

crenshaw-dev commented 2 years ago

git gc helps. But I'd rather prevent the problem than spend in-request time cleaning it up.
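If cleanup were used as a stopgap, one way to keep most of the cost out of the request path might be to repack only when the pack count crosses a threshold. The sketch below is an assumption, not Argo CD code: `too_many_packs`, its arguments, and the limit of 2 in the usage line are all made up for illustration. git's own `git gc --auto` applies a similar check through the `gc.autoPackLimit` setting.

```shell
#!/bin/sh
# too_many_packs DIR LIMIT: succeed (exit 0) when DIR holds more than LIMIT packfiles.
# The function name and both arguments are hypothetical, not part of Argo CD.
too_many_packs() {
    pack_dir="$1/.git/objects/pack"
    limit="$2"
    # A repo without a pack directory cannot exceed any limit.
    [ -d "$pack_dir" ] || return 1
    count=$(find "$pack_dir" -maxdepth 1 -type f -name '*.pack' | wc -l | tr -d ' ')
    [ "$count" -gt "$limit" ]
}

# Usage (commented out so the sketch has no side effects):
# too_many_packs /path/to/repo 2 && git -C /path/to/repo gc
```

This only moves the cleanup cost around. It doesn't stop the duplicate packs from being created.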

crenshaw-dev commented 2 years ago

Okay. So when you run git fetch, git looks in its config for the "refspec" of the given remote. Based on that refspec, it fetches the commits behind every ref which matches. It seems the default refspec is +refs/heads/*:refs/remotes/origin/*. So git fetch origin won't get other types of refs, like refs/pull/123/head.

git does however support fetching a specific SHA. That works even for commits which are only reachable from refs outside the default refspec. So when a user specifies refs/pull/123/head, Argo CD gets the SHA corresponding to that ref, and git fetch origin <SHA> pulls down that specific commit.
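To make the distinction concrete, here is a small sketch of the "does this ref fall under the default refspec?" check. It is an illustration only: `in_default_refspec` is a hypothetical helper, it hard-codes the `refs/heads/*` source pattern rather than reading the remote's real config, and it ignores tags, which git handles separately.

```shell
#!/bin/sh
# in_default_refspec REF: succeed when REF matches refs/heads/*, the source side
# of the default fetch refspec. Hypothetical helper for illustration only.
# The refspec actually configured for a remote can be read with:
#   git config --get-all remote.origin.fetch
in_default_refspec() {
    case "$1" in
        refs/heads/?*) return 0 ;;  # branches: covered by a plain "git fetch origin"
        *) return 1 ;;              # anything else needs an explicit fetch
    esac
}
```

With this, refs/heads/main passes and refs/pull/123/head does not, which is the case that pushed Argo CD toward fetching by SHA.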

Unfortunately, it seems like git isn't very tidy when you fetch specific SHAs like that.

So here is my proposal. Instead of defaulting to fetching specific commits, first just do a standard git fetch origin and try to check out the given revision. If the checkout fails, then fetch the specific SHA with git fetch origin <SHA> and check out FETCH_HEAD.

For users of non-standard refs, this causes a performance hit, because you run git fetch origin and then git fetch origin <SHA>. I'm going to assert that those users are in the minority and that the performance hit is tolerable.

For users of standard refs, this improves disk usage, because git fetch origin without the SHA is less messy. I'll assert a couple reasons this is worth it:

1) Disk usage is more important than latency. Doubling response time for checkoutRevision for a small number of users isn't as bad as filling up the repo-server disk.
2) The number of people with disk-usage-impacted repos is larger than the number of people using non-standard refs. I think this is a reasonably safe assumption, because many gitops repos have automated commits and therefore a lot of commits (meaning there are pack files which are likely to be duplicated after a git fetch origin <SHA>).
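The proposed order of operations could look roughly like this in plain git. It is a sketch under assumptions, not the eventual PR: `checkout_revision` is a hypothetical name, the revision is assumed to be a SHA that was already resolved from the requested ref, and the --tags and --force flags from the earlier transcript are left out for brevity.

```shell
#!/bin/sh
# checkout_revision REV: sketch of the proposed fetch-then-fallback order.
# Run inside a repo that has an "origin" remote. The function name is hypothetical;
# REV is assumed to be a commit SHA already resolved from the requested ref.
checkout_revision() {
    rev="$1"
    # Standard path: a plain fetch, which uses the remote's configured refspec.
    git fetch origin || return 1
    if git checkout "$rev"; then
        return 0
    fi
    # Fallback for revisions the default refspec did not bring in:
    # fetch that one SHA explicitly and check out what was fetched.
    git fetch origin "$rev" || return 1
    git checkout FETCH_HEAD
}
```

Users of standard refs never reach the second fetch. Only revisions outside the default refspec pay for both.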

Will put up a PR tomorrow.

crenshaw-dev commented 2 years ago

Another demonstration:

mkdir argo-cd
cd argo-cd/
git init
git remote add origin https://github.com/argoproj/argo-cd.git
git fetch origin
git checkout 497e53b0203638409e3083fa2ffac7d8fb3cce14
git fetch origin
git checkout 32be020af0f8bf6438201ee79b4d2b8037c57154
git fetch origin
git checkout 32d33dedcc70d94177384b235891b99d89497273
git fetch origin
git checkout 2e65b42f05bcc1401d1489e751993ec197f6942c
git fetch origin
git checkout b1ff9dbe1e3e3b2520e94eefc77d0322c765cd75
ls .git/objects/pack  # shows two files
du -sh .  # current directory is 96M
cd ..
mkdir argo-cd-fetch
cd argo-cd-fetch/
git init
git remote add origin https://github.com/argoproj/argo-cd.git
git fetch origin 497e53b0203638409e3083fa2ffac7d8fb3cce14
git checkout FETCH_HEAD
git fetch origin 32be020af0f8bf6438201ee79b4d2b8037c57154
git checkout FETCH_HEAD
git fetch origin 32d33dedcc70d94177384b235891b99d89497273
git checkout FETCH_HEAD
git fetch origin 2e65b42f05bcc1401d1489e751993ec197f6942c
git checkout FETCH_HEAD
git fetch origin b1ff9dbe1e3e3b2520e94eefc77d0322c765cd75
git checkout FETCH_HEAD
ls .git/objects/pack  # shows ten files
du -sh .  # current directory is 244M

crenshaw-dev commented 2 years ago

I asked on StackOverflow why the packfile behavior is so different and got a really interesting answer: https://stackoverflow.com/questions/71618307/why-would-fetching-specific-git-commits-use-more-disk-space-than-fetching-all

alexmt commented 2 years ago

Looks like this is a pretty bad regression. We should cherry-pick the fix into v2.3.