codeforamerica / projectmonitor

ProjectMonitor is a CI display aggregator. It displays the status of multiple Continuous Integration builds on a single web page.
cfa-project-monitor.herokuapp.com

Project repo is huuuuuge. Lose the history? #13

Closed ondrae closed 9 years ago

ondrae commented 9 years ago

I keep trying to clone it, but I only get to 87% on this cafe wifi. I had to download just the master branch as a .zip file without the history.

Since we've rewritten it anyways, can we separate it from the forked pivotal version?

migurski commented 9 years ago

Have you tried the --single-branch option for git clone? It’ll ignore anything other than the one branch you’re asking for.
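
As a rough sketch of what that looks like in practice, the script below builds a throwaway local repository (all paths and the `old-release` branch name are made up for illustration), clones only `master` from it, and then opts in to a second branch later without re-cloning. For the real project, the `file://` URL would presumably be replaced by the GitHub clone URL.

```shell
#!/bin/sh
# Sketch only: exercise --single-branch against a throwaway local repository.
# Every path and branch name here is hypothetical.
set -e
work=$(mktemp -d)
cd "$work"

# Build a tiny "upstream" repository with two branches.
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/master
git config user.email "demo@example.com"
git config user.name "Demo"
echo "app" > app.txt
git add app.txt
git commit -q -m "master commit"
git checkout -q -b old-release
echo "legacy" > legacy.txt
git add legacy.txt
git commit -q -m "old release commit"
git checkout -q master
cd ..

# Clone only master; the other branch is neither fetched nor tracked.
git clone -q --single-branch --branch master "file://$work/upstream" lean
cd lean
before=$(git branch -r | grep -c 'origin/old-release' || true)

# Later, opt in to one more branch without re-cloning.
git remote set-branches --add origin old-release
git fetch -q origin
after=$(git branch -r | grep -c 'origin/old-release' || true)

echo "old-release tracked before: $before, after: $after"
```

The `git remote set-branches --add` step is what a collaborator would need if a second branch becomes interesting after a single-branch clone.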

monfresh commented 9 years ago

I think this might be worth re-opening. At 18F, we'd like to be able to send pull requests to you, so we want to fork the repo via GitHub. However, by forking it, we assume the load of the current repo, and we'd prefer our repo not to be 43MB in size. The culprit is a 42MB .pack file in .git/objects/pack.

monfresh commented 9 years ago

Looks like it's a bad pack file:

git verify-pack -v .git/objects/pack/pack-63968d00c2176e05298b52a129572aee5991630d.idx

fatal: Cannot open existing pack file '.git/objects/pack/pack-63968d00c2176e05298b52a129572aee5991630d.idx'
.git/objects/pack/pack-63968d00c2176e05298b52a129572aee5991630d.pack: bad

monfresh commented 9 years ago

Never mind. I pasted the wrong pack.

To see the 10 largest objects in the pack, run this:

git verify-pack -v .git/objects/pack/pack-e0dc5715594689368b1d28eeff86930591cc5d7f.idx \
| sort -k 3 -n \
| tail -10

To see what each file is, run this:

git rev-list --objects --all | grep [first few chars of the sha1 from previous output]
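
The two lookups above can probably be folded into one helper so the sha1 does not have to be copied by hand. This is only a sketch: the function name `largest_blobs` is made up, it assumes it is run from the repository root, and it only sees objects that are already packed (run `git gc` first if the repo has loose objects).

```shell
# largest_blobs [N]: print "size path" for the N largest packed blobs.
# Hypothetical helper; run from the repository root.
largest_blobs() {
  count=${1:-10}

  # Snapshot the object-to-path mapping once instead of once per object.
  map=$(mktemp)
  git rev-list --objects --all > "$map"

  # verify-pack -v prints: sha1 type size size-in-pack offset [...]
  # Keep only blobs, sort by uncompressed size, and take the last N.
  for idx in .git/objects/pack/pack-*.idx; do
    git verify-pack -v "$idx"
  done \
    | awk '$2 == "blob" { print $3, $1 }' \
    | sort -n \
    | tail -n "$count" \
    | while read -r size sha; do
        # Blobs unreachable from any ref have no path and print as size only.
        path=$(grep "^$sha " "$map" | head -n 1 | cut -d' ' -f2-)
        echo "$size $path"
      done

  rm -f "$map"
}
```

Calling `largest_blobs 10` should give roughly the same list as the two commands above, with the file path next to each size in bytes.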

You will notice that all the files are either .gem or .jar. The next step would be to clean up the repository's history by removing all of those unnecessary files.

One option is to use the bfg-repo-cleaner tool, which worked great for me, and was super fast.

Alternatively, you could do it manually following this git article, as outlined below:

git filter-branch --index-filter 'git rm --cached --ignore-unmatch *.gem' -- --all
rm -Rf .git/refs/original
rm -Rf .git/logs/
git gc --aggressive --prune=now

Then repeat with .jar files:

git filter-branch --index-filter 'git rm --cached --ignore-unmatch *.jar' -- --all
rm -Rf .git/refs/original
rm -Rf .git/logs/
git gc --aggressive --prune=now

Then verify:

git count-objects -v

Your size-pack should be a lot smaller now.
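
Since `git rm` accepts more than one pathspec, the two passes could likely be combined into a single rewrite. The sketch below is one way to do that, not the procedure from the article: the function name `strip_gems_and_jars` is made up, the globs are quoted so the shell never expands them, and the backup refs and reflogs are cleared with `git update-ref -d` and `git reflog expire` instead of `rm -Rf`. As with the commands above, it rewrites every ref, so it is best tried on a fresh clone.

```shell
# strip_gems_and_jars: remove *.gem and *.jar from all history in one pass.
# Hypothetical helper; run from the root of a clean working tree.
strip_gems_and_jars() {
  # Newer Git versions print a filter-branch deprecation notice and pause;
  # this environment variable silences it.
  FILTER_BRANCH_SQUELCH_WARNING=1 \
  git filter-branch --index-filter \
    'git rm -q --cached --ignore-unmatch -- "*.gem" "*.jar"' -- --all

  # Drop the refs/original backups that filter-branch leaves behind.
  git for-each-ref --format='%(refname)' refs/original/ \
    | while read -r ref; do
        git update-ref -d "$ref"
      done

  # Expire reflog entries so the old objects become unreachable, then repack.
  git reflog expire --expire=now --all
  git gc --aggressive --prune=now
}
```

Comparing `git count-objects -v` before and after the call should show whether size-pack actually dropped.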

monfresh commented 9 years ago

I should also note that the bfg-repo-cleaner tool will clean out more than just .gem and .jar files. If you use the command listed in their Usage section (java -jar bfg.jar --strip-biggest-blobs 500 some-big-repo.git), it will clean out the 500 biggest files. When I looked through the log, there were a bunch of .yml and .rb files there as well, which are obviously not needed anymore.

migurski commented 9 years ago

Does GitHub actually count those 42MB? Is it unworkable for you to check out the single master branch?

Long term, we’d probably just trash the last-rails-version release, which shares no history with the current master.

shawnbot commented 9 years ago

Isn't there only one branch, though? I vote for pruning the ruby stuff using @monfresh's approach.

monfresh commented 9 years ago

@migurski Yes, when you clone this repo (or any fork of it), git downloads all 43MB of it.

This is not that big of a deal since we have access to decent internet most of the time. For now, having to wait a little longer to clone the repo is better than having to use single branch mode IMO.

Consider this scenario:

  1. I fork this repo using GitHub's fork button, which doesn't allow you to specify what part of the repo to fork. Hence, you get all 43MB of it.
  2. I clone my fork on my machine using the --single-branch option. This gives me the lean version.
  3. I create a feature branch on my machine and send a PR to our fork.
  4. One of my teammates wants to check out this new branch on her machine so she can collaborate on it with me. The only way for her to check out this new branch is either to clone the entire repo or to remember to use the --single-branch option, in addition to specifying the branch name she wants to work on. That can get annoying.

migurski commented 9 years ago

I ran the two filter-branch commands, and the size looks to be 3.41MB.

monfresh commented 9 years ago

Sounds about right to me. Thanks!