Closed andkay closed 2 years ago
agree - options: (1) shallow clone - https://stackoverflow.com/questions/4515580/how-do-i-remove-the-old-history-from-a-git-repository (2) remove based on file size using BFG (3) remove all history (4)?
I like 1 - but not 100% sure of consequences
I'm assuming you hit the button by mistake, rather than you're telling us there are no options @fredshone ?
I've been using shallow clones on workboxes -- which indeed work, but I understand they have some unintended consequences, and recommending it for installation feels weird (never seen a repo that says to do this).
I think we would need to BFG based on file names/extensions instead of sizes. None of the data is particularly large -- its just that many commits have updated many small files many times.
I'm interested in finding a more permanent solution. So long as we continue to push refreshed test data to the repo, this is going to come up again I think 🤔
Some detail on the repo size...
Repo is definitely excessively chonky at nearly 700 megs:
:~/workspace/elara(elar-3.7.6)$ du -hs
691M .
The Git pack file is the source of the chonkiness, as we suspected:
$ du | sort -nr
1416120 .
1399480 ./.git
1398800 ./.git/objects
1398688 ./.git/objects/pack
6488 ./tests
4288 ./tests/test_intermediate_data
3288 ./tests/test_intermediate_data/benchmarks
1288 ./elara
744 ./tests/test_outputs
568 ./elara/__pycache__
408 ./tests/test_fixtures
360 ./tests/__pycache__
304 ./example_benchmark_data
224 ./.git/logs
216 ./.git/refs
216 ./.git/logs/refs
192 ./images
184 ./example_benchmark_data/test_fixtures
176 ./.git/refs/remotes/origin
176 ./.git/refs/remotes
176 ./.git/logs/refs/remotes/origin
176 ./.git/logs/refs/remotes
136 ./tests/test_outputs/benchmarks
96 ./example_benchmark_data/test_town
88 ./.git/hooks
80 ./scripts
...
Looks like some MATSim events files were accidentally checked in at some point - see the tests/londinium_hermes_10iters_msonly/
dir - and have since been removed, but they still exist in the pack file and seem to be responsible for most of the chonk:
$ git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sed -n 's/^blob //p' | sort --numeric-sort --key=2 | cut -c 1-12,41- | $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest | grep -vF --file=<(git ls-tree -r HEAD | awk '{print $3}')
...
a52f7b6d4d54 6.8MiB tests/test_fixtures/output_plans.xml
402e10ec2770 11MiB tests/test_fixtures/output_network.xml.gz
3bbec54f38e9 22MiB tests/test_fixtures/output_events.xml
b746fe957c2b 46MiB tests/test_fixtures/output_transitSchedule.xml
ce672c79a39d 58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.9/9.events.xml.gz
000c11f17633 58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.10/10.events.xml.gz
65b0fbaa53b1 58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.8/8.events.xml.gz
cbee81aa6e65 58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.5/5.events.xml.gz
3466c31b35ce 58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.6/6.events.xml.gz
fddf435dc967 58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.7/7.events.xml.gz
db280895d97c 58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.4/4.events.xml.gz
0ad6859f424b 58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.3/3.events.xml.gz
b8898eb5d836 58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.2/2.events.xml.gz
d038bccd9080 58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.1/1.events.xml.gz
75a06db42166 58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.0/0.events.xml.gz
I will purge these files from the pack file and see how the repo size looks - should be a nice slimming.
Opening an issue for discussion.
Code quality and documentation is obviously number one, but also a random thought.
The repo is quite large
680.71 MiB
. I suspect this is related to history of Test Fixtures and Test Outputs. Would be good to BFG this and get a strategy in place so this doesn't explode again.