arup-group / elara

Command line utility for processing MATSim events output files.
MIT License
13 stars 4 forks source link

Code Review in Preparation for FOSS #191

Closed andkay closed 2 years ago

andkay commented 2 years ago

Opening an issue for discussion.

Code quality and documentation is obviously number one, but also a random thought.

The repo is quite large 680.71 MiB. I suspect this is related to history of Test Fixtures and Test Outputs. Would be good to BFG this and get a strategy in place so this doesn't explode again.

fredshone commented 2 years ago

agree - options: (1) shallow clone - https://stackoverflow.com/questions/4515580/how-do-i-remove-the-old-history-from-a-git-repository (2) remove based on file size using BFG (3) remove all history (4)?

I like 1 - but not 100% sure of consequences

mfitz commented 2 years ago

I'm assuming you hit the button by mistake, rather than you're telling us there are no options @fredshone ?

andkay commented 2 years ago

I've been using shallow clones on workboxes -- which indeed work, but I understand they have some unintended consequences, and recommending it for installation feels weird (never seen a repo that says to do this).

I think we would need to BFG based on file names/extensions instead of sizes. None of the data is particularly large -- its just that many commits have updated many small files many times.

I'm interested in finding a more permanent solution. So long as we continue to push refreshed test data to the repo, this is going to come up again I think 🤔

mfitz commented 2 years ago

Some detail on the repo size...

Repo is definitely excessively chonky at nearly 700 megs:

:~/workspace/elara(elar-3.7.6)$ du -hs
691M    .

The Git pack file is the source of the chonkiness, as we suspected:

$ du | sort -nr

1416120 .
1399480 ./.git
1398800 ./.git/objects
1398688 ./.git/objects/pack
6488    ./tests
4288    ./tests/test_intermediate_data
3288    ./tests/test_intermediate_data/benchmarks
1288    ./elara
744     ./tests/test_outputs
568     ./elara/__pycache__
408     ./tests/test_fixtures
360     ./tests/__pycache__
304     ./example_benchmark_data
224     ./.git/logs
216     ./.git/refs
216     ./.git/logs/refs
192     ./images
184     ./example_benchmark_data/test_fixtures
176     ./.git/refs/remotes/origin
176     ./.git/refs/remotes
176     ./.git/logs/refs/remotes/origin
176     ./.git/logs/refs/remotes
136     ./tests/test_outputs/benchmarks
96      ./example_benchmark_data/test_town
88      ./.git/hooks
80      ./scripts
...

Looks like some MATSim events files were accidentally checked in at some point - see the tests/londinium_hermes_10iters_msonly/ dir - and have since been removed, but they still exist in the pack file and seem to be responsible for most of the chonk:

$ git rev-list --objects --all |   git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |   sed -n 's/^blob //p' |   sort --numeric-sort --key=2 |   cut -c 1-12,41- |   $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest | grep -vF --file=<(git ls-tree -r HEAD | awk '{print $3}')

...
a52f7b6d4d54  6.8MiB tests/test_fixtures/output_plans.xml
402e10ec2770   11MiB tests/test_fixtures/output_network.xml.gz
3bbec54f38e9   22MiB tests/test_fixtures/output_events.xml
b746fe957c2b   46MiB tests/test_fixtures/output_transitSchedule.xml
ce672c79a39d   58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.9/9.events.xml.gz
000c11f17633   58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.10/10.events.xml.gz
65b0fbaa53b1   58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.8/8.events.xml.gz
cbee81aa6e65   58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.5/5.events.xml.gz
3466c31b35ce   58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.6/6.events.xml.gz
fddf435dc967   58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.7/7.events.xml.gz
db280895d97c   58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.4/4.events.xml.gz
0ad6859f424b   58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.3/3.events.xml.gz
b8898eb5d836   58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.2/2.events.xml.gz
d038bccd9080   58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.1/1.events.xml.gz
75a06db42166   58MiB tests/londinium_hermes_10iters_msonly/ITERS/it.0/0.events.xml.gz

I will purge these files from the pack file and see how the repo size looks - should be a nice slimming.