epiforecasts / EpiNow2

Estimate Realtime Case Counts and Time-varying Epidemiological Parameters
https://epiforecasts.io/EpiNow2/dev/

repo size #538

Closed sbfnk closed 4 months ago

sbfnk commented 7 months ago

The repo has grown fairly large (~1 GB), but the files currently in the repo are only 11 MB in size. It might be nice, particularly for those on low-bandwidth connections or paying by volume, to look at reducing the size without losing any relevant development history.

Using git filter-repo --analyze reveals a few potential easy gains:

> cat filter-repo/analysis/directories-deleted-sizes.txt 
=== Deleted directories by reverse size ===
Format: unpacked size, packed size, date deleted, directory name
     6810551    3891751 2020-07-22 docs
     6341588    3835465 2020-07-22 docs/reference
     4357982    3715367 2020-07-22 docs/reference/figures
     3724958    1913831 2022-12-19 deps/bootstrap-5.1.3
     3118580    1835087 2022-12-19 deps/bootstrap-5.1.3/fonts
    23619299     747946 2023-02-03 src
     9092536     729112 2023-01-17 inst/pkg-structure
       53863       8984 2023-02-02 .devcontainer
       35864       5688 2023-02-02 .devcontainer/library-scripts
       26185       1104 2020-07-22 docs/news
           0         90 2022-10-15 tests/testthat/test-data
> head -n 10 filter-repo/analysis/path-deleted-sizes.txt 
=== Deleted paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name(s)
    65775542   64539999 2020-11-30 synthetic.rds
     5907877    5236611 2020-11-08 man/figures/unnamed-chunk-11-1.png
     5207701    4641809 2021-06-03 reference/figures/unnamed-chunk-11-1.png
     3976705    3530160 2020-11-08 man/figures/unnamed-chunk-12-1.png
     3776311    3362607 2020-07-22 docs/reference/figures/unnamed-chunk-11-1.png
     3552356    3153533 2021-06-03 reference/figures/unnamed-chunk-12-1.png
     3143380    3080722 2023-10-03 data/example_regional_epinow.rda
     1619219    1504604 2021-06-03 reference/epinow-5.png

At the very least this suggests to me that all the directories above, as well as all png files in man/figures (which, if I understand correctly, aren't used anywhere) could be purged. A line to exclude png files in man/figures could also be added to .gitignore. This could be followed by a deeper investigation of blob sizes for existing files.
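For the deeper look at blob sizes, plain git plumbing may be enough. The sketch below lists the largest blobs reachable from any ref, deleted files included. The `largest_blobs` helper and the throwaway demo repository are hypothetical and only there so the snippet runs on its own; in practice the function would be pointed at a clone of the real repo.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print "size path" for the N largest blobs in history.
# Usage: largest_blobs <repo-dir> [N]
largest_blobs() {
  git -C "$1" rev-list --objects --all |
    git -C "$1" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob"' |      # keep blobs, drop commits and trees
    cut -d' ' -f2- |          # strip the object type column
    sort -k1,1nr |            # largest first
    head -n "${2:-10}"
}

# Throwaway demo repo: one large file that is later deleted, one small file.
demo=$(mktemp -d)
git init -q "$demo"
dd if=/dev/zero of="$demo/big.bin" bs=1000 count=200 2>/dev/null
echo hi > "$demo/small.txt"
git -C "$demo" add .
git -C "$demo" -c user.name=demo -c user.email=demo@example.com commit -q -m "add files"
git -C "$demo" rm -q big.bin
git -C "$demo" -c user.name=demo -c user.email=demo@example.com commit -q -m "delete big file"

# The deleted file still dominates the history.
largest_blobs "$demo" 1   # prints: 200000 big.bin
```

The sizes reported here are unpacked sizes, so they correspond to the first column of the filter-repo analysis rather than the packed one.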

seabbs commented 7 months ago

yes definitely agree. Certainly the main culprits (docs, deps, and src). Agree we could remove prior figures from the old readme as well

sbfnk commented 7 months ago

Running

> git filter-repo \
  --path src/ \
  --path deps/ \
  --path dev/ \
  --path reference/ \
  --path synthetic.rds \
  --path data/example_regional_epinow.rda \
  --path data/example_estimate_infections.rda \
  --path-regex 'man/figures/unnamed-chunk-[0-9]+-1\.png' \
  --path-regex 'inst/dev/figs/.*scores\.png' \
  --invert-paths

reduces the size of the repo from 1.1GB to 34MB. Any objections to going ahead with it? I could create a backup fork in my personal account first.
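As an alternative or complement to a backup fork, a local mirror clone keeps every ref (branches and tags) in one place, and `git count-objects -vH` gives a before/after size to compare. This is only a sketch: the `origin` directory below is a throwaway stand-in for the real repository, which would be cloned from its actual URL instead.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)

# Stand-in for the real repository (hypothetical; use the real URL in practice).
git init -q "$work/origin"
git -C "$work/origin" symbolic-ref HEAD refs/heads/main
echo demo > "$work/origin/README"
git -C "$work/origin" add README
git -C "$work/origin" -c user.name=demo -c user.email=demo@example.com commit -q -m "initial commit"
git -C "$work/origin" tag v0.1

# A mirror clone copies all refs, not just the default branch.
git clone -q --mirror "$work/origin" "$work/backup.git"

# Size of the object store; record this before rewriting anything.
git -C "$work/backup.git" count-objects -vH

# Check that branches and tags both made it into the backup.
backup_refs=$(git -C "$work/backup.git" for-each-ref --format='%(refname)' | sort | tr '\n' ' ')
echo "$backup_refs"
```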

Given that this would require a force push, anyone who has the repo checked out locally will have to do a git reset at some point. I don't think there's a way around this; the alternative is to keep things as they are. On balance I think it's worth it, but if anyone disagrees please leave a comment.
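For anyone with an existing checkout, the resync after the force push might look like the sketch below. It assumes the remote is called `origin` and the branch is `main`; the three throwaway repositories and the `g` helper are hypothetical scaffolding so the snippet runs standalone, and `commit --amend` merely stands in for the real history rewrite.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
# Hypothetical helper: run git in a demo repo with a fixed identity.
g() { dir=$1; shift; git -C "$work/$dir" -c user.name=demo -c user.email=demo@example.com "$@"; }

# Maintainer repo, bare "origin", and a collaborator clone.
git init -q "$work/maintainer"
g maintainer symbolic-ref HEAD refs/heads/main
echo v1 > "$work/maintainer/file"
g maintainer add file
g maintainer commit -q -m "original history"
git clone -q --bare "$work/maintainer" "$work/origin.git"
g maintainer remote add origin "$work/origin.git"
git clone -q "$work/origin.git" "$work/collaborator"

# Maintainer rewrites history and force pushes.
g maintainer commit -q --amend -m "rewritten history"
g maintainer push -q --force origin main

# Collaborator: fetch the rewritten branch, then move the local branch onto it.
# WARNING: reset --hard discards uncommitted changes and local-only commits.
g collaborator fetch -q origin
g collaborator reset -q --hard origin/main
```

Unpushed local work would need to be saved first (for example as patches) and reapplied on top of the new history; a fresh clone is the simpler option when there is nothing local to keep.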

jamesmbaazam commented 7 months ago

I'm not sure of the cons, so I'd say go ahead. It's good that you're keeping a backup just in case.

Bisaloo commented 7 months ago

I agree this is necessary but highlighting some important caveats we discovered with @ntorresd when going through the same process with serofoi:

@ntorresd, did I forget anything?

ntorresd commented 7 months ago

I would only add that you will not see the effects of the cleanup until the clean versions of the git tags have been pushed. When we did this with @Bisaloo for serofoi, we didn't see the change reflected in fresh copies of the repository until we ran git push origin v0.0.9 -f from my local cleaned copy.
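A minimal sketch of why the tags matter: until a tag is force-pushed, the remote copy still points at the old commit and keeps the old objects reachable. The demo repositories and the `g` helper are hypothetical, and `commit --amend` plus `tag -f` stand in for the rewrite (git filter-repo rewrites tags locally by itself).

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
# Hypothetical helper: run git in a demo repo with a fixed identity.
g() { dir=$1; shift; git -C "$work/$dir" -c user.name=demo -c user.email=demo@example.com "$@"; }

git init -q "$work/local"
g local symbolic-ref HEAD refs/heads/main
echo v1 > "$work/local/file"
g local add file
g local commit -q -m "release"
g local tag v0.0.9
git clone -q --bare "$work/local" "$work/origin.git"   # copies the tag too
g local remote add origin "$work/origin.git"
old_tag=$(git -C "$work/origin.git" rev-parse v0.0.9)

# Stand-in for the history rewrite: new commit, tag moved onto it.
g local commit -q --amend -m "release (history rewritten)"
g local tag -f v0.0.9 >/dev/null

# Pushing the branch alone leaves the remote tag on the old commit.
g local push -q --force origin main
stale_tag=$(git -C "$work/origin.git" rev-parse v0.0.9)

# Force-push all tags so the remote no longer references old history.
g local push -q --force --tags origin
new_tag=$(git -C "$work/origin.git" rev-parse v0.0.9)
```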

Bisaloo commented 7 months ago

Thanks, I had forgotten about the tags.

I wonder about the impact of all of this on renv.lock lockfiles, since they store a hash of the source :thinking:

sbfnk commented 7 months ago

Thanks all for the helpful comments. To confirm I will:

which should address all the points raised above, unless I've forgotten something.

Bisaloo commented 7 months ago

Yes, this seems right.

To be 100% clear because a previous version of my message wasn't: from what we've seen in serofoi, I don't think you'll be able to reopen closed PRs. You will have to create new ones. No issues from a git point of view, but conversation will be spread across two PRs.

sbfnk commented 7 months ago

Ah ok probably worth waiting for currently open ones to be merged then.

sbfnk commented 4 months ago

To do before 1.5 release

sbfnk commented 4 months ago

~I've done the steps outlined above and the force push succeeded - old refs are still there and PRs still open though, so not sure if I'm missing a step or if it's a matter of waiting for repacking.~ see next comment

sbfnk commented 4 months ago

Upon closer inspection, the vast majority of the repo content was in the gh-pages branch, so I've done a big squash there, which has reduced the size to manageable levels (1.1 GB -> 100 MB).
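The thread does not say how the squash was done; one possible way, sketched here as an assumption, is to create a single parentless commit holding the current gh-pages tree and point the branch at it. The demo repository and the `g` helper are hypothetical.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
# Hypothetical helper: run git in the demo repo with a fixed identity.
g() { git -C "$work/site" -c user.name=demo -c user.email=demo@example.com "$@"; }

# Demo gh-pages branch with three deploy commits.
git init -q "$work/site"
g symbolic-ref HEAD refs/heads/gh-pages
for i in 1 2 3; do
  echo "build $i" > "$work/site/index.html"
  g add index.html
  g commit -q -m "Deploy build $i"
done
before=$(g rev-list --count gh-pages)

# Squash: one new root commit with the same tree, then move the branch to it.
tree=$(g rev-parse 'gh-pages^{tree}')
squashed=$(g commit-tree "$tree" -m "Squash gh-pages history")
g update-ref refs/heads/gh-pages "$squashed"
after=$(g rev-list --count gh-pages)

echo "commits before: $before, after: $after"
# In a real repo this would be followed by: git push --force origin gh-pages
```

The site content is unchanged because the tree is reused; only the history behind it is dropped. The old objects stay on disk until they are garbage collected, locally and on the hosting side.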