blaylockbk / Herbie

Download numerical weather prediction datasets (HRRR, RAP, GFS, IFS, etc.) from NOMADS, NODD partners (Amazon, Google, Microsoft), ECMWF open data, and the University of Utah Pando Archive System.
https://herbie.readthedocs.io/
MIT License

Mitigate unusually large repository size #141

Open amotl opened 1 year ago

amotl commented 1 year ago

Hi again,

Related to GH-140: because the rendered HTML documentation has been committed to the repository itself ^1, the repository weighs in at an unusually large 114 MB, so Git operations take more time and transfer bandwidth than necessary. While GH-140 would resolve the matter going forward, it does not shrink the repository retroactively.
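
To see where the size comes from, something like the following could help. It is a minimal sketch using only stock Git plumbing; the helper name `largest_blobs` is made up for illustration, and nothing in it is specific to Herbie.

```shell
# Hypothetical helper: list the largest blobs reachable from any ref,
# as "<size in bytes> <path>", biggest first.
largest_blobs() {
    # $1 = how many entries to print (defaults to 10)
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '$1 == "blob"' |
        cut -d' ' -f2- |
        sort -rn |
        head -n "${1:-10}"
}
```

Run inside a clone, `git count-objects -vH` reports the packed size, and `largest_blobs 20` shows which files account for most of it.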

So, although it will break any existing forks, I would strongly recommend rewriting the repository history to remove this large chunk of content, using a tool like BFG Repo-Cleaner.

If you agree, I can help you implement the necessary steps.

With kind regards, Andreas.

amotl commented 1 year ago

About

Using the steps outlined below, I've shrunk the repository and uploaded it to https://github.com/amotl/herbie-without-docs to demonstrate the effect. Download time, bandwidth, and disk usage all decrease significantly.

Shrink the repository size

git clone --mirror https://github.com/blaylockbk/Herbie.git
cd Herbie.git
bfg --delete-folders _build --no-blob-protection .
git reflog expire --expire=now --all && git gc --prune=now --aggressive
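
To check the result before publishing it, one could count how many objects in the rewritten history still live inside a `_build` directory. This is a sketch with stock Git only; the function name `count_build_paths` is hypothetical.

```shell
# Hypothetical check: count objects reachable from any ref whose path
# lies inside a directory named "_build".
count_build_paths() {
    # grep -c exits non-zero when the count is 0, hence the "|| true"
    git rev-list --objects --all | grep -c -E '[ /]_build/' || true
}
```

Run inside the rewritten mirror clone, a count of zero is the goal; any other number means some ref still reaches the old content.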

Before

time git clone https://github.com/blaylockbk/Herbie
real    0m21.635s
user    0m4.099s
sys     0m1.933s

du -sch Herbie
355M    total

After

time git clone https://github.com/amotl/herbie-without-docs.git
real    0m9.420s
user    0m1.643s
sys     0m0.854s

du -sch herbie-without-docs/
 84M    total

blaylockbk commented 1 year ago

I like the idea of cleaning up the unnecessary /_build directory; it sounds like the repo will just keep getting bigger if I keep doing the same.

I found this tool that does the same thing (it's installable with conda, and its docs show how to convert a command from BFG): https://github.com/newren/git-filter-repo/

conda install -c conda-forge git-filter-repo
git clone https://github.com/blaylockbk/Herbie.git
cd Herbie
git filter-repo --invert-paths --path-glob '*/_build'

Before doing this, I'd like to better understand the implications of rewriting the git history. In your example, it looks like the tags and releases are lost. What else is lost?

And to be clear, instead of keeping the rendered docs in the Herbie repo, the rendered docs will be stored on readthedocs servers. So, I won't need to manually make html; as long as the build works on readthedocs, they make the docs for each pull request and merge. Correct?

amotl commented 1 year ago

> The rendered docs will be stored on readthedocs servers. They make the docs for each pull request and merge. Correct?

Correct!

> Before doing this, I'd like to better understand the implications of rewriting the git history. In your example, it looks like the tags and releases are lost. What else is lost?

Oh, that might be the case. Well, it would be a bit sad, but there is probably no way around it. Maybe let's research this detail a bit more beforehand?
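
One way to research this could be to compare the tag names in the original clone against the rewritten one. A rough sketch, assuming both repositories are available as local clones; `missing_tags` is a made-up helper name. It compares names only, since a history rewrite changes commit hashes and the tags will point at different commits by design.

```shell
# Hypothetical helper: print tag names that exist in repository $1
# but not in repository $2.
missing_tags() {
    old_list=$(mktemp)
    new_list=$(mktemp)
    git -C "$1" tag --list | sort > "$old_list"
    git -C "$2" tag --list | sort > "$new_list"
    # comm -23 keeps lines that appear only in the first file
    comm -23 "$old_list" "$new_list"
    rm -f "$old_list" "$new_list"
}
```

For example, `missing_tags Herbie herbie-without-docs` would print nothing if every tag name survived the rewrite.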