IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Reduce size of R-Instat GitHub repository from 4.2 GB to <1 GB #5905

Open lloyddewit opened 4 years ago

lloyddewit commented 4 years ago

The R-Instat GitHub repository (repo) is currently around 4.2 GB. This means that the repo can only be reliably cloned if the developer has a good Internet connection. Some of the interns are unable to clone the repo.

The repo is also currently too large to clone internally within GitHub. This is because GitHub will not clone a repo that has files larger than 100 MB. The R-Instat repo has packed object files (used by Git internally) that are larger than 100 MB. This prevents us from making a cloned repo in GitHub for backup or experimentation.

The large repo size is caused by data files that have since been deleted. If a file is deleted, Git (and therefore GitHub) still stores all previous versions of it in the repo history.

If we remove all deleted data files then the repo will reduce in size from 4.2 GB to 0.14 GB (a 96% reduction).

I tested with the following commands in Git Bash:

git clone --mirror https://github.com/africanmathsinitiative/R-Instat
du -hs R-Instat.git
java -jar bfg-1.13.0.jar --delete-files "*.{zip,gz,rds,RDS,csv,xlsx}" R-Instat.git
cd R-Instat.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
du -hs .
git push --mirror https://github.com/lloyddewit/test2007171759.git

Notes:

$ find . -iname '*.csv' -print0 | du --files0-from - -c -hs | tail -1
71M total

$ find . -iname '*.xlsx' -print0 | du --files0-from - -c -hs | tail -1
7.0M total


- I think it would be good practice to remove all data files from the repo and host them somewhere else.
- The repo contains other deleted files. We could remove these as well, but I think it would have only a negligible effect on the repo size.
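
One way to check which deleted files actually dominate the history before choosing extensions to remove is to list the largest blobs across all refs. The following is a minimal sketch using only standard Git plumbing. The function name `largest_blobs` is hypothetical, and the reported sizes are uncompressed object sizes, not the space used inside the pack.

```shell
# List the N largest blobs anywhere in history, including files that were
# later deleted. Output: size in bytes, a tab, then the path.
# Usage (hypothetical helper): largest_blobs <repo-dir> [N]
largest_blobs() (
  repo="$1"
  n="${2:-20}"
  git -C "$repo" rev-list --objects --all |
    git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { size = $2; sub(/^blob [0-9]+ /, ""); print size "\t" $0 }' |
    sort -rn |
    head -n "$n"
)
```

For example, `largest_blobs R-Instat.git 50` would print the 50 largest blobs in the mirror clone made above.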

**Next steps**
@dannyparsons As you know, changing the Git history is inherently risky and we may make a mistake or cause unintended side effects. 
- Please could you review the steps above?
- Do you have any extra ideas on how to reduce the risk (e.g. additional backups)?
- If we completely remove deleted files from the repo then does this prevent us building older versions if we ever need to do this?
- Is it safe to remove all deleted files with type `*.{zip,gz,rds,RDS,csv,xlsx}` or should we just remove `*.{zip,gz}`?
- If interns cannot clone the current large repo, should they try to clone the cleaned repo [here](https://github.com/lloyddewit/test2007171759.git)? They could practice fixing issues and doing PRs on this repo, but they would need to reclone and resubmit all PRs when the official clean repo is available.

chrismclarke commented 1 year ago

I've been looking at similar issues (although much smaller scale) on another IDEMS repo.

Before taking any other action, I would recommend that anyone who does not need the full git history clones with the --depth 1 flag, e.g.

git clone --depth 1 https://github.com/africanmathsinitiative/R-Instat.git
remote: Enumerating objects: 1880, done.
remote: Counting objects: 100% (1880/1880), done.
remote: Compressing objects: 100% (1352/1352), done.
remote: Total 1880 (delta 986), reused 927 (delta 476), pack-reused 0
Receiving objects: 100% (1880/1880), 66.11 MiB | 3.26 MiB/s, done.
Resolving deltas: 100% (986/986), done.
Updating files: 100% (2214/2214), done.

Only about 66 MiB is received, compared with 4.47 GiB for a full clone:

remote: Enumerating objects: 191512, done.
remote: Counting objects: 100% (1279/1279), done.
remote: Compressing objects: 100% (453/453), done.
remote: Total 191512 (delta 888), reused 1110 (delta 806), pack-reused 190233
Receiving objects: 100% (191512/191512), 4.47 GiB | 4.11 MiB/s, done.
Resolving deltas: 100% (130765/130765), done.
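
A shallow clone does not have to be permanent. As a minimal sketch, the hypothetical helpers below wrap the shallow clone and the later conversion to a full clone with `git fetch --unshallow`. Note that `--depth` implies `--single-branch`, so only the default branch is fetched initially.

```shell
# Clone only the most recent commit of the default branch.
# Usage (hypothetical helper): shallow_clone <url> <dest-dir>
shallow_clone() (
  url="$1"
  dest="$2"
  git clone --quiet --depth 1 "$url" "$dest"
)

# If the full history turns out to be needed later, fetch it in place
# instead of recloning.
# Usage (hypothetical helper): deepen_to_full <repo-dir>
deepen_to_full() (
  git -C "$1" fetch --quiet --unshallow
)
```

For example, `shallow_clone https://github.com/africanmathsinitiative/R-Instat.git R-Instat` followed later by `deepen_to_full R-Instat`. The second step downloads the full history, so it needs a good connection.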

I've used BFG Repo-Cleaner a lot in the past and it's really good for deleting all files with a given extension. If you are unsure whether some of those files might still be needed, you could also look at git filter-repo, which has some nice analysis tools and can remove files based on path names.

As for risks, ideally a backup should be taken and stored somewhere in case it is required, e.g. Google Cloud Storage Coldline or Archive storage, which can cost less than $0.01 per GB per month.
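
One possible way to take such a backup is to pack every ref into a single bundle file, which can be uploaded to any storage service and cloned from later. This is a sketch only. The function name `backup_bundle` is hypothetical, and the output path should be absolute because the commands run inside a temporary mirror clone.

```shell
# Create a single-file backup of every ref in a repository and verify it.
# Usage (hypothetical helper): backup_bundle <url-or-path> <absolute-output-file>
backup_bundle() (
  src="$1"
  out="$2"
  work=$(mktemp -d) || exit 1
  # Mirror clone so that all branches and tags are included.
  git clone --quiet --mirror "$src" "$work/mirror.git" &&
    git -C "$work/mirror.git" bundle create "$out" --all &&
    git -C "$work/mirror.git" bundle verify "$out"
  status=$?
  rm -rf "$work"
  exit "$status"
)
```

The history can be restored from the resulting file with `git clone <bundle-file> <dir>`.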

chrismclarke commented 1 year ago

In case it is of interest, here is some of the output of git filter-repo --analyze --force for deleted files and folders:

directories-deleted-sizes.txt

path-deleted-sizes.txt

These can be used to help determine what might be good to scrub from the repo. For example, take the largest folder no longer present in the repo, instat/static/InstatObject/R/extras (removed 2020-02-07).

Before

git count-objects -v -H

count: 0
size: 0 bytes
in-pack: 191512
packs: 1
size-pack: 4.48 GiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

Example Operation

git filter-repo --path instat/static/InstatObject/R/extras --invert-paths

After: the size is already reduced by almost 3.5 GB, to a total of just over 1 GB

git count-objects -v -H

count: 0
size: 0 bytes
in-pack: 186267
packs: 1
size-pack: 1.04 GiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes
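
The before/after comparison above can also be scripted, which may help when trying several candidate folders in turn. The two helper names below are hypothetical. Without `-H`, `git count-objects -v` reports `size-pack` in KiB.

```shell
# Print the packed size of a repository in KiB (the "size-pack" field).
# Usage (hypothetical helper): pack_size_kib <repo-dir>
pack_size_kib() (
  git -C "$1" count-objects -v | awk '$1 == "size-pack:" { print $2 }'
)

# Print the percentage saved when going from <before> to <after>
# (any unit, as long as both values use the same one).
# Usage (hypothetical helper): percent_saved <before> <after>
percent_saved() (
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", 100 * (before - after) / before }'
)
```

With the figures reported above, `percent_saved 4.48 1.04` gives 76.8, i.e. removing that one folder saves roughly three quarters of the pack size.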

The next largest deleted folders I can see are instat/static/Library (removed 2021-02-17) and instat/static/InstatObject/R/RPackages (removed 2016-09-14). Removing them takes the repo down to 435 MB in total.

Folder-based gains start to drop off from there onwards, but it would still be possible to prune quite a lot of deleted files if it is reasonable to do so.

Full disclaimer: I don't know much about the context of this repo or the files/folders removed, so I would still recommend a cautious approach (plus backups) if planning to implement steps similar to those outlined above.

lloyddewit commented 1 year ago

@chrismclarke thank you for taking the trouble to share all the interesting info above.

@rdstern @ChrisMarsh82

Before taking any other action, I would recommend that anyone who does not need the full git history clones with the --depth 1 flag, e.g.

git clone --depth 1 https://github.com/africanmathsinitiative/R-Instat.git

I haven't tested this, but it could be a very valuable tip for anyone trying to clone R-Instat over a low-bandwidth connection.