Open lloyddewit opened 4 years ago
I've been looking at similar issues (although much smaller scale) on another IDEMS repo.
Before taking any other action, I would recommend that anyone who does not need the full git history clone with the --depth 1 flag, e.g.
git clone --depth 1 https://github.com/africanmathsinitiative/R-Instat.git
remote: Enumerating objects: 1880, done.
remote: Counting objects: 100% (1880/1880), done.
remote: Compressing objects: 100% (1352/1352), done.
remote: Total 1880 (delta 986), reused 927 (delta 476), pack-reused 0
Receiving objects: 100% (1880/1880), 66.11 MiB | 3.26 MiB/s, done.
Resolving deltas: 100% (986/986), done.
Updating files: 100% (2214/2214), done.
That is 66 MB received, compared with 4.47 GB for a full clone:
remote: Enumerating objects: 191512, done.
remote: Counting objects: 100% (1279/1279), done.
remote: Compressing objects: 100% (453/453), done.
remote: Total 191512 (delta 888), reused 1110 (delta 806), pack-reused 190233
Receiving objects: 100% (191512/191512), 4.47 GiB | 4.11 MiB/s, done.
Resolving deltas: 100% (130765/130765), done.
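As a rough sketch of the shallow-clone workflow described above, a shallow clone can be converted to a full one later if the history turns out to be needed. The helper names shallow_clone, deepen_to_full and commit_count are made up for illustration; git clone --depth and git fetch --unshallow are standard Git options.

```shell
# Clone only the most recent commit of a repository.
# $1 = repository URL, $2 = target directory
shallow_clone() {
    git clone --quiet --depth 1 "$1" "$2"
}

# Download the rest of the history into an existing shallow clone.
# $1 = path to the shallow clone
deepen_to_full() {
    git -C "$1" fetch --quiet --unshallow
}

# Number of commits reachable from HEAD (1 in a fresh --depth 1 clone).
# $1 = path to the clone
commit_count() {
    git -C "$1" rev-list --count HEAD
}
```

For example, shallow_clone https://github.com/africanmathsinitiative/R-Instat.git R-Instat followed later by deepen_to_full R-Instat. Note that --depth is ignored for plain local paths, so a local test needs a file:// URL.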
I've used the BFG Repo-Cleaner a lot in the past and it's really good for deleting all files with a given extension. If you are unsure whether some files might still be needed, you could also look at git filter-repo, which has some nice analysis tools and can remove files based on file path.
As for risks, ideally a backup should be taken and stored somewhere just in case it is required, e.g. Google Cloud Storage Coldline or Archive storage, which can cost less than $0.01 per GB per month.
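One possible way to take such a backup, sketched here under assumptions, is to pack the whole repository (all branches and tags) into a single bundle file that can then be uploaded to whatever storage is chosen. The function name backup_to_bundle is hypothetical; git clone --mirror and git bundle are standard Git commands. Uploading the file is left out of the sketch.

```shell
# Write a single-file backup of every ref in a repository.
# $1 = repository URL or path
# $2 = ABSOLUTE path of the bundle file to create
backup_to_bundle() {
    work=$(mktemp -d) || return 1
    # A mirror clone copies all refs, not just the default branch.
    git clone --quiet --mirror "$1" "$work/mirror.git" &&
    git -C "$work/mirror.git" bundle create "$2" --all &&
    # verify must run inside a repository, so reuse the mirror.
    git -C "$work/mirror.git" bundle verify "$2"
    status=$?
    rm -rf "$work"
    return "$status"
}
```

The resulting file can be restored with an ordinary git clone of the bundle path, which makes it easy to check the backup before rewriting any history.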
In case of interest, some of the output of git filter-repo --analyze --force
analysis for deleted files/folders
These can be used to help determine what might be good to scrub from the repo, e.g. taking the largest folder no longer present in the repo, instat/static/InstatObject/R/extras (removed 2020-02-07):
Before
git count-objects -v -H
count: 0
size: 0 bytes
in-pack: 191512
packs: 1
size-pack: 4.48 GiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes
Example Operation
git filter-repo --path instat/static/InstatObject/R/extras --invert-paths
After: already reduced by almost 3.5 GB, to a total size of just over 1 GB
git count-objects -v -H
count: 0
size: 0 bytes
in-pack: 186267
packs: 1
size-pack: 1.04 GiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes
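To compare the before and after figures in a script rather than by eye, something like the following could be used. It is only a sketch: pack_size_kib and percent_reduction are illustrative names, and the first relies on git count-objects -v (without -H) reporting size-pack in KiB.

```shell
# Print the packed size of a repository in KiB.
# $1 = path to the repository
pack_size_kib() {
    git -C "$1" count-objects -v | awk '/^size-pack:/ { print $2 }'
}

# Print the percentage reduction between two sizes given in the same unit.
# $1 = size before, $2 = size after
percent_reduction() {
    LC_ALL=C awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}
```

With the numbers above, percent_reduction 4.48 1.04 prints 76.8, i.e. removing that one folder from history cuts the pack size by roughly three quarters.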
The next largest deleted folders I can see are instat/static/Library (removed 2021-02-17) and instat/static/InstatObject/R/RPackages (removed 2016-09-14).
Removing these takes the repo down to 435 MB in total.
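Several folders can be removed in a single pass, since git filter-repo accepts --path more than once together with --invert-paths. The wrapper below is a hypothetical helper, not part of filter-repo; it only prints the command unless DRY_RUN=0 is set, so the command can be reviewed first.

```shell
# Remove the given paths from the entire history in one rewrite.
# Usage: remove_paths PATH...
# Prints the command by default; set DRY_RUN=0 to actually run it.
remove_paths() {
    if [ "$#" -eq 0 ]; then
        echo "usage: remove_paths PATH..." >&2
        return 2
    fi
    n=$#
    # Append "--path P" for each argument, then drop the originals.
    for p in "$@"; do
        set -- "$@" --path "$p"
    done
    shift "$n"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        # filter-repo refuses to run outside a fresh clone unless --force is added.
        git filter-repo --invert-paths "$@"
    else
        echo git filter-repo --invert-paths "$@"
    fi
}
```

For instance, remove_paths instat/static/Library instat/static/InstatObject/R/RPackages prints the combined command for the two folders named above.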
Folder-based gains start to drop off from there onwards, but you could still prune quite a lot of deleted files if it is reasonable to do so.
Full disclaimer, though: I don't know much about the context of this repo or the files/folders removed, so I would still recommend a cautious approach (plus backups) if planning to implement steps similar to those outlined above.
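For anyone wanting to reproduce the deleted-folder analysis mentioned above, git filter-repo --analyze writes its reports as text files under .git/filter-repo/analysis/. The sketch below assumes the deleted-directory report is named directories-deleted-sizes.txt and starts with two header lines followed by rows sorted largest first; that layout is an assumption worth checking against the actual output, and top_deleted_dirs is an illustrative name.

```shell
# Show the largest deleted directories from a filter-repo analysis.
# $1 = path to a (non-bare) repository where --analyze has been run
# $2 = number of rows to show (default 10)
top_deleted_dirs() {
    report="$1/.git/filter-repo/analysis/directories-deleted-sizes.txt"
    if [ ! -f "$report" ]; then
        echo "report not found; run git filter-repo --analyze first" >&2
        return 1
    fi
    # Assumed layout: skip the two header lines, keep the first N rows.
    tail -n +3 "$report" | head -n "${2:-10}"
}
```

The paths listed there are candidates only; each one still needs a human decision about whether it is safe to scrub.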
@chrismclarke thank you for taking the trouble to share all the interesting info above.
@rdstern @ChrisMarsh82
Before taking any other action, I would recommend that anyone who does not need the full git history clone with the --depth 1 flag, e.g.
git clone --depth 1 https://github.com/africanmathsinitiative/R-Instat.git
I haven't tested this, but it could be a very valuable tip for anyone trying to clone R-Instat over a low-bandwidth connection.
The R-Instat GitHub repository (repo) is currently around 4.2 GB. This means that the repo can only be reliably cloned if the developer has a good Internet connection. Some of the interns are currently unable to clone the repo.
The repo is currently also too large to clone internally within GitHub. This is because GitHub will not clone a repo that has files larger than 100 MB. The R-Instat repo has packed object files (used by Git internally) that are larger than 100 MB. This prevents us from making a cloned repo in GitHub for backup or experimentation.
The large repo size is caused by data files that have since been deleted. If a file is deleted, Git (and therefore GitHub) still stores all previous versions of it in the repo's history.
If we remove all deleted data files then the repo will reduce in size from 4.2 GB to 0.14 GB (a 96% reduction).
I tested with the following commands in Git Bash:
Notes:
The bfg-repo-cleaner tool can be downloaded from here.
The push reports an error at the end: ! [remote rejected] refs/pull/1/head -> refs/pull/1/head (deny updating a hidden ref). Other GitHub users also reported this error. It is triggered because refs/pull is a special “read-only” ref for pull requests managed by GitHub, so we can’t push directly to these refs to rewrite history (more details here). The rest of the push seems to work correctly.
bfg-repo-cleaner rewrites the repo's history. This changes the SHAs of the commits that are altered and of any dependent commits. Changed commit SHAs may affect open pull requests in the repo. Therefore, before executing the steps above, all PRs should be closed. After executing the steps above, all developers should make a new clone of the main repo.
$ find . -iname '*.csv' -print0 | du --files0-from - -c -hs | tail -1
71M total
$ find . -iname '*.xlsx' -print0 | du --files0-from - -c -hs | tail -1
7.0M total
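The two find commands above can be generalised into a small helper that totals the size of all files with a given extension in the working tree. This is only a sketch: ext_total_bytes is an illustrative name, and it relies on GNU du (--files0-from and -b), as the commands above already do. It measures files currently checked out, not what is stored in the history.

```shell
# Print the total size in bytes of files with a given extension.
# $1 = directory to search
# $2 = extension without the dot, matched case-insensitively (e.g. csv)
ext_total_bytes() {
    find "$1" -type f -iname "*.$2" -print0 |
        du --files0-from=- -c -b |   # -b: apparent size in bytes, -c: add a total line
        tail -n 1 |                  # keep only the "total" line
        cut -f1                      # drop the label, keep the number
}
```

For example, ext_total_bytes . csv and ext_total_bytes . xlsx give byte counts corresponding to the two totals shown above.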