Closed (igormp closed this 4 years ago)
What command would be required to delete it, @igormp? Can you give me some insight?
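For reference, deleting a stale branch usually takes one command for the remote and one for the local copy. The sketch below is only an illustration: it uses a throwaway bare repo as a stand-in for the real remote (the temp paths and the empty commit are assumptions, not part of this project), and the branch name update-tika16 is taken from the issue description.

```shell
# Throwaway setup: a bare repo plays the role of "origin" (hypothetical paths).
demo=$(mktemp -d)
git init -q --bare "$demo/remote.git"
git init -q "$demo/work"
cd "$demo/work"
git remote add origin "$demo/remote.git"
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"
git branch update-tika16
git push -q origin HEAD update-tika16

# The two commands that actually delete the branch:
git push -q origin --delete update-tika16   # remove it from the remote
git branch -q -D update-tika16              # remove the local branch

# Check that the remote no longer advertises the branch (prints nothing).
remaining=$(git ls-remote --heads origin update-tika16)
echo "remote heads still matching update-tika16: '$remaining'"
```

Against the real repository only the two middle commands would be needed, and deleting a remote branch requires push access.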
Looking at this SO answer, running bfg with
bfg --strip-blobs-bigger-than 20M tika-python.git
and then running
git reflog expire --expire=now --all && git gc --prune=now --aggressive
inside the repo folder seemed to do the job, reducing the repo size from a little over 30MB to just 5MB. It removed 2 big files, as seen below.
Deleted files
-------------
Filename Git id
----------------------------------------------
tika-app-1.6-SNAPSHOT.jar | e39ddeed (28.3 MB)
tika-app-1.6.jar | 2dc1dcf2 (28.5 MB)
I believe it'd be a good idea to delete some stale branches first, to tidy things up before rewriting history to remove those big files.
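The steps described in this comment could be wrapped up roughly as follows. This is a sketch, not a tested recipe for this repo: strip_big_blobs is a hypothetical helper name, it assumes bfg is on PATH, and it assumes the first argument is a local clone you can afford to rewrite (ideally a fresh copy). The initial git gc is there because bfg scans packfiles, so loose objects need to be packed first.

```shell
# Hypothetical helper: strip blobs above a size limit and report the size
# of the repo before and after. Assumes `bfg` is installed and on PATH.
strip_big_blobs() {
    sbb_repo=$1            # path to a local clone (bare or non-bare)
    sbb_limit=${2:-20M}    # size threshold, same syntax bfg accepts

    du -sh "$sbb_repo"                                    # size before
    git -C "$sbb_repo" gc --quiet || return 1             # pack loose objects
    bfg --strip-blobs-bigger-than "$sbb_limit" "$sbb_repo" || return 1
    git -C "$sbb_repo" reflog expire --expire=now --all || return 1
    git -C "$sbb_repo" gc --prune=now --aggressive --quiet || return 1
    du -sh "$sbb_repo"                                    # size after
}
```

It would be called as strip_big_blobs tika-python.git 20M. The rewritten history still has to be force-pushed afterwards, and everyone with an existing clone needs to re-clone or reset, which is why tidying up stale branches beforehand makes sense.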
OK, I have bfg installed, so I could run the above commands, and here's what I got:
pomodoro:git mattmann$ bfg --strip-blobs-bigger-than 20M tika-python/
Using repo : /Users/mattmann/git/tika-python/.git
Scanning packfile for large blobs: 311
Scanning packfile for large blobs completed in 47 ms.
Warning : no large blobs matching criteria found in packfiles - does the repo need to be packed?
Please specify tasks for The BFG :
bfg 1.13.0
Usage: bfg [options] [<repo>]
-b, --strip-blobs-bigger-than <size>
strip blobs bigger than X (eg '128K', '1M', etc)
-B, --strip-biggest-blobs NUM
strip the top NUM biggest blobs
-bi, --strip-blobs-with-ids <blob-ids-file>
strip blobs with the specified Git object ids
-D, --delete-files <glob>
delete files with the specified names (eg '*.class', '*.{txt,log}' - matches on file name, not path within repo)
--delete-folders <glob> delete folders with the specified names (eg '.svn', '*-tmp' - matches on folder name, not path within repo)
--convert-to-git-lfs <value>
extract files with the specified names (eg '*.zip' or '*.mp4') into Git LFS
-rt, --replace-text <expressions-file>
filter content of files, replacing matched text. Match expressions should be listed in the file, one expression per line - by default, each expression is treated as a literal, but 'regex:' & 'glob:' prefixes are supported, with '==>' to specify a replacement string other than the default of '***REMOVED***'.
-fi, --filter-content-including <glob>
do file-content filtering on files that match the specified expression (eg '*.{txt,properties}')
-fe, --filter-content-excluding <glob>
don't do file-content filtering on files that match the specified expression (eg '*.{xml,pdf}')
-fs, --filter-content-size-threshold <size>
only do file-content filtering on files smaller than <size> (default is 1048576 bytes)
-p, --protect-blobs-from <refs>
protect blobs that appear in the most recent versions of the specified refs (default is 'HEAD')
--no-blob-protection allow the BFG to modify even your *latest* commit. Not recommended: you should have already ensured your latest commit is clean.
--private treat this repo-rewrite as removing private data (for example: omit old commit ids from commit messages)
--massive-non-file-objects-sized-up-to <size>
increase memory usage to handle over-size Commits, Tags, and Trees that are up to X in size (eg '10M')
<repo> file path for Git repository to clean
pomodoro:git mattmann$
It's not finding anything... but I did find this, which described running git gc to repack. After doing that, I got this:
pomodoro:tika-python mattmann$ bfg --strip-blobs-bigger-than 20M
Using repo : /Users/mattmann/git/tika-python/.git
Scanning packfile for large blobs: 1432
Scanning packfile for large blobs completed in 50 ms.
Found 2 blob ids for large blobs - biggest=29845408 smallest=29645937
Total size (unpacked)=59491345
Found 26 objects to protect
Found 17 tag-pointing refs : refs/tags/1.10, refs/tags/1.11, refs/tags/1.12, ...
Found 53 commit-pointing refs : HEAD, refs/heads/add-language, refs/heads/add-pip-directions, ...
Protected commits
-----------------
These are your protected commits, and so their contents will NOT be altered:
* commit 8b88be2e (protected by 'HEAD')
Cleaning
--------
Found 496 commits
Cleaning commits: 100% (496/496)
Cleaning commits completed in 366 ms.
Updating 16 Refs
----------------
Ref Before After
-------------------------------------------------------------
refs/heads/add-language | 928ef7a8 | bda58bdd
refs/heads/add-translate | 8dff73e7 | b8582ec9
refs/heads/fix-tests | f14aedb6 | c641cd89
refs/heads/fix-win | e4f9a117 | 3ee33466
refs/heads/tika-rest | 9f1558c3 | 9f2bb6c7
refs/heads/update-tika-16 | df2f6676 | eabc833b
refs/heads/update-tika16 | 698eeaeb | fbb8c667
refs/remotes/origin/add-language | 928ef7a8 | bda58bdd
refs/remotes/origin/add-translate | 8dff73e7 | b8582ec9
refs/remotes/origin/backup-master | f14aedb6 | c641cd89
refs/remotes/origin/fix-tests | f14aedb6 | c641cd89
refs/remotes/origin/fix-win | e4f9a117 | 3ee33466
refs/remotes/origin/revert-22-tika-rest | 1ca3b789 | ea5f82c6
refs/remotes/origin/tika-rest | 9f1558c3 | 9f2bb6c7
refs/remotes/origin/update-tika16 | 698eeaeb | fbb8c667
refs/tags/1.8.7 | 86f0165f | b9c86345
Updating references: 100% (16/16)
...Ref update completed in 64 ms.
Commit Tree-Dirt History
------------------------
Earliest Latest
| |
.DDDDDDDDDmmmmmmmmmmmmmmmmm.................................
D = dirty commits (file tree fixed)
m = modified commits (commit message or parents changed)
. = clean commits (no changes to file tree)
Before After
-------------------------------------------
First modified commit | 5291d025 | 4d7eba7c
Last dirty commit | 3c016b6d | 0f29ee5e
Deleted files
-------------
Filename Git id
----------------------------------------------
tika-app-1.6-SNAPSHOT.jar | e39ddeed (28.3 MB)
tika-app-1.6.jar | 2dc1dcf2 (28.5 MB)
In total, 150 object ids were changed. Full details are logged here:
/Users/mattmann/git/tika-python.bfg-report/2019-11-23/09-03-54
BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive
--
You can rewrite history in Git - don't let Trump do it for real!
Trump's administration has lied consistently, to make people give up on ever
being told the truth. Don't give up: https://www.theguardian.com/us-news/trump-administration
--
That looks right. Then I ran
pomodoro:tika-python mattmann$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
Enumerating objects: 1432, done.
Counting objects: 100% (1432/1432), done.
Delta compression using up to 12 threads
Compressing objects: 100% (1398/1398), done.
Writing objects: 100% (1432/1432), done.
Total 1432 (delta 917), reused 401 (delta 0)
pomodoro:tika-python mattmann$
All good!
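As a side note on the "does the repo need to be packed?" warning from the first run: git count-objects -v shows how many objects are loose versus stored in packfiles, which is a quick way to tell whether git gc is needed before bfg. A small sketch in a throwaway repo (the temp directory and the tiny stand-in file are assumptions for illustration only):

```shell
# Throwaway repo with a single commit, so every object starts out loose.
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
printf 'stand-in for a large jar\n' > big.jar
git add big.jar
git -c user.name=demo -c user.email=demo@example.com commit -q -m "add jar"

# Before gc: objects are loose, so nothing is in a packfile yet.
in_pack_before=$(git count-objects -v | awk '/^in-pack:/ {print $2}')

git gc -q

# After gc: the same objects have been moved into a packfile.
in_pack_after=$(git count-objects -v | awk '/^in-pack:/ {print $2}')
echo "in-pack before gc: $in_pack_before, after gc: $in_pack_after"
```

If in-pack is 0 (or far below the expected object count) in a real clone, running git gc first should let bfg see the blobs.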
When cloning the repo, it downloads over 30MB of data, something that I consider kinda weird.
I found #34 and checked whether any big files were left in the history; the most likely culprit is the update-tika16 branch, which contains some jar files. Would it be possible to delete this branch, since it has seen no updates in over 5 years?
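One way to check for big files left anywhere in the history is to list every blob reachable from any ref together with its size. The function below is a sketch using plain git plumbing; biggest_blobs is a hypothetical name, and paths containing spaces would be cut off at the first space.

```shell
# Hypothetical helper: print the N largest blobs reachable from any ref,
# as "size-in-bytes object-id path", largest first.
biggest_blobs() {
    bb_repo=${1:-.}     # repository to inspect (default: current directory)
    bb_count=${2:-10}   # how many entries to show

    git -C "$bb_repo" rev-list --objects --all |
        git -C "$bb_repo" cat-file \
            --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        awk '$1 == "blob" {print $3, $2, $4}' |
        sort -rn |
        head -n "$bb_count"
}
```

Running biggest_blobs . 5 inside a clone lists the five largest blobs. To see which branches still contain a given file, something like git log --all --oneline -- path/to/file can help.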