chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

Repository is too heavy #255

Closed igormp closed 4 years ago

igormp commented 4 years ago

When cloning the repo, Git downloads over 30MB of data, which seems unusually large for a project like this.

Cloning into 'tika-python'...
Warning: Permanently added the RSA host key for IP address '140.82.114.3' to the list of known hosts.
remote: Enumerating objects: 69, done.
remote: Counting objects: 100% (69/69), done.
remote: Compressing objects: 100% (53/53), done.
remote: Total 1418 (delta 39), reused 35 (delta 16), pack-reused 1349
Receiving objects: 100% (1418/1418), 32.14 MiB | 427.00 KiB/s, done.
Resolving deltas: 100% (787/787), done.

I found #34 and checked whether any big files were still left in the history. The most likely culprit appears to be the update-tika16 branch, which contains some jar files. Would it be possible to delete this branch, since it has seen no updates in over 5 years?
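One way to check the history for large files is to list every object reachable from any ref and rank the blobs by size. The sketch below is only an illustration, not what was actually run in this thread: the function names `parse_batch_check` and `largest_blobs` are hypothetical, and it assumes `git` is on the PATH. It relies on `git rev-list --objects --all` piped into `git cat-file --batch-check`.

```python
import subprocess


def parse_batch_check(output):
    """Return (size_bytes, object_id, path) for each blob line, largest first.

    Expects lines shaped like '<type> <id> <size> <path>', which is the
    format requested from `git cat-file --batch-check` below.
    """
    blobs = []
    for line in output.splitlines():
        parts = line.split(" ", 3)  # keep spaces inside the path intact
        if len(parts) < 3 or parts[0] != "blob":
            continue  # skip commits, trees, tags and malformed lines
        path = parts[3] if len(parts) > 3 else ""
        blobs.append((int(parts[2]), parts[1], path))
    return sorted(blobs, reverse=True)


def largest_blobs(repo, top=5):
    """Rank the blobs reachable from any ref in `repo` by size (hypothetical helper)."""
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout
    sizes = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        input=objects, capture_output=True, text=True, check=True,
    ).stdout
    return parse_batch_check(sizes)[:top]
```

Calling `largest_blobs("tika-python")` on a full clone should put any jar files that are still in the history at the top of the list, whichever branch introduced them.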

chrismattmann commented 4 years ago

What command would be required to delete it, @igormp? Can you give me some insight?

igormp commented 4 years ago

Following this SO answer, running bfg with bfg --strip-blobs-bigger-than 20M tika-python.git and then running git reflog expire --expire=now --all && git gc --prune=now --aggressive inside the repo folder seemed to do the job, reducing the repo size from a little over 30MB to just 5MB. It removed 2 big files, as seen below.

Deleted files
-------------

    Filename                    Git id            
    ----------------------------------------------
    tika-app-1.6-SNAPSHOT.jar | e39ddeed (28.3 MB)
    tika-app-1.6.jar          | 2dc1dcf2 (28.5 MB)

I believe it'd be a good idea to delete some stale branches first, to tidy things up before rewriting history to remove those big files.
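To decide which branches count as stale, the last commit date of each remote branch can be compared against a cutoff. The following is a minimal sketch under stated assumptions: `stale_branches` is a hypothetical helper, and its input is assumed to be the text printed by `git for-each-ref --format='%(committerdate:short) %(refname:short)' refs/remotes`, with one date and one ref per line.

```python
from datetime import date


def stale_branches(for_each_ref_output, cutoff):
    """Return (ref, last_commit_day) pairs whose last commit is older than `cutoff`.

    Each input line is expected to look like 'YYYY-MM-DD <refname>'.
    """
    stale = []
    for line in for_each_ref_output.splitlines():
        day, _, ref = line.partition(" ")
        # Lines without a ref name (e.g. blank lines) are ignored.
        if ref and date.fromisoformat(day) < cutoff:
            stale.append((ref, day))
    return stale
```

A branch that turns up in this list could then be removed from the remote with `git push origin --delete <branch>`, once it is confirmed that nothing still depends on it.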

chrismattmann commented 4 years ago

OK I have bfg installed, so I could run the above commands, and here's what I got:

pomodoro:git mattmann$ bfg --strip-blobs-bigger-than 20M tika-python/

Using repo : /Users/mattmann/git/tika-python/.git

Scanning packfile for large blobs: 311
Scanning packfile for large blobs completed in 47 ms.
Warning : no large blobs matching criteria found in packfiles - does the repo need to be packed?
Please specify tasks for The BFG :
bfg 1.13.0
Usage: bfg [options] [<repo>]

  -b, --strip-blobs-bigger-than <size>
                           strip blobs bigger than X (eg '128K', '1M', etc)
  -B, --strip-biggest-blobs NUM
                           strip the top NUM biggest blobs
  -bi, --strip-blobs-with-ids <blob-ids-file>
                           strip blobs with the specified Git object ids
  -D, --delete-files <glob>
                           delete files with the specified names (eg '*.class', '*.{txt,log}' - matches on file name, not path within repo)
  --delete-folders <glob>  delete folders with the specified names (eg '.svn', '*-tmp' - matches on folder name, not path within repo)
  --convert-to-git-lfs <value>
                           extract files with the specified names (eg '*.zip' or '*.mp4') into Git LFS
  -rt, --replace-text <expressions-file>
                           filter content of files, replacing matched text. Match expressions should be listed in the file, one expression per line - by default, each expression is treated as a literal, but 'regex:' & 'glob:' prefixes are supported, with '==>' to specify a replacement string other than the default of '***REMOVED***'.
  -fi, --filter-content-including <glob>
                           do file-content filtering on files that match the specified expression (eg '*.{txt,properties}')
  -fe, --filter-content-excluding <glob>
                           don't do file-content filtering on files that match the specified expression (eg '*.{xml,pdf}')
  -fs, --filter-content-size-threshold <size>
                           only do file-content filtering on files smaller than <size> (default is 1048576 bytes)
  -p, --protect-blobs-from <refs>
                           protect blobs that appear in the most recent versions of the specified refs (default is 'HEAD')
  --no-blob-protection     allow the BFG to modify even your *latest* commit. Not recommended: you should have already ensured your latest commit is clean.
  --private                treat this repo-rewrite as removing private data (for example: omit old commit ids from commit messages)
  --massive-non-file-objects-sized-up-to <size>
                           increase memory usage to handle over-size Commits, Tags, and Trees that are up to X in size (eg '10M')
  <repo>                   file path for Git repository to clean
pomodoro:git mattmann$ 
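The warning in the output above asks whether the repo needs to be packed: BFG scans packfiles, so blobs that are still stored as loose objects would not be seen. A quick way to check for this is `git count-objects -v`, which reports loose objects under `count` and packed objects under `in-pack`. The sketch below assumes `git` is on the PATH; `parse_count_objects` and `needs_repack` are hypothetical names used for illustration.

```python
import subprocess


def parse_count_objects(output):
    """Turn `git count-objects -v` output ('key: value' lines) into a dict of ints."""
    stats = {}
    for line in output.splitlines():
        key, sep, value = line.partition(": ")
        if sep and value.strip().isdigit():
            stats[key] = int(value.strip())
    return stats


def needs_repack(repo):
    """Return True if `repo` still holds loose objects that a packfile scan would miss."""
    out = subprocess.run(
        ["git", "-C", repo, "count-objects", "-v"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_count_objects(out).get("count", 0) > 0
```

If `needs_repack` returns True, running `git gc` in the repository first should move the loose objects into a packfile before BFG is run again.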

It's not finding anything... but I did find this, which describes running git gc to repack the repository first. After doing that, I got this:

pomodoro:tika-python mattmann$ bfg --strip-blobs-bigger-than 20M

Using repo : /Users/mattmann/git/tika-python/.git

Scanning packfile for large blobs: 1432
Scanning packfile for large blobs completed in 50 ms.
Found 2 blob ids for large blobs - biggest=29845408 smallest=29645937
Total size (unpacked)=59491345
Found 26 objects to protect
Found 17 tag-pointing refs : refs/tags/1.10, refs/tags/1.11, refs/tags/1.12, ...
Found 53 commit-pointing refs : HEAD, refs/heads/add-language, refs/heads/add-pip-directions, ...

Protected commits
-----------------

These are your protected commits, and so their contents will NOT be altered:

 * commit 8b88be2e (protected by 'HEAD')

Cleaning
--------

Found 496 commits
Cleaning commits:       100% (496/496)
Cleaning commits completed in 366 ms.

Updating 16 Refs
----------------

    Ref                                       Before     After   
    -------------------------------------------------------------
    refs/heads/add-language                 | 928ef7a8 | bda58bdd
    refs/heads/add-translate                | 8dff73e7 | b8582ec9
    refs/heads/fix-tests                    | f14aedb6 | c641cd89
    refs/heads/fix-win                      | e4f9a117 | 3ee33466
    refs/heads/tika-rest                    | 9f1558c3 | 9f2bb6c7
    refs/heads/update-tika-16               | df2f6676 | eabc833b
    refs/heads/update-tika16                | 698eeaeb | fbb8c667
    refs/remotes/origin/add-language        | 928ef7a8 | bda58bdd
    refs/remotes/origin/add-translate       | 8dff73e7 | b8582ec9
    refs/remotes/origin/backup-master       | f14aedb6 | c641cd89
    refs/remotes/origin/fix-tests           | f14aedb6 | c641cd89
    refs/remotes/origin/fix-win             | e4f9a117 | 3ee33466
    refs/remotes/origin/revert-22-tika-rest | 1ca3b789 | ea5f82c6
    refs/remotes/origin/tika-rest           | 9f1558c3 | 9f2bb6c7
    refs/remotes/origin/update-tika16       | 698eeaeb | fbb8c667
    refs/tags/1.8.7                         | 86f0165f | b9c86345

Updating references:    100% (16/16)
...Ref update completed in 64 ms.

Commit Tree-Dirt History
------------------------

    Earliest                                              Latest
    |                                                          |
    .DDDDDDDDDmmmmmmmmmmmmmmmmm.................................

    D = dirty commits (file tree fixed)
    m = modified commits (commit message or parents changed)
    . = clean commits (no changes to file tree)

                            Before     After   
    -------------------------------------------
    First modified commit | 5291d025 | 4d7eba7c
    Last dirty commit     | 3c016b6d | 0f29ee5e

Deleted files
-------------

    Filename                    Git id            
    ----------------------------------------------
    tika-app-1.6-SNAPSHOT.jar | e39ddeed (28.3 MB)
    tika-app-1.6.jar          | 2dc1dcf2 (28.5 MB)

In total, 150 object ids were changed. Full details are logged here:

    /Users/mattmann/git/tika-python.bfg-report/2019-11-23/09-03-54

BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive

--
You can rewrite history in Git - don't let Trump do it for real!
Trump's administration has lied consistently, to make people give up on ever
being told the truth. Don't give up: https://www.theguardian.com/us-news/trump-administration
--

That looks right. Then I ran:

pomodoro:tika-python mattmann$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
Enumerating objects: 1432, done.
Counting objects: 100% (1432/1432), done.
Delta compression using up to 12 threads
Compressing objects: 100% (1398/1398), done.
Writing objects: 100% (1432/1432), done.
Total 1432 (delta 917), reused 401 (delta 0)
pomodoro:tika-python mattmann$ 

All good!
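For reference, the sequence that worked in this thread can be sketched as a small script: repack, strip the large blobs, expire the reflog, then garbage-collect aggressively. This is a sketch under stated assumptions, not a definitive procedure: `cleanup_steps` and `run_cleanup` are hypothetical names, `bfg` is assumed to be available on the PATH as in the transcripts above, and the 20M threshold is the one used in this issue.

```python
import subprocess


def cleanup_steps(threshold="20M"):
    """Return the commands from this thread, in the order they were run."""
    return [
        ["git", "gc"],  # repack so BFG can see every blob
        ["bfg", "--strip-blobs-bigger-than", threshold],
        ["git", "reflog", "expire", "--expire=now", "--all"],
        ["git", "gc", "--prune=now", "--aggressive"],
    ]


def run_cleanup(repo, threshold="20M"):
    """Run each step inside `repo`, stopping at the first failure."""
    for argv in cleanup_steps(threshold):
        subprocess.run(argv, cwd=repo, check=True)
```

Note that these steps only rewrite the local repository. The thread does not show how the rewritten refs were published, so that step is left out here; it would be safest to try the script on a fresh clone first.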