edgi-govdata-archiving / 100days

Website for EDGI 100 Days Report
https://100days.envirodatagov.org

Remove large files from git history #35

Closed patcon closed 7 years ago

patcon commented 7 years ago

Reticketed from #32

https://rtyley.github.io/bfg-repo-cleaner/

Biggest image is 1.3MB, so anything over 2MB is probably something we could strip out (?)
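To sanity-check that 2MB cutoff before stripping anything, something like the sketch below could list every blob in history above a size threshold. The function name `list_large_blobs` and the default of 2097152 bytes are assumptions made for illustration; only standard `git rev-list` and `git cat-file` options are used.

```shell
# list_large_blobs REPO_DIR [MIN_BYTES]
# Prints "<size-in-bytes> <path>" for every blob in history that is bigger
# than MIN_BYTES, largest first.
# Hypothetical helper; the default threshold (2 * 1024 * 1024) is an assumption.
list_large_blobs() {
  repo="$1"
  min_bytes="${2:-2097152}"
  # rev-list emits "<sha> <path>"; cat-file echoes the path back via %(rest).
  git -C "$repo" rev-list --objects --all |
    git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk -v min="$min_bytes" '$1 == "blob" && $2 > min { sub(/^blob /, ""); print }' |
    sort -rn
}
```

Running `list_large_blobs path/to/clone` on a full clone should show whether anything other than the intended files would be caught by the cutoff.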

Blocked on #33 bc existing PRs will break after this

dcwalk commented 7 years ago

Yo! We should move on this... steps are:

  1. git clone --mirror git@github.com:edgi-govdata-archiving/100days.git
  2. java -jar bfg.jar --strip-blobs-bigger-than 2M 100days.git/
  3. cd 100days.git
  4. git reflog expire --expire=now --all && git gc --prune=now --aggressive
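For anyone repeating this, the steps above could be wrapped roughly as follows. This is a sketch, not the exact script that was run: the `run` and `strip_large_blobs` names and the `DRY_RUN` switch are hypothetical, `bfg.jar` is assumed to sit in the current directory, and the final `git push` is not in the list above but is the last step in the BFG usage instructions.

```shell
# Hypothetical wrapper around the steps above. Set DRY_RUN=1 to print the
# commands instead of executing them.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# strip_large_blobs REPO_URL [LIMIT] [BFG_JAR]
strip_large_blobs() {
  url="$1"
  limit="${2:-2M}"
  jar="${3:-bfg.jar}"              # assumption: jar is in the current directory
  dir="$(basename "$url")"         # a mirror clone of .../100days.git lands in ./100days.git

  run git clone --mirror "$url" || return 1
  run java -jar "$jar" --strip-blobs-bigger-than "$limit" "$dir" || return 1
  run git -C "$dir" reflog expire --expire=now --all || return 1
  run git -C "$dir" gc --prune=now --aggressive || return 1
  # Not in the list above: a mirror clone pushes all refs back to origin.
  run git -C "$dir" push
}
```

A preview run would look like `DRY_RUN=1; strip_large_blobs git@github.com:edgi-govdata-archiving/100days.git`. Using `git -C` instead of `cd` keeps the caller's working directory unchanged.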

I'm ready to do this now if that works... just tested and my report is showing...

Scanning packfile for large blobs: 274
Scanning packfile for large blobs completed in 33 ms.
Found 1 blob ids for large blobs - biggest=11744349 smallest=11744349
Total size (unpacked)=11744349
Found 32 objects to protect
Found 22 commit-pointing refs : HEAD, refs/heads/master, refs/pull/1/head, ...

Protected commits
-----------------

These are your protected commits, and so their contents will NOT be altered:

 * commit eb3b2c59 (protected by 'HEAD')

Cleaning
--------

Found 76 commits
Cleaning commits:       100% (76/76)
Cleaning commits completed in 330 ms.

Updating 19 Refs
----------------

    Ref                  Before     After   
    ----------------------------------------
    refs/heads/master  | eb3b2c59 | 5f009531
    refs/pull/14/head  | 7da1e7ab | 2fc96ff1
    refs/pull/18/head  | a9db8013 | 897b5869
    refs/pull/20/head  | ec62d5c5 | 075dacd6
    refs/pull/21/head  | 72d7e137 | 761a5f5f
    refs/pull/23/head  | aa95627e | 51ce6769
    refs/pull/25/head  | b73aa867 | f22cdbb7
    refs/pull/26/head  | 6c8e19f2 | b7572959
    refs/pull/26/merge | c0441e4c | 91c996cc
    refs/pull/28/head  | 31b2927a | e210e811
    refs/pull/32/head  | 89f9ed8d | a5ea1cf4
    refs/pull/33/head  | 933f3faa | eaecaf85
    refs/pull/34/head  | ce2f6748 | e771d605
    refs/pull/36/head  | d35fec30 | 6aa092b2
    refs/pull/37/head  | b9e3b60e | debf89d7
    ...

Updating references:    100% (19/19)
...Ref update completed in 50 ms.

Commit Tree-Dirt History
------------------------

    Earliest                                              Latest
    |                                                          |
    ......DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDmDmmmDmmmmmmmmm

    D = dirty commits (file tree fixed)
    m = modified commits (commit message or parents changed)
    . = clean commits (no changes to file tree)

                            Before     After   
    -------------------------------------------
    First modified commit | 50681e0c | 7731ce02
    Last dirty commit     | ce2f6748 | e771d605

Deleted files
-------------

    Filename                     Git id            
    -----------------------------------------------
    Part-1-EPA-Under-Siege.pdf | b0694f57 (11.2 MB)

In total, 126 object ids were changed.

patcon commented 7 years ago

Yeah, do it! No pending PRs, so perf!

dcwalk commented 7 years ago

Okay! This is done. FYI, there was an error message when pushing back, which is caused by the reference to the large file in the record of previous PRs (documented here: https://github.com/rtyley/bfg-repo-cleaner/issues/36). I followed the steps documented in that issue... and when I do a fresh clone of the repo, it is 1/4 the starting size 🎉
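One way to put a number on the size reduction is to compare the packfile size of an old clone against a fresh one. The helper below is a sketch; `pack_size_kib` is a hypothetical name, and it only reads the `size-pack` field that `git count-objects -v` reports in KiB.

```shell
# pack_size_kib REPO_DIR
# Prints the on-disk size of the repo's packfiles in KiB, taken from the
# "size-pack" line of `git count-objects -v`. Hypothetical helper name.
pack_size_kib() {
  git -C "$1" count-objects -v | awk '/^size-pack:/ { print $2 }'
}
```

Calling it on both directories, for example `pack_size_kib old-clone` and `pack_size_kib fresh-clone` (both paths are placeholders), gives two figures to compare directly.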

[Screenshot, 2017-07-07 11:38 am]

dcwalk commented 7 years ago

@patcon and @shaqsingh -- I believe you should re-clone the repo before making any fresh commits, and/or rebase onto the new master if you have something in progress!
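To tell whether a local clone predates the rewrite, it may help to check whether it still contains the old `master` commit from the BFG report (`eb3b2c59`, which became `5f009531`). This is only a sketch; `has_commit` is a hypothetical helper and the path in the usage comment is a placeholder.

```shell
# has_commit REPO_DIR COMMIT
# Exits 0 if COMMIT exists as a commit in the repo's object database,
# non-zero otherwise. Hypothetical helper name.
has_commit() {
  git -C "$1" cat-file -e "$2^{commit}" 2>/dev/null
}

# Usage, with a placeholder path:
# if has_commit path/to/100days eb3b2c59; then
#   echo "This clone still has the pre-rewrite history; re-clone before committing."
# fi
```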

patcon commented 7 years ago

If I recall, rebase would get hella messy. I suspect reclone and copy-pasta is the way to go, if that happens.