biomedicalinformaticsgroup / Sargasso

Sargasso disambiguates mixed-species high-throughput sequencing data.
http://biomedicalinformaticsgroup.github.io/Sargasso/

Clean large file in git history #100

Closed hxin closed 5 years ago

hxin commented 5 years ago

I plan to follow this.

Use this to remove the large files. Anything that rewrites git history is risky, and I am not yet sure exactly how this works, so I will need to check a few more things before actually pushing the change back. I will also confirm with @lweasel before making any change to the repo.

java -jar bfg-1.13.0.jar --strip-blobs-bigger-than 10M Sargasso

Using repo : /Users/xinhe/tmp/Sargasso.git

Scanning packfile for large blobs: 2777
Scanning packfile for large blobs completed in 61 ms.
Found 11 blob ids for large blobs - biggest=43286292 smallest=10565433
Total size (unpacked)=230395636
Found 56 objects to protect
Found 32 commit-pointing refs : HEAD, refs/heads/bisulfite_seq, refs/heads/dev, ...

Protected commits
-----------------

These are your protected commits, and so their contents will NOT be altered:

 * commit e9d33c41 (protected by 'HEAD')

Cleaning
--------

Found 422 commits
Cleaning commits:       100% (422/422)
Cleaning commits completed in 832 ms.

Updating 7 Refs
---------------

    Ref                        Before     After
    ----------------------------------------------
    refs/heads/bisulfite_seq | 0f9585e8 | 930ca946
    refs/heads/dev           | 74a7d94f | 351b74b5
    refs/heads/dev-pre-merge | e9d33c41 | 11ff4f28
    refs/heads/master        | e9d33c41 | 11ff4f28
    refs/pull/78/head        | 0cbe0b79 | 2c843fcc
    refs/pull/95/head        | f08cf0b5 | 8ac289cd
    refs/tags/v2.0           | e9d33c41 | 11ff4f28

Updating references:    100% (7/7)
...Ref update completed in 59 ms.

Commit Tree-Dirt History
------------------------

    Earliest                                              Latest
    |                                                          |
    ...........................................DDDD.DDDDDDDDDDDD

    D = dirty commits (file tree fixed)
    m = modified commits (commit message or parents changed)
    . = clean commits (no changes to file tree)

                            Before     After
    -------------------------------------------
    First modified commit | 853fd0bd | 058c3f02
    Last dirty commit     | 74a7d94f | 351b74b5

Deleted files
-------------

    Filename                                            Git id
    ------------------------------------------------------------------------------------------
    bisulfite_human_pe_sample.human.bam               | b46a9c34 (18.4 MB), 69b88675 (15.0 MB)
    bisulfite_human_pe_sample.human.premerge.bam      | eb13efb0 (15.2 MB)
    bisulfite_human_pe_sample___human___BLOCK___1.bam | 101895be (41.2 MB)
    bisulfite_human_pe_sample___human___BLOCK___2.bam | 53abc0a7 (41.3 MB)
    bisulfite_human_se_sample.human.bam               | a89af790 (10.1 MB)
    bisulfite_human_se_sample___human___BLOCK___1.bam | 021565ac (16.1 MB)
    bisulfite_human_se_sample___human___BLOCK___2.bam | 53d80afe (16.1 MB)
    rnaseq_mouse_rat_sample___mouse___BLOCK___1.bam   | 63b34379 (10.1 MB)
    rnaseq_mouse_rat_sample___rat___BLOCK___1.bam     | 40c465fe (18.1 MB)
    rnaseq_mouse_rat_sample___rat___BLOCK___2.bam     | 21563153 (18.1 MB)

In total, 226 object ids were changed. Full details are logged here:

    /Users/xinhe/tmp/Sargasso.git.bfg-report/2019-05-30/14-37-24

BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive

hxin commented 5 years ago

The BFG tool is also mentioned on the GitHub help page.

hxin commented 5 years ago

I have forked the repo and will use the fork for testing.

5f7b3394e88a  1.1MiB tests/data/pe/chipseq/filtered_reads/Blocks/chiseq_mouse_sample___mouse___BLOCK___2.bam
7664ffc2f03e  1.1MiB tests/data/pe/chipseq/filtered_reads/Blocks/chiseq_mouse_sample___mouse___BLOCK___1.bam
87c5387ae7ed  1.2MiB tests/data/pe/rnaseq/filtered_reads/rnaseq_mouse_rat_sample_mouse_filtered.bam
8b09117b550d  1.2MiB tests/data/se/chipseq/filtered_reads/Blocks/chiseq_mouse_se_sample___mouse___BLOCK___1.bam
7e003799c8cf  1.3MiB tests/data/se/chipseq/filtered_reads/Blocks/chiseq_mouse_se_sample___mouse___BLOCK___2.bam
df98e9adfd83  1.5MiB pipeline_test/data/bam/sample_reads.human.bam
4f91b3c1941c  1.5MiB tests/data/pe/rnaseq/sorted_reads/rnaseq_mouse_rat_sample.human.bam
c054e2ea2ef1  2.8MiB pipeline_test/data/fastq/mouse_rat_test_1.fastq.gz
68f1a5b71a92  2.9MiB pipeline_test/data/fastq/mouse_rat_test_2.fastq.gz
982b4b5e4225  3.1MiB tests/data/pe/rnaseq/filtered_reads/Blocks/rnaseq_mouse_rat_sample___human___BLOCK___1.bam
15f50b0528c4  3.2MiB tests/data/pe/rnaseq/filtered_reads/Blocks/rnaseq_mouse_rat_sample___human___BLOCK___2.bam
91c15a0935a5  3.6MiB tests/data/se/bisulfite/filtered_reads/bisulfite_human_se_sample___human___1___filtered.bam
ef8e9ebc795b  3.6MiB tests/data/pe/chipseq/filtered_reads/Blocks/chiseq_mouse_sample___mouse___BLOCK___1.bam
3b6979566ab1  3.6MiB tests/data/se/bisulfite/filtered_reads/bisulfite_human_se_sample___human___0___filtered.bam
733ddf4693c4  3.6MiB tests/data/pe/chipseq/filtered_reads/Blocks/chiseq_mouse_sample___mouse___BLOCK___2.bam
794ce6ce7bea  4.0MiB tests/data/raw_reads/bisulfite_human_pe_R1.fastq.gz
81678411ada0  4.0MiB tests/data/pe/rnaseq/filtered_reads/rnaseq_mouse_rat_sample_rat_1_filtered.bam
635d08091a8b  4.1MiB tests/data/pe/rnaseq/filtered_reads/rnaseq_mouse_rat_sample_rat_0_filtered.bam
3cd1081155f2  4.7MiB tests/data/pe/bisulfite/filtered_reads/bisulfite_human_pe_sample___human___0___filtered.bam
7e81c6c56a69  4.8MiB tests/data/pe/bisulfite/filtered_reads/bisulfite_human_pe_sample___human___1___filtered.bam
568f5c0a984f  5.1MiB tests/data/raw_reads/bisulfite_human_pe_R2.fastq.gz
b0da1428a772  5.5MiB pipeline_test/data/bam/sample_reads.mouse.bam
a367d7940712  5.5MiB tests/data/pe/rnaseq/sorted_reads/rnaseq_mouse_rat_sample.mouse.bam
1b1ed02c163a  7.2MiB tests/data/se/bisulfite/filtered_reads/bisulfite_human_se_sample___human___filtered.bam
811066e69490  7.6MiB tests/data/raw_reads/bisulfite_human_se.fastq.gz
30220ba773cd  8.1MiB tests/data/pe/rnaseq/filtered_reads/rnaseq_mouse_rat_sample_rat_filtered.bam
a5adcea049ff  8.5MiB tests/data/se/bisulfite/sorted_reads/bisulfite_human_se_sample.human.premerge.bam
d391f5a7b87c  8.5MiB tests/data/se/bisulfite/mapped_reads/bisulfite_human_se_sample.human.bam
1a44ecfed55c  9.4MiB pipeline_test/data/bam/sample_reads.rat.bam
b02776cc3020  9.4MiB tests/data/pe/rnaseq/sorted_reads/rnaseq_mouse_rat_sample.rat.bam
c7897e505f78  9.5MiB tests/data/pe/bisulfite/filtered_reads/bisulfite_human_pe_sample___human___filtered.bam
1766a7fa2e51   10MiB tests/data/pe/rnaseq/filtered_reads/Blocks/rnaseq_mouse_rat_sample___mouse___BLOCK___2.bam
63b3437938ba   10MiB tests/data/pe/rnaseq/filtered_reads/Blocks/rnaseq_mouse_rat_sample___mouse___BLOCK___1.bam
a89af79006d4   10MiB tests/data/se/bisulfite/sorted_reads/bisulfite_human_se_sample.human.bam
69b886750287   15MiB tests/data/pe/bisulfite/mapped_reads/bisulfite_human_pe_sample.human.bam
eb13efb043aa   15MiB tests/data/pe/bisulfite/sorted_reads/bisulfite_human_pe_sample.human.premerge.bam
53d80afe6a70   16MiB tests/data/se/bisulfite/filtered_reads/Blocks/bisulfite_human_se_sample___human___BLOCK___2.bam
021565acacff   16MiB tests/data/se/bisulfite/filtered_reads/Blocks/bisulfite_human_se_sample___human___BLOCK___1.bam
21563153931d   18MiB tests/data/pe/rnaseq/filtered_reads/Blocks/rnaseq_mouse_rat_sample___rat___BLOCK___2.bam
40c465fe3383   18MiB tests/data/pe/rnaseq/filtered_reads/Blocks/rnaseq_mouse_rat_sample___rat___BLOCK___1.bam
b46a9c343f87   18MiB tests/data/pe/bisulfite/sorted_reads/bisulfite_human_pe_sample.human.bam
101895be216f   41MiB tests/data/pe/bisulfite/filtered_reads/Blocks/bisulfite_human_pe_sample___human___BLOCK___1.bam
53abc0a7b419   41MiB tests/data/pe/bisulfite/filtered_reads/Blocks/bisulfite_human_pe_sample___human___BLOCK___2.bam
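
A listing like the one above can be narrowed to the blobs that exceed a size limit, which may help when picking a cut-off. This is a minimal sketch, assuming input lines of the form `blob <sha> <size-in-bytes> <path>` as produced by `git cat-file --batch-check`. The function name `blobs_over` is hypothetical and is not part of git or BFG.

```shell
#!/bin/sh
# blobs_over (hypothetical helper): read "<type> <sha> <size-in-bytes> <path>"
# lines on stdin and print "<size> <path>" for every blob of at least $1 bytes.
blobs_over() {
  awk -v min="$1" '
    $1 == "blob" && $3 + 0 >= min + 0 {
      size = $3
      # Strip the first three fields so paths containing spaces survive intact.
      sub(/^[^ ]+ [^ ]+ [^ ]+ /, "")
      print size, $0
    }'
}

# Usage inside a clone (not run here):
#   git rev-list --objects --all \
#     | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
#     | blobs_over 10485760   # 10485760 bytes = 10 MiB
```
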
hxin commented 5 years ago

Here is the plan:

# Run in the foreground so the clone has finished before the cd below
git clone --mirror https://github.com/statbio/Sargasso.git
#git clone https://github.com/statbio/Sargasso.git
wget https://repo1.maven.org/maven2/com/madgag/bfg/1.13.0/bfg-1.13.0.jar

cd Sargasso.git

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --key=2 \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

cd ..

# java -jar bfg-1.13.0.jar --strip-blobs-bigger-than 20M ~/tmp/Sargasso.git

java -jar bfg-1.13.0.jar --delete-files bisulfite_human_se_sample___human___1___filtered.bam ~/tmp/Sargasso.git
java -jar bfg-1.13.0.jar --delete-files bisulfite_human_se_sample___human___0___filtered.bam ~/tmp/Sargasso.git
java -jar bfg-1.13.0.jar --delete-files bisulfite_human_pe_R1.fastq.gz ~/tmp/Sargasso.git
java -jar bfg-1.13.0.jar --delete-files bisulfite_human_pe_sample___human___0___filtered.bam ~/tmp/Sargasso.git
java -jar bfg-1.13.0.jar --delete-files bisulfite_human_pe_sample___human___1___filtered.bam ~/tmp/Sargasso.git
java -jar bfg-1.13.0.jar --delete-files bisulfite_human_pe_R2.fastq.gz ~/tmp/Sargasso.git
java -jar bfg-1.13.0.jar --delete-files bisulfite_human_se_sample___human___filtered.bam ~/tmp/Sargasso.git
java -jar bfg-1.13.0.jar --delete-files bisulfite_human_se.fastq.gz ~/tmp/Sargasso.git
java -jar bfg-1.13.0.jar --delete-files bisulfite_human_se_sample.human.premerge.bam ~/tmp/Sargasso.git
java -jar bfg-1.13.0.jar --delete-files bisulfite_human_se_sample.human.bam ~/tmp/Sargasso.git
java -jar bfg-1.13.0.jar --delete-files bisulfite_human_pe_sample___human___filtered.bam ~/tmp/Sargasso.git
java -jar bfg-1.13.0.jar --delete-files bisulfite_human_pe_sample.human.bam ~/tmp/Sargasso.git
java -jar bfg-1.13.0.jar --delete-files bisulfite_human_pe_sample.human.premerge.bam ~/tmp/Sargasso.git
java -jar bfg-1.13.0.jar --delete-files bisulfite_human_se_sample___human___BLOCK___2.bam ~/tmp/Sargasso.git
java -jar bfg-1.13.0.jar --delete-files bisulfite_human_se_sample___human___BLOCK___1.bam ~/tmp/Sargasso.git
java -jar bfg-1.13.0.jar --delete-files bisulfite_human_pe_sample___human___BLOCK___1.bam ~/tmp/Sargasso.git
java -jar bfg-1.13.0.jar --delete-files bisulfite_human_pe_sample___human___BLOCK___2.bam ~/tmp/Sargasso.git

cd Sargasso.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push

This will remove the large files and change the commit hashes, but only in the bisulfite branch. A git pull is required afterwards to update any other local copies of the Sargasso repo.
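
The per-file BFG calls in the plan could probably be collapsed into a single run, since `--delete-files` takes a glob and BFG's usage text gives `*.{txt,log}` as an example. This is a sketch under that assumption. `bfg_glob` is a hypothetical helper, and the resulting command should be tried on the fork first.

```shell
#!/bin/sh
# bfg_glob (hypothetical helper): join file names into a brace glob such as
# {a.bam,b.bam}, dropping repeated names while keeping first-seen order.
bfg_glob() {
  printf '%s\n' "$@" | awk '!seen[$0]++' | paste -sd, - | sed 's/^/{/; s/$/}/'
}

# Usage (not run here; assumes bfg-1.13.0.jar and the mirror clone exist):
#   java -jar bfg-1.13.0.jar \
#     --delete-files "$(bfg_glob bisulfite_human_pe_R1.fastq.gz bisulfite_human_pe_R2.fastq.gz)" \
#     ~/tmp/Sargasso.git
```

Because BFG matches on file name rather than path, every file with one of these names would be affected wherever it sits in the tree.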

hxin commented 5 years ago

This reduces the size of the repo from 230 MB to 77 MB.
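
One way to check a figure like this is `git count-objects -v`, whose `size-pack` field reports the packed size in KiB. This is a small sketch, assuming that output format. `pack_mib` is a hypothetical helper name.

```shell
#!/bin/sh
# pack_mib (hypothetical helper): read `git count-objects -v` output on stdin
# and print the size-pack value (reported by git in KiB) converted to MiB.
pack_mib() {
  awk -F': ' '$1 == "size-pack" { printf "%.1f\n", $2 / 1024 }'
}

# Usage inside the mirror clone, before and after the cleanup (not run here):
#   git count-objects -v | pack_mib
```

Running it after `git gc --prune=now --aggressive` matters, because unreachable objects still count towards the pack size until they are pruned.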