AllenCellModeling / aicsimageio

Image Reading, Metadata Conversion, and Image Writing for Microscopy Images in Python
https://allencellmodeling.github.io/aicsimageio

.git/objects/pack is huge, consider history rewrite? #451

Closed tlambert03 closed 1 year ago

tlambert03 commented 1 year ago

I'm on a very slow internet connection at the moment and wanted to do a little work on aicsimageio. I tried to clone the repo and it took a long time, even though the direct zip download was only 2MB (most of which is the presentations; the source itself is only 800K unzipped).

The full repo is 337 MB, and 333M of that is in .git/objects/pack, which I suspect indicates that at one point in the past, test images were included in the repo. I wonder how folks would feel about a git filter-branch rewrite? https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository
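A quick way to confirm where the space goes is `git count-objects -v`, which reports the packed size as `size-pack` in KiB. The helper below is only a sketch; the name `pack_size_mib` is made up for illustration.

```shell
# Hypothetical helper: read `git count-objects -v` output on stdin and
# print the packed size in MiB (git reports size-pack in KiB).
pack_size_mib() {
  awk -F': ' '$1 == "size-pack" { printf "%.1f MiB\n", $2 / 1024 }'
}

# Usage, from inside a clone:
#   git count-objects -v | pack_size_mib
```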

evamaxfield commented 1 year ago

cc @toloudis cc @AetherUnbound

Yes, at one point we kept test images in Git LFS and, I think, even in plain git accidentally. I would be very happy to remove some old stuff if possible. ~I have no idea how to do so however.~ I see that your link goes over how to do it. I may be able to take some time next week to remove them.

My quarter wraps up next week so it may actually be possible!

tlambert03 commented 1 year ago

just ran this command:

git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

and got this tail

...
0bd1f1b1d34e  977KiB aicsimageio/tests/resources/s_1_t_1_c_1_z_1.czi
55dd9ddbf891  1.1MiB aicsimageio/tests/resources/s_1_t_1_c_1_z_1.tiff
5db33ed9f863  1.2MiB aicsimageio/tests/resources/s_1_t_1_c_1_z_1.ome.tiff
81071a9ef6a5  1.5MiB _modules/tifffile/tifffile.html
bbf50d5a9155  1.5MiB _modules/tifffile/tifffile.html
31f341ca5dd4  2.4MiB oldaicsimageio/tests/img/segmentation/input_1_cellWholeIndex.tiff
bc99ea4dc67a  2.6MiB presentations/2021-dask-life-sciences/presentation.ipynb
adae2ca03429  2.9MiB aicsimageio/tests/resources/example.gif
63c5e554bbd8  9.2MiB aicsimageio/tests/resources/s_3_t_1_c_3_z_5.ome.tiff
c81fe5c73f86  9.7MiB aicsimageio/tests/resources/s_1_t_10_c_3_z_1.tiff
278ab933e0a0   14MiB aicsimageio/tests/resources/s_3_t_1_c_3_z_5.czi
f7a36c40df49   15MiB aicsimageio/tests/resources/s_1_t_1_c_10_z_1.ome.tiff
e4a5c77eb02c   27MiB aicsimageio/tests/resources/variable_per_scene_dims.czi
851a737f57ae   93MiB oldaicsimageio/tests/img/segmentation/input_3_nuc_orig_img.tiff
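The listing above is per blob. To see which file types dominate overall, the same `--batch-check` output (before the `sed`/`numfmt` formatting steps) can be summed per extension. This is a rough sketch and the function name `blob_bytes_by_extension` is hypothetical. It counts every distinct blob in history, so a file that changed five times is counted five times.

```shell
# Hypothetical helper: sum blob sizes per file extension.
# Input (stdin): lines of "<type> <sha> <size-in-bytes> <path>", as produced by
#   git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)'
# Output: "<total-bytes> <extension>", largest total first.
blob_bytes_by_extension() {
  awk '
    $1 == "blob" && NF >= 4 {
      name = $0
      sub(/^[^ ]+ [^ ]+ [^ ]+ /, "", name)   # drop type, sha, size; keep the path
      sub(/.*\//, "", name)                   # keep only the file name
      ext = "(none)"
      if (name ~ /\./) { ext = name; sub(/.*\./, "", ext) }
      total[ext] += $3
    }
    END { for (e in total) printf "%.0f %s\n", total[e], e }
  ' | sort -k1,1nr
}

# Usage:
#   git rev-list --objects --all |
#     git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
#     blob_bytes_by_extension
```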

so it might be as easy as

bfg --delete-files "{*.tiff,*.czi}"

(that syntax found here)
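For reference, the BFG documentation suggests running against a fresh `git clone --mirror` rather than a working clone, then expiring the reflog and garbage-collecting before pushing. A minimal sketch of that sequence follows. The function names `run` and `rewrite_history`, the `DRY_RUN` switch, and the argument order are illustrative assumptions, and the final force-push is deliberately left out so the result can be inspected first.

```shell
# Print each command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# rewrite_history <repo-url> <bfg-glob> <mirror-dir>
rewrite_history() {
  url=$1       # repository to clone
  pattern=$2   # BFG file-name glob, e.g. '{*.tiff,*.czi}'
  mirror=$3    # directory for the bare mirror clone
  run git clone --mirror "$url" "$mirror" &&
  run bfg --delete-files "$pattern" "$mirror" &&
  run git -C "$mirror" reflog expire --expire=now --all &&
  run git -C "$mirror" gc --prune=now --aggressive
}

# Usage (prints the commands without running them):
#   DRY_RUN=1; rewrite_history <repo-url> '{*.tiff,*.czi}' aicsimageio.git
```

Setting `DRY_RUN=1` first is a cheap way to review the exact commands before rewriting anything.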

toloudis commented 1 year ago

What are the odds that someone will need to build and test an old version with those assets? (Famous last words.) I approve.

AetherUnbound commented 1 year ago

Works for me! 😄 BFG seems like the perfect tool here 💯

evamaxfield commented 1 year ago

Okay that is approval from @toloudis and @AetherUnbound. I am running the BFG.

evamaxfield commented 1 year ago

Someone send help:

~/active/cell/aicsimageio on main [?] env base Python v3.7.12 gcloud evamaxfieldbrown@gmail.com
❯ bfg --delete-files "{*.tiff,*.czi}"

Using repo : /home/eva/active/cell/aicsimageio/.git

Found 101 objects to protect
Found 34 commit-pointing refs : HEAD, refs/heads/admin/include-fsspec-dep-for-czi-in-readme, refs/heads/main, ...
Found 42 tag-pointing refs : refs/tags/v3.2.2, refs/tags/v3.2.3, refs/tags/v3.3.0, ...

Protected commits
-----------------

These are your protected commits, and so their contents will NOT be altered:

 * commit 25d561ef (protected by 'HEAD')

Cleaning
--------

Found 979 commits
Cleaning commits:       100% (979/979)
Cleaning commits completed in 525 ms.

Updating 74 Refs
----------------

    Ref                                                              Before     After   
    ------------------------------------------------------------------------------------
    refs/heads/admin/include-fsspec-dep-for-czi-in-readme          | 97fc79fa | a6625af5
    refs/heads/main                                                | 25d561ef | 797b7ea6
    refs/remotes/origin/admin/include-fsspec-dep-for-czi-in-readme | 97fc79fa | a6625af5
    refs/remotes/origin/admin/support-py311                        | f8b01551 | f975a316
    refs/remotes/origin/benchmark-results                          | 3f8898dc | 1f07d079
    refs/remotes/origin/feature/ome-metadata-with-save             | c079764e | bccd3e3a
    refs/remotes/origin/feature/v5-proto                           | d519acd5 | 7144b252
    refs/remotes/origin/feature/zarrwriter                         | ff7f8c78 | e6cff2bd
    refs/remotes/origin/fix/imageio-2.22                           | 0ea02b29 | e279edcb
    refs/remotes/origin/fix/tiff_handle_dim_i                      | c712cdac | 977c1a63
    refs/remotes/origin/gh-pages                                   | 58e853f7 | e68e37af
    refs/remotes/origin/main                                       | 25d561ef | 797b7ea6
    refs/remotes/origin/oldaicsimageio                             | f2e52829 | 419d2465
    refs/remotes/origin/v3                                         | e8349900 | 693860ec
    refs/tags/v3.0.0                                               | 042a55f6 | 1687c21d
    ...

Updating references:    100% (74/74)
...Ref update completed in 37 ms.

Commit Tree-Dirt History
------------------------

    Earliest                                              Latest
    |                                                          |
    DDDDDDDDDDDDDDDDDDDDmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm

    D = dirty commits (file tree fixed)
    m = modified commits (commit message or parents changed)
    . = clean commits (no changes to file tree)

                            Before     After   
    -------------------------------------------
    First modified commit | ba9f557b | c87264d1
    Last dirty commit     | 2d089a9e | a217cdb2

Deleted files
-------------

    Filename                                Git id                                
    ------------------------------------------------------------------------------
    T=5_Z=3_CH=2_CZT_All_CH_per_Slice.czi | 2cdc58af (133 B )                     
    input_1_cellWholeIndex.tiff           | 31f341ca (2.4 MB)                     
    input_3_nuc_orig_img.tiff             | 851a737f (92.9 MB)                    
    s_1_t_10_c_3_z_1.tiff                 | c81fe5c7 (9.7 MB)                     
    s_1_t_1_c_10_z_1.ome.tiff             | f7a36c40 (15.1 MB)                    
    s_1_t_1_c_1_z_1.czi                   | 132c641d (132 B ), 0bd1f1b1 (977.3 KB)
    s_1_t_1_c_1_z_1.ome.tiff              | 5db33ed9 (1.2 MB)                     
    s_1_t_1_c_1_z_1.tiff                  | 55dd9ddb (1.1 MB)                     
    s_3_t_1_c_3_z_5.czi                   | 278ab933 (14.0 MB), 89fbdcdd (133 B ) 
    s_3_t_1_c_3_z_5.ome.tiff              | 63c5e554 (9.2 MB)                     
    test_5_dimension.czi                  | 42ca65a9 (132 B )                     
    variable_per_scene_dims.czi           | e4a5c77e (26.7 MB)                    

In total, 1642 object ids were changed. Full details are logged here:

    /home/eva/active/cell/aicsimageio.bfg-report/2022-12-09/11-50-17

BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive

~/active/cell/aicsimageio on main [?] env base Python v3.7.12 gcloud evamaxfieldbrown@gmail.com
❯ git reflog expire --expire=now --all && git gc --prune=now --aggressive
Enumerating objects: 30778, done.
Counting objects: 100% (30778/30778), done.
Delta compression using up to 16 threads
Compressing objects: 100% (30054/30054), done.
Writing objects: 100% (30778/30778), done.
Total 30778 (delta 24677), reused 4478 (delta 0), pack-reused 0

~/active/cell/aicsimageio on main [?] env base Python v3.7.12 gcloud evamaxfieldbrown@gmail.com took 9s
❯ git push --force
Enumerating objects: 3881, done.
Counting objects: 100% (3881/3881), done.
Delta compression using up to 16 threads
Compressing objects: 100% (1112/1112), done.
Writing objects: 100% (3881/3881), 6.82 MiB | 6.71 MiB/s, done.
Total 3881 (delta 2936), reused 3640 (delta 2744), pack-reused 0
remote: Resolving deltas: 100% (2936/2936), done.
remote: error: GH008: Your push referenced at least 20 unknown Git LFS objects:
remote:     54de2e71a92bdb440cd1cce476a9cd15ae42f57def6836a95d966e3be65ae628
remote:     1ea387a6eb3040fed7390ef8a6b8ba256002692827647062873ea68a24e86d9f
remote:     df9ab243a43fe0681bf4548bf40d6769893aa08c50988af4ce2f40352a5b42b2
remote:     ...
remote: Try to push them with 'git lfs push --all'.
To github.com:AllenCellModeling/aicsimageio.git
 ! [remote rejected]   main -> main (pre-receive hook declined)
error: failed to push some refs to 'github.com:AllenCellModeling/aicsimageio.git'

evamaxfield commented 1 year ago

Should I really be pushing LFS stuff???
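One likely reading of GH008 is that the rewritten commits still contain Git LFS pointer files whose objects the remote does not have. The 132 B and 133 B entries in the BFG report above look like such pointers. Pointers with extensions other than .tiff and .czi would not have matched the glob. The sketch below checks whether a blob is an LFS pointer, based on the pointer format in the Git LFS specification; the function name `lfs_pointer_oid` is hypothetical.

```shell
# Hypothetical helper: print the LFS object id if stdin is a Git LFS pointer
# file; exit non-zero otherwise. Pointer files are small text blobs of the form
#   version https://git-lfs.github.com/spec/v1
#   oid sha256:<64 hex chars>
#   size <bytes>
lfs_pointer_oid() {
  # Pointer files are under 1024 bytes, so reading more is unnecessary.
  head -c 1024 | awk '
    NR == 1 { is_pointer = ($0 == "version https://git-lfs.github.com/spec/v1") }
    is_pointer && $1 == "oid" { sub(/^sha256:/, "", $2); print $2; found = 1 }
    END { exit found ? 0 : 1 }
  '
}

# Usage, with a blob id taken from `git rev-list --objects --all`:
#   git cat-file -p <blob-id> | lfs_pointer_oid
```

Comparing the printed ids against the ones listed in the push error would show which remaining paths still point at missing LFS objects.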

evamaxfield commented 1 year ago

I cannot wait to move over to an entirely new repo in bioio where we don't have LFS history.

AetherUnbound commented 1 year ago

There don't seem to be a whole lot of good answers to this online 😅 We could maybe ignore the commits that involve LFS? Or would it be best to just wait until we have a fresh repo?

toloudis commented 1 year ago

Rename aicsimageio to aicsimageio-legacy. Start new aicsimageio history from current head. Problem solved!
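The "new history from current head" half of that idea can be sketched with an orphan branch, which has no parent commits but keeps the current index and working tree. The function name `start_fresh_history` is hypothetical, and renaming the old repository is a separate step on the hosting side that is not shown here.

```shell
# start_fresh_history <new-branch> <commit-message>
# Creates a parentless branch whose single commit is a snapshot of the current HEAD.
start_fresh_history() {
  branch=$1
  message=$2
  # --orphan switches to a new branch with no history; the index still holds
  # the files from the previous HEAD, so they can be committed directly.
  git checkout --orphan "$branch" &&
  git commit -q -m "$message"
}

# Usage, from inside a clone:
#   start_fresh_history fresh-main "Start new history from current head"
#   git rev-list --count HEAD
```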

tlambert03 commented 1 year ago

certainly don't wanna cause any undue stress here :) so feel free to put this on the backburner if desired!

toloudis commented 1 year ago

It's conceivable that the very first edition of bioio will be identical to aicsimageio, but with the code separated into logical separate repositories, and each reader repo would manage its own test resources. This could be a precursor to making the other intended improvements (e.g. making it easier to write a new Reader from scratch, improving some of the API, etc.).

While that helps with the history/cloning problem, there is still the burden of managing potentially large stores of test resources, especially if we consider TIFF and OME-Zarr to be "core" for bioio.

SeanLeRoy commented 1 year ago

Closing due to the upcoming release of bioio. aicsimageio is moving into "maintenance" mode, where only high-impact bugfixes (or community-contributed work) will be done in aicsimageio. Instead of aicsimageio, we are creating a package, soon to be released, called bioio. See the reason for this change here.

If this issue is still relevant to anyone (@tlambert03) feel free to re-open this issue.