erdc / proteus

A computational methods and simulation toolkit
http://proteustoolkit.org
MIT License

Repo size > 250MB #713

Closed zhang-alvin closed 4 years ago

zhang-alvin commented 6 years ago

The repo size is still unnecessarily large. I had previously attempted to migrate image and result files (e.g. h5, png, sms) to git-lfs with git filter-branch, but it doesn't look like that was thorough enough.

The size of the repo can be seen in kB under the size field: https://api.github.com/repos/erdc/proteus

Alternatively, one can check with git count-objects -vH which results in:

count: 0
size: 0 bytes
in-pack: 60054
packs: 1
size-pack: 341.65 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes
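
To track this number over time, the pack size can be pulled out of that output with a small filter. This is a minimal sketch; the function name pack_size is made up for illustration and is not part of git.

```shell
# Print the value of the size-pack field from `git count-objects -vH`
# output supplied on stdin.
pack_size() {
    awk -F': ' '$1 == "size-pack" { print $2 }'
}

# Usage, from inside a clone:
#   git count-objects -vH | pack_size
```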

Looking at the 10 largest files in the repo's history, we see several image files, result (h5) files, and mesh (sms) files: (obtained from this script)

size   pack   SHA                                       location
96377  7554   7d7308e36e5160f2653e538b25b6238c5a116051  proteusModule/test/problemDescriptions/runAll.out
79592  79457  16b89a8c58fca6367ee20cff37ae59e701be279b  notebooks/Presentations/FEMDEM/cylinder.gif
68959  472    bce832332226e06fcef3f5436df90a12a2ca367c  proteus/tests/solver_tests/comparison_files/A_mat.npy
66510  66530  9d4beba5d8c213ecc7979e27604bbc02cef2e73e  notebooks/Presentations/FEMDEM/dambreakib.gif
17281  5335   c8b16ebf26dfc4fc5e70ff3bab1d12d46c7e7181  proteus/MeshAdaptPUMI/test/splash-cube/4-Procs/Splashcube.sms
14202  5226   596f1e7be6aba15c800a02af6e00fbc6497455f6  proteus/tests/yy_practice/test_ls_burgers_PG_ALE/burgers_tri_be_64_1_ls.h5
12979  1606   fa5edb1220c22fd7fe89e987dd0a264c72b7f6a3  RANS2P2D.h.gch
11804  3286   81694ffe6f17aac11ee5f8aeb17fd8930087239b  proteus/MeshAdaptPUMI/test/splash-cube/1-Proc/pumi_adapt0.vtu
11516  2746   4a00d868da869269e5bd6063e68aee98e6f304d6  proteus/MeshAdaptPUMI/test/simmetrixDambreak/1-Proc/Dambreak0.smb
11047  952    de681ddd517315ae23c99d4cd92b22985a6397e3  src/MeshAdaptPUMI/test/dambreak/4-Procs/Log-Case-2
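
A list like this can be approximated with plain git plumbing. The sketch below is not the script referenced above; it is one common way to rank blobs by uncompressed size, and it does not reproduce the pack column. The function names are hypothetical, and sizes are reported in units of 1024 bytes, which is an assumption about what "kB" means in the table.

```shell
# Filter: read "type sha size path" rows on stdin and print the N largest
# blobs (default 10) as "size_kB sha path", largest first.
largest_blobs() {
    awk '$1 == "blob" {
        size = $3; sha = $2
        $1 = $2 = $3 = ""            # blank the leading fields, keep the path
        sub(/^ +/, "")               # runs of spaces inside a path collapse to one
        print int(size / 1024), sha, $0
    }' |
        sort -k1,1 -n -r |
        head -n "${1:-10}"
}

# Wrapper: run inside a clone to feed every object reachable from any ref
# into the filter above.
list_largest_blobs() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        largest_blobs "$1"
}
```

Calling list_largest_blobs 10 inside a clone prints the ten largest blobs anywhere in history, including files that no longer exist on any branch tip.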

There's a tool called bfg that can migrate files of given types across the entire history of the repo. Running the following yielded a significantly smaller repo:

java -jar bfg-1.12.16.jar --convert-to-git-lfs "*.{gif,out,mp4,sms,smd,h5,smb,vtu,db,tst,png,ipynb,msh,jpg}" --no-blob-protection proteus.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
count: 0
size: 0 bytes
in-pack: 59861
packs: 1
size-pack: 53.99 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes
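
To put the before and after size-pack values side by side, the ratio can be computed directly. This is a small helper sketch; the function name reduction is hypothetical.

```shell
# Print how much smaller `after` is than `before` (same unit for both).
reduction() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        printf "%.1fx smaller (%.1f%% reduction)\n", before / after, (1 - after / before) * 100
    }'
}

# Usage with the two size-pack values above (MiB):
#   reduction 341.65 53.99
#   prints: 6.3x smaller (84.2% reduction)
```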

The downside of this massive lfs migration is that the history will be modified. To avoid a mess of conflicting histories I think the following would need to be done:

  1. all users/developers would need to push all commits and branches to the repo
  2. run the bfg utility and git push --force
  3. everybody clones the new repo to resume development

cekees commented 6 years ago

One thing we can do is move the notebooks directory to a standalone repo and add it back as a submodule. The idea is for air-water-vv to continue being a repository of test problems that travis will run, so in the future we can migrate more of the tests that require data files to air-water-vv as well.

I think your main point is that running bfg is a matter of timing, so we may want to discuss on Wednesday a timeline.

zhang-alvin commented 4 years ago

Perhaps keep the current repo as erdc/proteus_old, created through a repository copy.

zhang-alvin commented 4 years ago

The size of the current repo including git-lfs files from a raw clone:

du -h -d1 | sort -h
4.0K    ./air-water-vv
4.0K    ./stack
548K    ./doc
604K    ./scripts
155M    ./notebooks
189M    ./proteus
667M    ./.git
1012M   .

with git lfs fetch --all, this increases:

du -h -d1 | sort -h
4.0K    ./air-water-vv
4.0K    ./stack
548K    ./doc
604K    ./scripts
155M    ./notebooks
189M    ./proteus
871M    ./.git
1.2G    .

Current size based on website:

count: 86
size: 676.00 KiB
in-pack: 111953
packs: 2
size-pack: 389.25 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

Output from git-sizer tool:

Processing blobs: 80798                        
Processing trees: 24033                        
Processing commits: 7151                        
Matching commits to trees: 7151                        
Processing annotated tags: 4                        
Processing references: 493                        
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Biggest objects              |           |                                |
| * Commits                    |           |                                |
|   * Maximum size         [1] |  93.9 KiB | *                              |
| * Trees                      |           |                                |
|   * Maximum entries      [2] |  1.30 k   | *                              |
| * Blobs                      |           |                                |
|   * Maximum size         [3] |  94.1 MiB | *********                      |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Maximum path depth     [4] |    12     | *                              |
| * Maximum path length    [5] |   166 B   | *                              |
| * Number of symlinks     [6] |  29.9 k   | *                              |

Sizes of largest 20 files:

All sizes are in kB. The pack column is the size of the object, compressed, inside the pack file.
size   pack   SHA                                       location
96377  7554   7d7308e36e5160f2653e538b25b6238c5a116051  proteusModule/test/problemDescriptions/runAll.out
79592  79457  16b89a8c58fca6367ee20cff37ae59e701be279b  notebooks/Presentations/FEMDEM/cylinder.gif
68959  472    bce832332226e06fcef3f5436df90a12a2ca367c  proteus/tests/solver_tests/comparison_files/A_mat.npy
66510  66530  9d4beba5d8c213ecc7979e27604bbc02cef2e73e  notebooks/Presentations/FEMDEM/dambreakib.gif
20985  8874   432df7b3d07f938b27e04125819f1688ebdba402  proteus/tests/amg_tests/import_modules/saddle_point_mat_1
20985  10275  95e4011ffb6d0bccbdc9291170b93efaaf860652  proteus/tests/solver_tests/import_modules/rans2p_step_newton_5.bin
18343  6511   fdcb3cbd69e0ce63134f61bc5077a01ac14b88fe  solitarySlope2/pastRuns/slope/mesh.neigh
14202  5226   596f1e7be6aba15c800a02af6e00fbc6497455f6  proteus/tests/yy_practice/test_ls_burgers_PG_ALE/burgers_tri_be_64_1_ls.h5
13971  3962   6b1799466f93171e92f521da4cd148e7e8621d93  solitarySlope2/pastRuns/slope/mesh.edge
12979  1606   fa5edb1220c22fd7fe89e987dd0a264c72b7f6a3  RANS2P2D.h.gch
11190  7344   1617609372208c51c13c4229d390421adafe4d3a  proteus/tests/amg_tests/import_modules/bcw_matrix_1
8936   4706   a1f41b402ec9cdf0b4855fc7fd0c31eaf28c1b47  proteus/tests/griffiths_lane_6/elastoplastic_expected.h5
7489   3216   bf5bacd30a457d4a1c5bdd5b609fef9d356551f3  proteus/tests/levelset/vortex/vortex_c0p1cg_bdf_2_level_1_expected.h5
5282   2587   039216719c6124e91c6fa11b316abfacce3b3c66  proteus/tests/solver_tests/import_modules/NSE_step_no_slip.bin
5251   564    f586f908ab07d7fa79d2c40476d02e02bd2c8d5e  capi/html/_a_d_r_8cpp_source.html
4917   2076   84c98c5eebce02b3dd2e6f0f602fb09c117a4e79  proteus/tests/cylinder2D/conforming_rans2p/comparison_files/T1_rans2p.h5
4687   455    aff6bb31712741c735d5dfcf7530ce5310d3001f  proteus/mbd/ChRigidBody.cpp
4677   541    977d014f6ee7600cb9cd38c22f4a241406dec2db  capi/html/_wave_tools_8c_source.html
4479   2028   3ff0a9d577c23de1fa6ebeae6b581c21ef4c8c1b  proteus/tests/cylinder2D/conforming_rans3p/comparison_files/T8P2.h5
4424   811    ebe1ecda5d70aca2788e5063c7cf8eda8cc83c2e  doctrees/environment.pickle
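
Comparing the size and pack columns shows which files git can barely compress. In the table above, runAll.out packs to under 10% of its raw size, while the two GIFs stay at essentially full size. The sketch below lists such files; the function name and the default 0.9 threshold are arbitrary choices for illustration.

```shell
# Read "size pack sha path" rows on stdin and print the paths whose packed
# size is at least `threshold` (default 0.9) times the raw size.
# Header lines are skipped because their first field is not a number.
poorly_compressed() {
    awk -v threshold="${1:-0.9}" '
        $1 ~ /^[0-9]+$/ && $1 > 0 && $2 / $1 >= threshold { print $4 }
    '
}
```

Feeding it the first four rows of the table prints only the cylinder.gif and dambreakib.gif paths.
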

zhang-alvin commented 4 years ago

I've created a cleaned copy of the repo containing only the master branch at https://github.com/zhang-alvin/cleanProteus. The size after a raw clone and git lfs fetch --all is:

du -h -d1 | sort -h
4.0K    ./air-water-vv
4.0K    ./stack
548K    ./doc
604K    ./scripts
2.6M    ./notebooks
76M     ./proteus
159M    ./.git
238M    .

For reference, the cleaned repo was made through the following steps:

git clone --bare git@github.com:erdc/proteus.git
cd proteus.git
git lfs fetch --all
cd ..
java -jar bfg --delete-files "{insert filenames}" --no-blob-protection proteus.git/
git reflog expire --expire=now --all && git gc --prune=now --aggressive
java -jar bfg --convert-to-git-lfs "{insert filenames}" --no-blob-protection proteus.git/
git reflog expire --expire=now --all && git gc --prune=now --aggressive

export REMOTE_NAME=origin
git branch -r | grep "${REMOTE_NAME}/" | grep -v 'master$' | grep -v HEAD | sed -E "s/^[[:space:]]*${REMOTE_NAME}\///g" | while read line; do git push $REMOTE_NAME :heads/$line; done;

git push newremote
git lfs push --all
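
Before deleting remote branches in bulk, it may help to print the list first. The sketch below is a dry-run variant of the branch-deletion one-liner above, not a replacement for it. The function name is hypothetical, and it matches master exactly rather than by suffix, so a branch such as fix-master would be listed for deletion; that is a deliberate difference from the original pipeline.

```shell
# Read `git branch -r` output on stdin and print the branch names on the
# given remote (default: origin), excluding master and the HEAD pointer.
branches_to_delete() {
    del_remote="${1:-origin}"
    sed 's/^[[:space:]]*//' |
        grep -v -- ' -> ' |          # drop "origin/HEAD -> origin/master"
        grep "^${del_remote}/" |     # keep only branches on this remote
        sed "s|^${del_remote}/||" |  # strip the "origin/" prefix
        grep -v -x 'master'          # keep master itself out of the list
}

# Usage, from inside the clone:
#   git branch -r | branches_to_delete origin
# Once the list looks right, each name could be removed with:
#   git push origin --delete "$name"
```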

This assumes that large files (>1MB) not referenced by the master branch are to be removed, and that those that are referenced need to be tracked by lfs.

cekees commented 4 years ago

Looks like over a factor of 4 reduction, right? Does it seem like a reasonable plan to do one more release on the 1.7 branch with the history as is, then copy that repository to proteus-old, then run your cleaning commands? I suppose we should try to close out as many branches as possible before doing the history rewrite. Since those old branches would be intact on the backed up (old) repository, it wouldn't necessarily need a lot of coordination.

zhang-alvin commented 4 years ago

Yes, it's quite a drastic reduction in size since I removed any file (object) that is not referenced by the master branch. Assuming none of the previous releases had any true dependencies on such large files, it might be possible to preserve the various releases as branches on the repo.

Regarding the number of branches, I think it would be best to do the following:

  1. Copy repo to proteus-old
  2. Clean history and push to proteus
  3. Have everyone fork proteus
  4. Delete branches on main proteus repo

This doesn't prevent us from pruning/closing some of the older branches before step 1, but it becomes a matter of knowing which branches to close. It is important that everyone makes a fork of proteus afterward so that the actively developed branches have the same history as master.

cekees commented 4 years ago

That sounds good. How about this for some of the details on how to get this finished:

  1. Get the current PRs merged in (except the TwoPhaseFlow refactoring, which I'm guessing @zhang-alvin you can close and reopen later). @zhang-alvin maybe you could take a crack at some of the final tweaks to the conda package to get all the tests passing (I made a few here: https://github.com/erdc/proteus/pull/1039, but I think @davidbrochart is traveling a while longer). @tridelat can probably help with the others; I think @ejtovar just needs help with test file paths and @adimako just needs to check a few things on normal direction/distance conventions.
  2. @tridelat cut the next release (go ahead and switch to your new documentation approach--I'll deactivate proteustoolkit.org on the proteus repo and point it to the new docs repo)
  3. @jhcollins and @zhang-alvin roll out modules for that release on the HPCs
  4. Once that release is done, then do your 1-4 above. If that generates a new commit of master, then you could go ahead and do another release on the 1.7 branch.

tridelat commented 4 years ago

@cekees for the documentation, it might actually be easier to bring it back on the main repo if we want to build it automatically (see #1040)

zhang-alvin commented 4 years ago

Just an updated roadmap for this issue:

Merge in #1039, #1040, #1052.

Then I'll go and duplicate/clone the repo into proteus_old and go ahead and remove all files larger than 25-50kB that do not currently exist in the master branch. Any necessary files will be added back after the purge.

zhang-alvin commented 4 years ago

Below is a list of the files + file sizes in the repo in descending order. The first column denotes the size in kB, the second the packed size in kB, the third the SHA of the file:

fileSizes.txt

As one might expect, there are a number of files from the tests that need to be removed. There are also other types of files:

12979  1606  fa5edb1220c22fd7fe89e987dd0a264c72b7f6a3  RANS2P2D.h.gch
4687   455   aff6bb31712741c735d5dfcf7530ce5310d3001f  proteus/mbd/ChRigidBody.cpp
4424   811   ebe1ecda5d70aca2788e5063c7cf8eda8cc83c2e  doctrees/environment.pickle

etc.

Removing the first 150 or so files would shrink the repo 70-80%.

zhang-alvin commented 4 years ago

Separate notebooks repo through the following instructions: https://medium.com/@ayushya/move-directory-from-one-repository-to-another-preserving-git-history-d210fa049d4b

tridelat commented 4 years ago

@zhanga is that for files that no longer exist on the latest master? Would be nice to get a 70-80% shrinkage! The list includes a few .cpp and .c files that are autogenerated code, like ChRigidBody.cpp or WaveTools.cpp, so these can go.

zhang-alvin commented 4 years ago

@tridelat These are just the largest files in the repository, not just on the latest master. Part of the difficulty is that it's not clear sometimes which .cpp files are source code or cythonized/autogenerated code, which is why I'd go for a "remove top 150 + .h5 files" approach.

It doesn't look like there's much in the current doc directory. Are there any large documentation files that should be moved out of the repo? And I presume things like capi/html/_wave_tools_8c_source.html were also associated with documentation from an older approach?

cekees commented 4 years ago

I'm OK with just checking if the large cpp files are actually in the current master.

tridelat commented 4 years ago

@zhang-alvin yes that's what I thought, I meant to ask if it was the top 150 files that are anywhere in the repo but not in the latest master.

The docs directory only contains what is necessary to build the docs, not the built docs themselves; those are in the gh-pages branch (and are going to move to a separate repo soon).

zhang-alvin commented 4 years ago

finalList_tests.txt

List of files to be deleted. The cleaning process doesn't look for unique paths and instead simply matches filenames for deletion. The result is that some tests will fail because some files share names with those being removed from history. Such files will be added back after the fact.
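
Because deletion matches on filename alone, the files that will need to be added back can be predicted by comparing basenames. The following is a minimal sketch under the assumption that both lists are plain text files with one path per line; the function name and the file names in the usage comment are hypothetical.

```shell
# Usage: name_collisions DELETE_LIST KEEP_LIST
# Both files hold one path per line. Prints every path in KEEP_LIST whose
# basename equals the basename of some path in DELETE_LIST.
name_collisions() {
    awk '
        NF == 0 { next }                                   # skip blank lines
        { n = split($0, parts, "/"); base = parts[n] }     # basename of this path
        FILENAME == ARGV[1] { doomed[base] = 1; next }     # first file: names to delete
        base in doomed { print }                           # second file: report clashes
    ' "$1" "$2"
}

# Usage, from inside a clone (delete_paths.txt is a hypothetical file):
#   git ls-tree -r --name-only master > master_paths.txt
#   name_collisions delete_paths.txt master_paths.txt
```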

The first column is the size in bytes (i.e. first item is about 98MB)

The repo was cloned with git clone --mirror and had a size of about 1.3 GB. The notebooks, capi, doctrees, api, _sources, _images, and externalPackages paths were then removed. This was done to make the list of largest files simpler to understand while presumably not removing any core functionality/source code.

The criteria for choosing the remaining files were:

  1. for files larger than 100kB, remove unless it's source code
  2. for files less than 100kB but larger than 10kB, remove unless it's source code (.c,.h,.py,.pyx,.cpp) or testing related
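
The two criteria above can be written down as a small predicate. This is only a sketch under stated assumptions: 1 kB is taken as 1000 bytes, "testing related" is approximated as any path with a tests/ or test/ directory component, and the extension check cannot tell hand-written .cpp files from generated ones. The function name is hypothetical.

```shell
# Usage: should_remove SIZE_BYTES PATH
# Exit status 0 means "remove from history", 1 means "keep".
should_remove() {
    size=$1
    path=$2
    case "$path" in
        *.c|*.h|*.py|*.pyx|*.cpp) return 1 ;;   # source code is always kept
    esac
    if [ "$size" -gt 100000 ]; then
        return 0                                # over 100 kB and not source
    fi
    if [ "$size" -gt 10000 ]; then
        case "$path" in
            tests/*|test/*|*/tests/*|*/test/*) return 1 ;;  # assumed "testing related"
        esac
        return 0                                # 10-100 kB, not source, not test
    fi
    return 1                                    # 10 kB or smaller: outside both criteria
}

# Usage:
#   should_remove 50000 doc/figure.png && echo "remove"
```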

Some .cpp files are actually Cython-generated output. The files selected for removal are largely .h5, .ipynb, .bin, .dat, .txt, mesh, and html files.

I tried looking into removing individual git blobs, but that didn't play well with git lfs for an unknown reason.

zhang-alvin commented 4 years ago

Cleaned mirror clone pushed to this repo. The original mirror clone + git lfs fetch --all was 1.3 GB. The cleaned repo mirror clone with git lfs fetch --all is 444 MB. There's a possibility of additional size reduction when more branches are deleted.

Users will see an even smaller footprint as a regular git clone + git lfs fetch --all yields only a 93 MB directory.

zhang-alvin commented 4 years ago

secondPass_deleteList.txt

List of files deleted in the second pass of cleanup via bfg. The repo size seems to have increased since the initial cleanup, possibly because of additional branches. It is currently at 96MB with git lfs fetch --all. Removing the listed files also didn't yield a size reduction proportional to their size: removing files totaling 20MB seemed to change the repo size by only 4MB in before-and-after tests.
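
One plausible contributor, not confirmed in this thread, is that git stores objects compressed, so the packed size of a file is closer to what its removal frees than the raw size is. The sketch below totals both columns, assuming the list uses the same "size pack SHA location" layout as the tables earlier in this thread; the function name is hypothetical.

```shell
# Read "size pack sha path" rows on stdin and print the total raw size and
# the total packed size, in the same unit as the input columns.
size_totals() {
    awk '$1 ~ /^[0-9]+$/ { raw += $1; packed += $2 }
         END { printf "raw=%d packed=%d\n", raw, packed }'
}
```

For the three example rows quoted earlier (RANS2P2D.h.gch, ChRigidBody.cpp, environment.pickle) this prints raw=22090 packed=2872, a gap of the same order as the one observed here.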

zhang-alvin commented 4 years ago

Closing the issue now as most goals were accomplished.