Closed zhang-alvin closed 4 years ago
One thing we can do is move the notebooks directory to a standalone repo and add it back as a submodule. The idea is for air-water-vv to continue being a repository of test problems that travis will run, so in the future we can migrate more of the tests that require data files to air-water-vv as well.
I think your main point is that running bfg is a matter of timing, so we may want to discuss on Wednesday a timeline.
Perhaps use erdc/proteus_old, made through a repository copy.
The size of the current repo including git-lfs files from a raw clone:
du -h -d1 | sort -h
4.0K ./air-water-vv
4.0K ./stack
548K ./doc
604K ./scripts
155M ./notebooks
189M ./proteus
667M ./.git
1012M .
With git lfs fetch --all, this increases:
du -h -d1 | sort -h
4.0K ./air-water-vv
4.0K ./stack
548K ./doc
604K ./scripts
155M ./notebooks
189M ./proteus
871M ./.git
1.2G .
Current size based on website:
count: 86
size: 676.00 KiB
in-pack: 111953
packs: 2
size-pack: 389.25 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes
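The listing above has the shape of git count-objects -vH output. As a minimal sketch, the packed size can also be pulled out in a script. The helper name pack_size_mib is made up for illustration, and it assumes the plain -v output (where size-pack is reported in KiB) arrives on standard input.

```shell
# Hypothetical helper: read `git count-objects -v` output on stdin and
# print the size-pack value converted from KiB to MiB.
pack_size_mib() {
    awk -F': ' '$1 == "size-pack" { printf "%.2f MiB\n", $2 / 1024 }'
}

# Example with canned input (398592 KiB corresponds to the 389.25 MiB above).
printf 'count: 86\nsize-pack: 398592\n' | pack_size_mib
```

Inside a clone this would be run as git count-objects -v | pack_size_mib.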
Output from the git-sizer tool:
Processing blobs: 80798
Processing trees: 24033
Processing commits: 7151
Matching commits to trees: 7151
Processing annotated tags: 4
Processing references: 493
| Name | Value | Level of concern |
| ---------------------------- | --------- | ------------------------------ |
| Biggest objects | | |
| * Commits | | |
| * Maximum size [1] | 93.9 KiB | * |
| * Trees | | |
| * Maximum entries [2] | 1.30 k | * |
| * Blobs | | |
| * Maximum size [3] | 94.1 MiB | ********* |
| | | |
| Biggest checkouts | | |
| * Maximum path depth [4] | 12 | * |
| * Maximum path length [5] | 166 B | * |
| * Number of symlinks [6] | 29.9 k | * |
Sizes of largest 20 files:
All sizes are in kB's. The pack column is the size of the object, compressed, inside the pack file.
size pack SHA location
96377 7554 7d7308e36e5160f2653e538b25b6238c5a116051 proteusModule/test/problemDescriptions/runAll.out
79592 79457 16b89a8c58fca6367ee20cff37ae59e701be279b notebooks/Presentations/FEMDEM/cylinder.gif
68959 472 bce832332226e06fcef3f5436df90a12a2ca367c proteus/tests/solver_tests/comparison_files/A_mat.npy
66510 66530 9d4beba5d8c213ecc7979e27604bbc02cef2e73e notebooks/Presentations/FEMDEM/dambreakib.gif
20985 8874 432df7b3d07f938b27e04125819f1688ebdba402 proteus/tests/amg_tests/import_modules/saddle_point_mat_1
20985 10275 95e4011ffb6d0bccbdc9291170b93efaaf860652 proteus/tests/solver_tests/import_modules/rans2p_step_newton_5.bin
18343 6511 fdcb3cbd69e0ce63134f61bc5077a01ac14b88fe solitarySlope2/pastRuns/slope/mesh.neigh
14202 5226 596f1e7be6aba15c800a02af6e00fbc6497455f6 proteus/tests/yy_practice/test_ls_burgers_PG_ALE/burgers_tri_be_64_1_ls.h5
13971 3962 6b1799466f93171e92f521da4cd148e7e8621d93 solitarySlope2/pastRuns/slope/mesh.edge
12979 1606 fa5edb1220c22fd7fe89e987dd0a264c72b7f6a3 RANS2P2D.h.gch
11190 7344 1617609372208c51c13c4229d390421adafe4d3a proteus/tests/amg_tests/import_modules/bcw_matrix_1
8936 4706 a1f41b402ec9cdf0b4855fc7fd0c31eaf28c1b47 proteus/tests/griffiths_lane_6/elastoplastic_expected.h5
7489 3216 bf5bacd30a457d4a1c5bdd5b609fef9d356551f3 proteus/tests/levelset/vortex/vortex_c0p1cg_bdf_2_level_1_expected.h5
5282 2587 039216719c6124e91c6fa11b316abfacce3b3c66 proteus/tests/solver_tests/import_modules/NSE_step_no_slip.bin
5251 564 f586f908ab07d7fa79d2c40476d02e02bd2c8d5e capi/html/_a_d_r_8cpp_source.html
4917 2076 84c98c5eebce02b3dd2e6f0f602fb09c117a4e79 proteus/tests/cylinder2D/conforming_rans2p/comparison_files/T1_rans2p.h5
4687 455 aff6bb31712741c735d5dfcf7530ce5310d3001f proteus/mbd/ChRigidBody.cpp
4677 541 977d014f6ee7600cb9cd38c22f4a241406dec2db capi/html/_wave_tools_8c_source.html
4479 2028 3ff0a9d577c23de1fa6ebeae6b581c21ef4c8c1b proteus/tests/cylinder2D/conforming_rans3p/comparison_files/T8P2.h5
4424 811 ebe1ecda5d70aca2788e5063c7cf8eda8cc83c2e doctrees/environment.pickle
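For anyone wanting to reproduce a listing like this, here is one possible sketch. It is not the script that produced the table above: it reports uncompressed blob sizes in bytes and has no pack column. The function name largest_blobs is hypothetical.

```shell
# Sketch: print the N largest blobs reachable from any ref as
# "size_in_bytes sha path", largest first. The function name is hypothetical.
largest_blobs() (
    repo=$1          # path to a clone (bare or not)
    n=${2:-20}       # number of rows to print, default 20
    git -C "$repo" rev-list --objects --all |
        git -C "$repo" cat-file \
            --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
        awk '$1 == "blob" { sub(/^blob /, ""); print }' |
        sort -k1,1nr |
        head -n "$n"
)
```

Usage would be something like largest_blobs proteus.git 20. Blobs that were deleted from the working tree long ago still show up, since the walk covers the whole history.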
I've created a cleaned copy of the repo with only the master branch existing at https://github.com/zhang-alvin/cleanProteus. The size of a raw clone plus git lfs fetch --all is:
du -h -d1 | sort -h
4.0K ./air-water-vv
4.0K ./stack
548K ./doc
604K ./scripts
2.6M ./notebooks
76M ./proteus
159M ./.git
238M .
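To put a number on the change, here is a small arithmetic sketch comparing the .git sizes from the two du listings (871M with all LFS objects fetched before cleaning, 159M after). The helper name reduction is made up, and both arguments are assumed to be in the same unit.

```shell
# Hypothetical helper: given a size before and after (same unit), print the
# reduction factor and the percentage saved.
reduction() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        printf "%.2fx (%.1f%% smaller)\n", before / after, (1 - after / before) * 100
    }'
}

# .git directory sizes in MB, taken from the du output in this thread.
reduction 871 159
```

The du -h figures are rounded, so the result is only approximate.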
For reference, the cleaned repo was made through the following steps:
git clone --bare git@github.com:erdc/proteus.git
cd proteus.git
git lfs fetch --all
cd ..
java -jar bfg --delete-files "{insert filenames}" --no-blob-protection proteus.git/
cd proteus.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
cd ..
java -jar bfg --convert-to-git-lfs "{insert filenames}" --no-blob-protection proteus.git/
cd proteus.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
export REMOTE_NAME=origin
git branch -r | grep "${REMOTE_NAME}/" | grep -v 'master$' | grep -v HEAD | sed -E "s/^[[:space:]]*${REMOTE_NAME}\///g" | while read line; do git push $REMOTE_NAME :heads/$line; done;
git push newremote
git lfs push --all newremote
This assumes that large files (>1 MB) not referenced by the master branch are to be removed, and that large files that are referenced need to be tracked by lfs.
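A rough way to see which blobs fall under that assumption is sketched below. It treats "referenced by master" as "present in the tree at the tip of master", which is my reading rather than something stated above, and the function name and the byte threshold argument are hypothetical.

```shell
# Sketch: list blobs of at least MIN_BYTES that exist somewhere in history
# but are absent from the tree at the tip of master.
# Output rows are "sha size_in_bytes path".
large_blobs_not_on_master() (
    repo=$1
    min_bytes=${2:-1000000}
    work=$(mktemp -d)
    # Blob ids present at the tip of master (assumption: this is what
    # "referenced by master" means).
    git -C "$repo" ls-tree -r master | awk '{ print $3 }' | sort -u > "$work/master_ids"
    # Every blob in history that is at least min_bytes large.
    git -C "$repo" rev-list --objects --all |
        git -C "$repo" cat-file \
            --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        awk -v min="$min_bytes" \
            '$1 == "blob" && $3 + 0 >= min + 0 { sub(/^blob /, ""); print }' |
        sort > "$work/big"
    # Keep only rows whose id is not in the master set.
    awk 'FILENAME == ARGV[1] { keep[$1] = 1; next } !($1 in keep)' \
        "$work/master_ids" "$work/big"
    rm -rf "$work"
)
```

Rows printed by large_blobs_not_on_master proteus.git 1000000 would be candidates for deletion; large blobs that are on master and therefore not printed would be candidates for lfs tracking.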
Looks like over a factor of 4 reduction, right? Does it seem like a reasonable plan to do one more release on the 1.7 branch with the history as is, then copy that repository to proteus-old, then run your cleaning commands? I suppose we should try to close out as many branches as possible before doing the history rewrite. Since those old branches would be intact on the backed up (old) repository, it wouldn't necessarily need a lot of coordination.
Yes, it's quite a drastic reduction in size since I remove any file (object) that is not referenced by the master branch. Assuming none of the previous releases had any true dependencies on such large files, then it might be possible to preserve the various releases as branches on the repo.
Regarding the number of branches, I think it would be best to do the following:
This doesn't prevent us from pruning/closing some of the older branches before step 1, but it becomes a matter of knowing which branches to close. It is important that everyone makes a fork of proteus afterward so that the actively developed branches have the same history as master.
That sounds good. How about this for some of the details on how to get this finished:
@cekees for the documentation, it might actually be easier to bring it back on the main repo if we want to build it automatically (see #1040)
Just an updated roadmap for this issue:
Merge in #1039, #1040, #1052.
Then I'll go and duplicate/clone the repo into proteus_old and go ahead and remove all files larger than 25-50 kB that do not currently exist in the master branch. Any necessary files will be added back after the purge.
Below is a list of the files + file sizes in the repo in descending order. The first column denotes the size in kB, the second the packed size in kB, the third the SHA of the file:
As one might expect, there are a number of files from the tests that need to be removed. There are also other types of files:
12979 1606 fa5edb1220c22fd7fe89e987dd0a264c72b7f6a3 RANS2P2D.h.gch
4687 455 aff6bb31712741c735d5dfcf7530ce5310d3001f proteus/mbd/ChRigidBody.cpp
4424 811 ebe1ecda5d70aca2788e5063c7cf8eda8cc83c2e doctrees/environment.pickle
etc.
Removing the first 150 or so files would shrink the repo 70-80%.
Separate notebooks repo through the following instructions: https://medium.com/@ayushya/move-directory-from-one-repository-to-another-preserving-git-history-d210fa049d4b
@zhanga is that for files that do not exist anymore on the latest master? Would be nice to get a 70-80% shrinkage! There are a few files .cpp and .c that are autogenerated code in the list, like ChRigidBody.cpp or WaveTools.cpp, so these can go
@tridelat These are just the largest files in the repository, not just on the latest master. Part of the difficulty is that it's not clear sometimes which .cpp files are source code or cythonized/autogenerated code, which is why I'd go for a "remove top 150 + .h5 files" approach.
It doesn't look like there's much stuff in the current doc directory. Are there any large documentation files that should be moved out of the repo? And I presume things like capi/html/_wave_tools_8c_source.html were also affiliated with documentation from an older approach?
I'm OK with just checking if the large cpp files are actually in the current master.
@zhang-alvin yes that's what I thought, I meant to ask if it was the top 150 files that are anywhere in the repo but not in the latest master.
The docs directory only contains what is necessary to build the docs, not the built docs themselves; those are in the gh-pages branch (and are going to move to a separate repo soon).
List of files to be deleted. The cleaning process doesn't look for unique paths and instead simply matches filenames for deletion. The result is that some tests will fail because some files share names with those being removed from history. Such files will be added back after the fact.
The first column is the size in bytes (i.e. first item is about 98MB)
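Since the deletion matches on file name rather than path, the collisions can be predicted ahead of time. Below is a minimal sketch, assuming one file holding the names to delete and another holding the paths to keep (for example from git ls-tree -r --name-only master). The function name is hypothetical, and it only handles exact names, not glob patterns.

```shell
# Sketch: print every path in KEEP_LIST whose base name also appears in
# DELETE_LIST. Entries in DELETE_LIST may be bare names or full paths.
# $1 = DELETE_LIST file, $2 = KEEP_LIST file, one entry per line.
basename_collisions() {
    awk -F/ 'FILENAME == ARGV[1] { del[$NF] = 1; next } ($NF in del)' "$1" "$2"
}
```

Any path it prints is a file that would disappear from history along with its namesake and would need to be added back afterward.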
The repo was cloned with git clone --mirror and had a size of about 1.3 GB. The notebooks, capi, doctrees, api, _sources, _images, and externalPackages directories were then removed. This was done to make the list of largest files simpler to understand while presumably not removing any core functionality/source code.
The criteria for choosing the remaining files were:
1) for files larger than 100 kB, remove unless it's source code
2) for files less than 100 kB but larger than 10 kB, remove unless it's source code (.c, .h, .py, .pyx, .cpp) or testing related
Some .cpp files are actually Cython-generated files. The files selected for removal are largely .h5, .ipynb, .bin, .dat, .txt, mesh, and html files.
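The two criteria can be written down as a filter over "size_in_bytes path" lines. This is only a sketch under my own assumptions: 1 kB is taken as 1000 bytes, "source code" is decided purely by the listed extensions (so Cython-generated .cpp files count as source), and "testing related" is guessed from a test/ or tests/ path component. The function name is hypothetical.

```shell
# Sketch: read "size_in_bytes path" lines on stdin and print the paths
# that the two criteria would select for removal.
select_for_removal() {
    awk '{
        size = $1 + 0
        path = substr($0, index($0, " ") + 1)
        # Assumption: source code is identified by extension only.
        is_source = (path ~ /\.(c|h|py|pyx|cpp)$/)
        # Assumption: testing related means a test/ or tests/ directory.
        is_test = (path ~ /^tests?\// || path ~ /\/tests?\//)
        if (size > 100000) {
            if (!is_source) print path
        } else if (size > 10000) {
            if (!is_source && !is_test) print path
        }
    }'
}
```

Feeding it a size-sorted listing of the history gives a first draft of the deletion list, which still needs a manual look for generated code.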
I tried looking into removing individual git blobs, but that didn't play well with git lfs for an unknown reason.
Cleaned mirror clone pushed to this repo. The original mirror clone + git lfs fetch --all was 1.3 GB. The cleaned repo mirror clone with git lfs fetch --all is 444 MB. There's a possibility of additional size reduction when more branches are deleted.
Users will see an even smaller footprint as a regular git clone + git lfs fetch --all yields only a 93 MB directory.
List of files deleted from the second pass of cleanup via bfg. The repo size seems to have increased since the initial cleanup, possibly because of additional branches. It is currently at 96 MB with git lfs fetch --all. Removing the listed files also didn't reduce the size by as much as the file sizes suggest: removing files totaling 20 MB changed the repo size by only about 4 MB in before-and-after tests.
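One thing that may account for part of that gap is that git stores objects compressed, so the raw size of a file can be far larger than the space it takes in the pack; the earlier listing of the 20 largest files shows both columns. A small sketch that totals the two columns for a few rows of that listing (the function name is made up):

```shell
# Sketch: read "size pack sha path" rows (sizes in kB) on stdin and print
# the total raw size next to the total packed size.
sum_size_and_pack() {
    awk '{ raw += $1; packed += $2 }
         END { printf "raw=%d kB packed=%d kB\n", raw, packed }'
}

# Three rows copied from the listing of the 20 largest files.
sum_size_and_pack <<'EOF'
96377 7554 7d7308e36e5160f2653e538b25b6238c5a116051 proteusModule/test/problemDescriptions/runAll.out
79592 79457 16b89a8c58fca6367ee20cff37ae59e701be279b notebooks/Presentations/FEMDEM/cylinder.gif
68959 472 bce832332226e06fcef3f5436df90a12a2ca367c proteus/tests/solver_tests/comparison_files/A_mat.npy
EOF
```

Whether this explains the 20 MB versus 4 MB difference here is a guess; the same totals for the second-pass file list would show it.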
Closing the issue now as most goals were accomplished.
The repo size is still unnecessarily large. I had previously attempted to migrate image and result files (e.g. h5, png, sms) to git-lfs with git filter-branch, but it doesn't look like that was thorough enough.
The size of the repo can be seen in kB under the size tag: https://api.github.com/repos/erdc/proteus
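As a sketch of reading that tag from a script, the helper below pulls the first "size" field out of the JSON on standard input with sed. The function name is hypothetical, and it assumes the response is pretty-printed with one key per line; a real JSON parser would be more robust.

```shell
# Sketch: print the value of the first "size" key found in JSON on stdin.
# Assumes one key per line, as in a pretty-printed API response.
repo_size_kb() {
    sed -n 's/^[[:space:]]*"size":[[:space:]]*\([0-9][0-9]*\).*$/\1/p' | head -n 1
}
```

It would be used as curl -s https://api.github.com/repos/erdc/proteus | repo_size_kb.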
Alternatively, one can check with git count-objects -vH, which results in:
Looking at the 10 largest files in the repo's history, we see several image files, result (h5) files, and mesh (sms) files: (obtained from this script)
There's a tool called bfg that can migrate files of given types across the entire history of the repo. Running the following yielded a significantly smaller size for the repo:
The downside of this massive lfs migration is that the history will be modified. To avoid a mess of conflicting histories I think the following would need to be done:
git push --force