Open Gui-FernandesBR opened 3 days ago
Thanks for keeping the issue alive. I am still interested in applying Git Large File Storage to this repo, but before diving in I decided to do my "homework": I spent the day reading your documentation and trying different install procedures. You can assign me the task if you wish!
Regards
Nice job!
I'm also trying to read more about git filter-branch and the BFG Repo-Cleaner. I don't know if I will find time to actually use them on this repo; I'm studying the tools more than using them. But if I make any progress I will let you know.
I have used git LFS in this repo; maybe there's something in that repo that could help us.
EDIT: migrating to git-lfs on GitHub is better described by GitHub here.
Currently the 'data' folder has the following distribution of files:
# All Files
RocketPy on master is 📦 v1.6.1 via 🐍 v3.12.3
❯ du -h --max-depth=0 data
162M data
# .csv files
RocketPy on master is 📦 v1.6.1 via 🐍 v3.12.3
❯ find -type f -name "*.csv" -exec du -ch {} + | grep 'total' | awk '{print $1}'
20M
# .nc files
RocketPy on master is 📦 v1.6.1 via 🐍 v3.12.3
❯ find -type f -name "*.nc" -exec du -ch {} + | grep 'total' | awk '{print $1}'
142M
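The per-extension totals above can also be gathered with one small helper. This is a minimal sketch: `ext_bytes` is a hypothetical name, and it reports the apparent size in bytes rather than the disk usage that `du` prints, so the numbers will differ slightly from the ones above.

```shell
# ext_bytes DIR EXT: total size in bytes of all files under DIR ending in .EXT.
# Hypothetical helper; counts file contents, not allocated disk blocks.
ext_bytes() {
    find "$1" -type f -name "*.$2" -exec cat {} + | wc -c | tr -d '[:space:]'
    echo
}

# Example, run from the repository root:
#   ext_bytes data nc
#   ext_bytes data csv
```

Reading every file through `cat` is slower than `du`, but it gives a single total even when `find` has to split a long argument list.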
There is also about 1 MB of .csv and some tiny .nc files in both the tests (fixtures) and docs folders. It looks like a case for tracking those files with git-lfs.
Git LFS official repo has a tutorial on how to migrate a repository.
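For reference, tracking a file type with Git LFS comes down to one pattern line in `.gitattributes`. The sketch below assumes the git-lfs extension is installed and that `.nc` and `.csv` are the patterns of interest; `lfs_attribute_line` is a hypothetical helper that only previews the line `git lfs track` would record.

```shell
# In a clone with git-lfs installed, tracking would look like:
#   git lfs install
#   git lfs track "*.nc"
#   git add .gitattributes
#
# `git lfs track` records each pattern as one line in .gitattributes.
# Hypothetical helper: print that line so it can be reviewed beforehand.
lfs_attribute_line() {
    printf '%s filter=lfs diff=lfs merge=lfs -text\n' "$1"
}

lfs_attribute_line "*.nc"
lfs_attribute_line "*.csv"
```

Tracking only affects new commits. Files already in the history stay there until the history is rewritten, for example with `git lfs migrate import`.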
Problems I foresee in implementing this:
Alternative:
Those were the investigations for today. As soon as I implement git-lfs on my repo I will bring you the numbers and more discussion, if needed.
Just now I actually read your comment. I will look into the tools you mentioned, the paper repo seems to be a different case (did you migrate?). I think the main problem is doing the migration and coordinating with everyone else to use git-lfs.
@aureliobarbosa I have to say I liked the idea of trying the alternative solution first, and then we can try the git LFS. Everything in the data folder is now "definitive", and those files which we deleted we may never need to restore again.
Hey @Gui-FernandesBR,
I agree with you about trying to clean the git history. In this direction, I evaluated tools for cleaning the git history and found that git-filter-repo seems to be a better solution. It has an option to analyze the size of previously deleted files, folders, and files by extension (inside the git history). Since I assume you are going to keep versions of the data files inside the repository, I chose to investigate the sizes by file, sorted from largest to smallest. Below is a snapshot of the CSV file I generated (I will send it to the team via Discord).
Contrary to initial expectations, only a few big CSV files are stored in the git history. The villains include .nc files, as expected, and notebooks (which store data inside them, of course...). The first big .py file is number 22 on the list and takes about 9 MB (the first version of RocketPy?). The second .py file on the list is number 51. By excluding 49 files on this list you would reduce the size of your git repo by 660 MB. It is important to remember that these files have been deleted from the tip of the main branch.
Considering this, my recommendation would be to delete only those 49 files, since this seems to be the simplest action. After installing git-filter-repo, this operation can easily be done by listing the undesired file paths in a single text file and running:
git filter-repo --invert-paths --paths-from-file files-i-dont-want-anymore.txt
Note that it DOES rewrite the git history, and all developers would need to clone the repository again. I recommend doing this 'surgery' once all the PRs you expect before the next minor version are finished.
mycode/projects/rocketpy-dev
❯ more path-deleted-sizes.csv
1, 46978562, 16949319, 2019-02-07, 'docs/sampleDispersionDataReader.ipynb'
2, 46978562, 16949319, 2019-02-07, 'disp/sampleDispersionDataReader.ipynb'
3, 46902996, 40188650, 2023-01-01, 'data/weather/Alcantara_2016_ERA-5.nc'
4, 46774852, 40263054, 2023-01-01, 'data/weather/Alcantara_2017_ERA-5.nc'
5, 46774852, 40126497, 2023-01-01, 'data/weather/Alcantara_2015_ERA-5.nc'
6, 46774852, 40116024, 2023-01-01, 'data/weather/Alcantara_2018_ERA-5.nc'
7, 43664596, 36926254, 2022-09-24, 'data/weather/EuroC_single_level_reanalysis_2000_2021.nc'
8, 21323940, 18974499, 2023-01-01, 'data/weather/CLBI_2016_ERA-5.nc'
9, 21323936, 19420900, 2023-01-01, 'data/weather/SpaceportAmerica_2016_ERA-5.nc'
10, 21265684, 18923920, 2023-01-01, 'data/weather/CLBI_2018_ERA-5.nc'
11, 21265684, 18904131, 2023-01-01, 'data/weather/CLBI_2017_ERA-5.nc'
12, 21265680, 19476628, 2023-01-01, 'data/weather/SpaceportAmerica_2017_ERA-5.nc'
13, 21265680, 19373762, 2023-01-01, 'data/weather/SpaceportAmerica_2015_ERA-5.nc'
14, 21265680, 18918037, 2023-01-01, 'data/weather/CLBI_2015_ERA-5.nc'
15, 13753819, 4050298, 2024-09-21, 'docs/notebooks/fins_roll.csv'
16, 12908689, 2940044, 2024-09-21, 'docs/notebooks/coeff_testing.ipynb'
17, 12894004, 3692363, 2020-03-22, 'nbks/Dispersion Sample.disp_input'
18, 12355867, 3862899, 2021-04-07, 'docs/notebooks/valetudo_dispersion/valetudo_dispersion.ipynb'
19, 11866021, 4080483, 2024-08-04, 'docs/notebooks/airbrakes_example.ipynb'
20, 10830351, 3809773, 2020-03-22, 'nbks/Getting Started - Examples.ipynb'
21, 9860124, 3218580, 2021-04-07, 'docs/notebooks/dispersion_analysis.ipynb'
22, 9802609, 123010, 2020-03-22, 'nbks/rocketpyAlpha.py'
23, 9005216, 8966518, 2022-09-24, 'data/weather/EuroC_pressure_levels_reanalysis_2002-2021.nc'
24, 8589588, 4696361, 2024-08-04, 'docs/notebooks/air_brakes_example.ipynb'
25, 8232142, 2572448, 2023-08-10, 'docs/notebooks/example_hybrid.ipynb'
26, 7849765, 2155194, 2021-04-07, 'docs/notebooks/valetudo_dispersion/Monte_carlo_valetudo.valetudo_disp_out.txt'
27, 6574758, 4701393, 2024-08-03, 'docs/notebooks/environment/environment_class_usage.ipynb'
28, 6313322, 3514941, 2023-06-28, 'docs/notebooks/example_solid.ipynb'
29, 6054970, 1531514, 2020-03-22, 'nbks/Dispersion Sample.disp_output'
30, 5635005, 3817378, 2020-03-22, 'nbks/Environment - Examples.ipynb'
31, 5080208, 4834400, 2022-04-09, 'data/weather/spaceport_america_pressure_level_reanalysis_2015_2021.nc'
32, 4976750, 2880500, 2022-06-07, 'docs/notebooks/SolidMotor_class_usage.ipynb'
33, 4929275, 3217437, 2020-03-22, 'nbks/Dispersion Analysis - Monte Carlo - Example.ipynb'
34, 4782068, 2288881, 2023-08-10, 'docs/notebooks/tank_class_usage.ipynb'
35, 4712149, 3155118, 2019-02-07, 'nbks/Environment Examples.ipynb'
36, 4299000, 215, 2024-09-21, 'docs/notebooks/tail_cL.csv'
37, 4298998, 1038596, 2024-09-21, 'docs/notebooks/tail_cQ.csv'
38, 4273149, 215, 2024-09-21, 'docs/notebooks/nose_cL.csv'
39, 4273147, 1036268, 2024-09-21, 'docs/notebooks/nose_cQ.csv'
40, 4223009, 213, 2024-09-21, 'docs/notebooks/fins_cL.csv'
41, 4223007, 1030363, 2024-09-21, 'docs/notebooks/fins_cQ.csv'
42, 4082870, 4044416, 2022-09-22, 'data/weather/EuroC_pressure_levels_reanalysis_2002_2010.nc'
43, 3418375, 1328400, 2023-08-10, 'docs/notebooks/example_liquid.ipynb'
44, 3274830, 1609382, 2020-03-22, 'nbks/Euporia.ipynb'
45, 3211198, 2261130, 2022-10-10, 'getting_started Dispersion.ipynb'
46, 2695115, 2695950, 2023-09-25, 'docs/static/trajectory-earth.png'
47, 2174231, 911552, 2022-05-19, 'getting_started.ipynb'
48, 2138702, 771662, 2023-01-01, 'data/calisto/CD Test.CSV'
49, 2109965, 836873, 2023-01-01, 'data/euporia/euporiaIDrag.csv'
50, 1933436, 965847, 2018-12-11, 'nbks/Calisto.ipynb'
51, 1525754, 51968, 2024-07-03, 'tests/test_rocket.py'
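The path list for `--paths-from-file` can be derived from the listing above. A minimal sketch, assuming each row sits on a single line and the path is the only single-quoted field, as in the listing; `paths_to_drop` is a hypothetical helper name, and the row count passed to it is only an illustration.

```shell
# paths_to_drop LISTING N: print the quoted path from the first N rows.
# Hypothetical helper; assumes rows like:  1, 46978562, 16949319, 2019-02-07, 'path'
paths_to_drop() {
    awk -F"'" -v n="$2" 'NR <= n { print $2 }' "$1"
}

# Example (N = 50 is illustrative; the final selection is a maintainer decision):
#   paths_to_drop path-deleted-sizes.csv 50 > files-i-dont-want-anymore.txt
#   git filter-repo --invert-paths --paths-from-file files-i-dont-want-anymore.txt
```

Reviewing the generated file by hand matters here, since every path left in it is removed from all commits.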
Amazing work, @aureliobarbosa ! I believe we can move forward with the git-filter-repo in order to significantly reduce the repo size (probably by half!).
I did not expect .ipynb files to be part of the "villains list", but it makes total sense! When we save notebooks with images, Jupyter embeds each image as a base64-encoded string in the .ipynb file (which is just a fancy .json), and this can consume a lot of disk space. Found another reason to migrate .ipynb to .rst files @MateusStano @phmbressan @Lucas-Prates !
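To see how much of a notebook is embedded image data, the payload lines can be measured directly. A rough sketch, assuming the notebook is saved in Jupyter's usual pretty-printed JSON, where each `image/png` payload sits on one line; `embedded_png_bytes` is a hypothetical helper and `some_notebook.ipynb` a placeholder path.

```shell
# embedded_png_bytes NOTEBOOK: total length of the lines holding "image/png" data.
# Hypothetical helper; an approximation that relies on one payload per line.
embedded_png_bytes() {
    awk '/"image\/png"/ { total += length($0) } END { print total + 0 }' "$1"
}

# Example:
#   embedded_png_bytes some_notebook.ipynb
```

Outputs can also be stripped before committing, for instance with `jupyter nbconvert --clear-output --inplace`, so that only the source cells reach the history.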
@RocketPy-Team/code-owners can you read this thread and let us know whether you agree with such an operation?
The only concern is that a few files are still being used and therefore cannot be deleted:
Something we should definitely try is to compress the .nc files! Based on my experience, there are some free tools that compress these files, usually reducing the file size by 30%.
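No specific tool is named here. One candidate, taken purely as an assumption, is `nccopy` from the netCDF utilities, which can rewrite a file with deflate compression. The sketch below wraps it in a hypothetical `compress_nc` function; the compression level is arbitrary and the achievable ratio depends on the data.

```shell
# compress_nc IN OUT: write a deflate-compressed copy of a netCDF file.
# Hypothetical wrapper around nccopy (netCDF utilities); level 5 is an arbitrary choice.
compress_nc() {
    [ -f "$1" ] || { echo "input not found: $1" >&2; return 2; }
    command -v nccopy >/dev/null 2>&1 || { echo "nccopy is not installed" >&2; return 3; }
    nccopy -d 5 -s "$1" "$2"   # -d: deflate level, -s: shuffle filter
}

# Example (placeholder paths):
#   compress_nc input.nc input_compressed.nc
#   ls -l input.nc input_compressed.nc
```

Compressing with deflate produces a netCDF-4 file, so it is worth checking that the code reading the data still opens the result before replacing any file.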
With all that being said, I guess a good summary of next steps would be:
As of now, I think your contribution is already quite beneficial for us, @aureliobarbosa ! I will discuss with the other code owners during our next weekly meeting to coordinate the best time to make step 3 happen. Feel free to work on another issue in the meantime (let me know if you need any suggestions).
Is your feature request related to a problem? Please describe.
As discussed here by @aureliobarbosa, cloning the RocketPy repo currently consumes more than 1 GB. This is probably due to large files being stored in the repository.
Describe the solution you'd like
There are a few options that we would like to explore in order to tackle this issue. For instance:
.nc and other binary files that were initially committed to this repo but at some point got deleted.
data folder.
Additional context
I don't have much experience with this, but I will try to list a few links that may help us.