RocketPy-Team / RocketPy

Next generation High-Power Rocketry 6-DOF Trajectory Simulation
https://docs.rocketpy.org/
MIT License

MNT: reduce repo size #727

Open Gui-FernandesBR opened 3 days ago

Gui-FernandesBR commented 3 days ago

Is your feature request related to a problem? Please describe.

As discussed here by @aureliobarbosa, cloning the RocketPy repo currently consumes more than 1 GB. This is probably due to large files being stored in the repo.

Describe the solution you'd like

There are a few options that we would like to explore in order to tackle this issue. For instance:

  1. Delete old, unused files from the git history. This could include .nc and other binary files that were initially committed to this repo but were deleted at some point.
  2. Use Git Large File Storage (Git LFS) to store files that are too heavy (>10 MB), especially those in the data folder.
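As a rough way to see which files in the working tree would cross the 10 MB line mentioned in option 2, a short standard-library Python sketch could look like the following. The function name `find_large_files` and its `min_bytes` parameter are made up for illustration. It only inspects files currently on disk, not the git history.

```python
from pathlib import Path


def find_large_files(root, min_bytes=10 * 1024 * 1024):
    """Return (size_in_bytes, relative_path) pairs for files larger than min_bytes.

    Hypothetical helper: only looks at the working tree, largest files first.
    """
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        # Skip git's own object store; it is not part of the tracked tree.
        if ".git" in relative.parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size > min_bytes:
                hits.append((size, relative.as_posix()))
    return sorted(hits, reverse=True)
```

Calling `find_large_files("data")` from the repository root would list the candidates for Git LFS under that assumption.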

Additional context

I don't have much experience with this, but I will try to list a few links that may help us.

aureliobarbosa commented 3 days ago

Thanks for keeping the issue alive. I am still interested in applying Git Large File Storage to this repo, but before diving in I decided to do my "homework": I spent the day looking into your documentation and trying different install procedures. You can assign me the task if you wish!

Regards

Gui-FernandesBR commented 3 days ago

Nice job!

I'm also trying to read more about git filter-branch and git bfg. I don't know if I will manage to find time to actually use them on this repo; for now I'm studying the tools more than using them. But if I make any progress I will let you know.

I have used git LFS in this repo, maybe there's something in that repo that could help us.

aureliobarbosa commented 3 days ago

EDIT: migrating to git-lfs on GitHub is better described by GitHub here.

Currently the 'data' folder has the following distribution of files:

# All Files
RocketPy on master is 📦 v1.6.1 via 🐍 v3.12.3 
❯ du -h --max-depth=0 data
162M    data

# .csv files
RocketPy on master is 📦 v1.6.1 via 🐍 v3.12.3 
❯ find -type f -name "*.csv" -exec du -ch {} + | grep 'total' | awk '{print $1}'
20M

# .nc files
RocketPy on master is 📦 v1.6.1 via 🐍 v3.12.3 
❯ find -type f -name "*.nc" -exec du -ch {} + | grep 'total' | awk '{print $1}'
142M

There is also about 1 MB of .csv files and some tiny .nc files in both the tests (fixtures) and docs folders. It looks like a good case for tracking those files with git-lfs.
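The same per-extension breakdown can be sketched in Python for anyone without GNU `find` and `du` at hand. The helper name `size_by_extension` is hypothetical. It sums apparent file sizes rather than disk blocks, so totals may differ slightly from `du`, and it lowercases extensions so that `.CSV` and `.csv` are grouped together, unlike the case-sensitive `-name "*.csv"` pattern above.

```python
from collections import defaultdict
from pathlib import Path


def size_by_extension(root):
    """Sum file sizes in bytes under root, grouped by lowercase extension.

    Hypothetical helper; returns a dict ordered from largest to smallest group.
    """
    totals = defaultdict(int)
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            totals[path.suffix.lower() or "(none)"] += path.stat().st_size
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))
```

Running it on the data folder would give one total per extension, which can then be compared against the `du` figures.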

Git LFS official repo has a tutorial on how to migrate a repository.

Problems I envision to implement this:

Alternative:

Those were the investigations for today. As soon as I implement git-lfs on my repo I will bring you the numbers and more discussion, if needed.

aureliobarbosa commented 2 days ago

I only just read your comment. I will look into the tools you mentioned. The paper repo seems to be a different case (did you migrate?). I think the main problem is doing the migration and coordinating with everyone else to use git-lfs.

Gui-FernandesBR commented 2 days ago

@aureliobarbosa I have to say I like the idea of trying the alternative solution first, and then we can try git LFS. Everything in the data folder is now "definitive", and we will probably never need to restore the files we deleted.

aureliobarbosa commented 1 day ago

Hey @Gui-FernandesBR,

I agree with you about trying to clean the git history. In this direction, I evaluated tools for cleaning the git history and found that git-filter-repo seems to be a better solution. It has an option to analyze the size of previously deleted files, of folders, and of files by extension (inside the git history). Since I am assuming that you are going to keep versions of data files inside the repository, I opted to investigate the sizes by file, sorted from largest to smallest. Below is a snapshot of the CSV file I generated (I will send it to the team via Discord).

Contrary to initial expectations, only a few big CSV files are stored in the git history. The villains include .nc files, as expected, and notebooks (which store data inside them, of course...). The first big .py file is number 22 on the list and has about 9 MB (the first version of RocketPy?). The second .py file on the list is number 51. By excluding 49 files on this list you would reduce the size of your git repo by 660 MB. It is important to remember that these files have been deleted from the tip of the main branch.

Considering this, my recommendation would be to delete only those 49 files, since this seems the simplest action to take. After installing git-filter-repo, this operation can easily be done by listing the undesired files in a single text file and running:

git filter-repo --invert-paths --paths-from-file files-i-dont-want-anymore.txt

Note that it DOES rewrite the git history, and all developers would need to clone the repository again. I recommend doing this 'surgery' once you finish all the PRs you expect to merge before the next minor version.
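One possible way to build that paths file from the size listing below is sketched here in Python. This is only an illustration: the column meaning (rank, unpacked bytes, packed bytes, deletion date, path) is assumed from the listing, and the names `parse_deleted_sizes`, `select_paths`, `write_paths_file` and the `keep` parameter are hypothetical. It writes one path per line, which as far as I understand is the form `--paths-from-file` reads.

```python
from pathlib import Path


def parse_deleted_sizes(lines):
    """Parse lines like "3, 46902996, 40188650, 2023-01-01, 'data/x.nc'".

    Assumed columns: rank, unpacked bytes, packed bytes, deletion date, path.
    """
    rows = []
    for line in lines:
        # Split at most 4 times so commas inside a path stay in the path.
        parts = [part.strip() for part in line.split(",", 4)]
        if len(parts) != 5 or not parts[0].isdigit():
            continue  # skip blank lines, headers, or wrapped fragments
        rank, unpacked, packed, deleted, path = parts
        rows.append({
            "rank": int(rank),
            "unpacked": int(unpacked),
            "packed": int(packed),
            "deleted": deleted,
            "path": path.strip("'"),
        })
    return rows


def select_paths(rows, keep=()):
    """Return the paths to drop, skipping any path listed in keep."""
    keep = set(keep)
    return [row["path"] for row in rows if row["path"] not in keep]


def write_paths_file(paths, filename):
    """Write one path per line to filename."""
    Path(filename).write_text("\n".join(paths) + "\n", encoding="utf-8")
```

The resulting list is worth reviewing by hand before running the filter, since the rewrite cannot easily be undone once pushed.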


mycode/projects/rocketpy-dev
❯ more path-deleted-sizes.csv 
1, 46978562, 16949319, 2019-02-07, 'docs/sampleDispersionDataReader.ipynb'
2, 46978562, 16949319, 2019-02-07, 'disp/sampleDispersionDataReader.ipynb'
3, 46902996, 40188650, 2023-01-01, 'data/weather/Alcantara_2016_ERA-5.nc'
4, 46774852, 40263054, 2023-01-01, 'data/weather/Alcantara_2017_ERA-5.nc'
5, 46774852, 40126497, 2023-01-01, 'data/weather/Alcantara_2015_ERA-5.nc'
6, 46774852, 40116024, 2023-01-01, 'data/weather/Alcantara_2018_ERA-5.nc'
7, 43664596, 36926254, 2022-09-24, 'data/weather/EuroC_single_level_reanalysis_2000_2021.nc'
8, 21323940, 18974499, 2023-01-01, 'data/weather/CLBI_2016_ERA-5.nc'
9, 21323936, 19420900, 2023-01-01, 'data/weather/SpaceportAmerica_2016_ERA-5.nc'
10, 21265684, 18923920, 2023-01-01, 'data/weather/CLBI_2018_ERA-5.nc'
11, 21265684, 18904131, 2023-01-01, 'data/weather/CLBI_2017_ERA-5.nc'
12, 21265680, 19476628, 2023-01-01, 'data/weather/SpaceportAmerica_2017_ERA-5.nc'
13, 21265680, 19373762, 2023-01-01, 'data/weather/SpaceportAmerica_2015_ERA-5.nc'
14, 21265680, 18918037, 2023-01-01, 'data/weather/CLBI_2015_ERA-5.nc'
15, 13753819, 4050298, 2024-09-21, 'docs/notebooks/fins_roll.csv'
16, 12908689, 2940044, 2024-09-21, 'docs/notebooks/coeff_testing.ipynb'
17, 12894004, 3692363, 2020-03-22, 'nbks/Dispersion Sample.disp_input'
18, 12355867, 3862899, 2021-04-07, 'docs/notebooks/valetudo_dispersion/valetudo_dispersion.ipynb'
19, 11866021, 4080483, 2024-08-04, 'docs/notebooks/airbrakes_example.ipynb'
20, 10830351, 3809773, 2020-03-22, 'nbks/Getting Started - Examples.ipynb'
21, 9860124, 3218580, 2021-04-07, 'docs/notebooks/dispersion_analysis.ipynb'
22, 9802609, 123010, 2020-03-22, 'nbks/rocketpyAlpha.py'
23, 9005216, 8966518, 2022-09-24, 'data/weather/EuroC_pressure_levels_reanalysis_2002-2021.nc'
24, 8589588, 4696361, 2024-08-04, 'docs/notebooks/air_brakes_example.ipynb'
25, 8232142, 2572448, 2023-08-10, 'docs/notebooks/example_hybrid.ipynb'
26, 7849765, 2155194, 2021-04-07, 'docs/notebooks/valetudo_dispersion/Monte_carlo_valetudo.valetudo_disp_out.txt'
27, 6574758, 4701393, 2024-08-03, 'docs/notebooks/environment/environment_class_usage.ipynb'
28, 6313322, 3514941, 2023-06-28, 'docs/notebooks/example_solid.ipynb'
29, 6054970, 1531514, 2020-03-22, 'nbks/Dispersion Sample.disp_output'
30, 5635005, 3817378, 2020-03-22, 'nbks/Environment - Examples.ipynb'
31, 5080208, 4834400, 2022-04-09, 'data/weather/spaceport_america_pressure_level_reanalysis_2015_2021.nc'
32, 4976750, 2880500, 2022-06-07, 'docs/notebooks/SolidMotor_class_usage.ipynb'
33, 4929275, 3217437, 2020-03-22, 'nbks/Dispersion Analysis - Monte Carlo - Example.ipynb'
34, 4782068, 2288881, 2023-08-10, 'docs/notebooks/tank_class_usage.ipynb'
35, 4712149, 3155118, 2019-02-07, 'nbks/Environment Examples.ipynb'
36, 4299000, 215, 2024-09-21, 'docs/notebooks/tail_cL.csv'
37, 4298998, 1038596, 2024-09-21, 'docs/notebooks/tail_cQ.csv'
38, 4273149, 215, 2024-09-21, 'docs/notebooks/nose_cL.csv'
39, 4273147, 1036268, 2024-09-21, 'docs/notebooks/nose_cQ.csv'
40, 4223009, 213, 2024-09-21, 'docs/notebooks/fins_cL.csv'
41, 4223007, 1030363, 2024-09-21, 'docs/notebooks/fins_cQ.csv'
42, 4082870, 4044416, 2022-09-22, 'data/weather/EuroC_pressure_levels_reanalysis_2002_2010.nc'
43, 3418375, 1328400, 2023-08-10, 'docs/notebooks/example_liquid.ipynb'
44, 3274830, 1609382, 2020-03-22, 'nbks/Euporia.ipynb'
45, 3211198, 2261130, 2022-10-10, 'getting_started Dispersion.ipynb'
46, 2695115, 2695950, 2023-09-25, 'docs/static/trajectory-earth.png'
47, 2174231, 911552, 2022-05-19, 'getting_started.ipynb'
48, 2138702, 771662, 2023-01-01, 'data/calisto/CD Test.CSV'
49, 2109965, 836873, 2023-01-01, 'data/euporia/euporiaIDrag.csv'
50, 1933436, 965847, 2018-12-11, 'nbks/Calisto.ipynb'
51, 1525754, 51968, 2024-07-03, 'tests/test_rocket.py'

Gui-FernandesBR commented 1 day ago

Amazing work, @aureliobarbosa ! I believe we can move forward with the git-filter-repo in order to significantly reduce the repo size (probably by half!).

I did not imagine that .ipynb files would also be part of the "villains list", but it makes total sense! When we save notebooks with images, each image is embedded in the .ipynb file (which is just a fancy .json) as a base64-encoded string, and this can consume a lot of disk space. Found another reason to migrate .ipynb to .rst files @MateusStano @phmbressan @Lucas-Prates !
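To get a feeling for how much of a notebook is stored output, here is a minimal Python sketch. It assumes the notebook follows the nbformat 4 layout (a top-level "cells" list whose code cells have an "outputs" list), and the function names `strip_outputs` and `output_share` are hypothetical.

```python
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with all code-cell outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via JSON round-trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def output_share(notebook):
    """Fraction of the serialized notebook size taken up by cell outputs."""
    full = len(json.dumps(notebook))
    bare = len(json.dumps(strip_outputs(notebook)))
    return (full - bare) / full
```

A notebook loaded with `json.load` can be passed to `output_share` directly. A value close to 1 would suggest that almost all of the file is embedded output rather than code or text.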

@RocketPy-Team/code-owners can you read this thread and let us know whether you agree with this operation?

The only concern is that a few files are still being used, therefore cannot be deleted:

Something we should definitely try is to compress the .nc files! Based on my experience, there are some free tools that compress these files, usually reducing the file size by 30%.


With all that being said, I guess a good summary of next steps would be:

  1. Finish all PRs that are currently open. I think we can "pause" new developments for a few weeks and aim to finish what we started.
  2. [optional] -> I think we should worry about stashes and local branches that each developer may currently have.
  3. Run the git-filter-repo command to recreate the git history.
  4. Start using git LFS to store .nc and other large files.

As of now, I think your contribution is already quite beneficial for us, @aureliobarbosa ! I will discuss with the other code owners during our next weekly meetings to coordinate the best time to make step 3 happen. Feel free to work on another issue in the meantime (let me know if you need any suggestions).