Closed mmyrte closed 4 years ago
Good summary. I think we could have two repositories then. One for the package, and one for the rest. For me tutorials, papers etc. are in the same category.
Thanks for the suggestions!
Edit: I moved the discussion about large files to a new issue (#56).
Thanks for the suggestions and thanks for the list of large files!
- […] `/climada`. Therefore we need to radically minimise these files indeed.
- […] `/climada` […] `/data`, `/script` and `/doc/tutorial` are not disturbing the packaging, at least not for pip but from what I've seen not for conda either. For me, they can stay where they are.
- `/tests` is seemingly just a placeholder directory for the tutorial, I think it's just to demonstrate that not everything must be included in the package. I don't think we need it.

My argument for moving big files away from `/data` etc. is to make the cloning of the repo more lightweight; this would also mean deleting large files from history, i.e. rewriting it. I understand if you want to leave that particular stone unturned. We have the `climada.ethz.ch` domain/VM already, so we could use that to host (slightly) larger files.
The argument for keeping `/tests` is that we can move the testing code there - AFAIK, the CI would still run off of this repo instead of the package. I personally would appreciate the clearer separation of actual programme logic and tests, but if you want users to be able to validate the correct functioning of the software on their machines, then it's a no-go. In that case, I would consider it courteous to the user to only download an archive of test files if the tests are indeed executed.
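The idea of fetching test files only when tests are actually run could look roughly like the sketch below. The archive path under `climada.ethz.ch`, the cache location and the function name `ensure_test_data` are assumptions for illustration, not existing CLIMADA code.

```python
import tarfile
import urllib.request
from pathlib import Path

# Hypothetical archive location; no such file is known to exist.
TEST_DATA_URL = "https://climada.ethz.ch/test_data.tar.gz"
# Hypothetical cache directory on the user's machine.
CACHE_DIR = Path.home() / ".climada" / "test_data"


def ensure_test_data(url=TEST_DATA_URL, cache_dir=CACHE_DIR,
                     fetch=urllib.request.urlretrieve):
    """Download and unpack the test archive once; later calls reuse the cache."""
    cache_dir = Path(cache_dir)
    marker = cache_dir / ".complete"
    if marker.exists():
        # Archive was already downloaded and unpacked.
        return cache_dir
    cache_dir.mkdir(parents=True, exist_ok=True)
    archive = cache_dir / "archive.tar.gz"
    fetch(url, archive)
    with tarfile.open(archive) as tar:
        tar.extractall(cache_dir)
    archive.unlink()
    # Written last, so an interrupted download is retried next time.
    marker.touch()
    return cache_dir
```

A test module would call `ensure_test_data()` in its setup, so users who never run the tests never download the archive.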
About the large files:
- […] `git count-objects -vH`).
- […] `feature/supplychain`. Removing that branch reduces the repository size already to ~280 MiB. Can you comment whether we can somehow get rid of those huge files, @KasparTo?

This is a follow-up about the large files:
Here is a list of large files that are obsolete (i.e., can be safely removed because they have already been replaced), removing those reduces repository size by 110 MiB already:
```
script/applications/eca_san_salvador/San_Salvador_Risk-Copy1.ipynb
climada/hazard/test/data/cropping_test_LS.tif
dist/climada-0.0.1.tar.gz
data/F101992.v4b_web.stable_lights.avg_vis.tif.gz
climada/test/data/system/admin0.mat
climada/test/data/GLB_NatID_grid_0360as_adv_1.mat
climada/test/data/GLB_NatID_grid_0360as_adv_2.mat
data/F152007.v4b_web.stable_lights.avg_vis.tif.gz
data/F162007.v4b_web.stable_lights.avg_vis.tif.gz
data/F182012.v4c_web.stable_lights.avg_vis.tif.gz
data/demo/gdp2asset_demo_exposure.nc
```
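For reference, a list of large files across the whole history can be compiled with standard git plumbing. The Python sketch below is one possible way and not necessarily how the list above was produced; the function names are made up for illustration. It reports uncompressed blob sizes, which differ from the packed size that `git count-objects -vH` shows.

```python
import subprocess

MIB = 1024 * 1024


def parse_blob_sizes(batch_check_output, min_bytes=3 * MIB):
    """Return (size, path) pairs for blobs larger than min_bytes, largest first.

    Expects lines of the form "<type> <hash> <size> <path>".
    """
    large = []
    for line in batch_check_output.splitlines():
        parts = line.split(" ", 3)  # the path itself may contain spaces
        if len(parts) < 4 or parts[0] != "blob":
            continue
        size = int(parts[2])
        if size > min_bytes:
            large.append((size, parts[3]))
    return sorted(large, reverse=True)


def large_blobs_in_history(repo=".", min_bytes=3 * MIB):
    """List large blobs reachable from any ref of the repository at `repo`."""
    objects = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout
    sizes = subprocess.run(
        ["git", "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        cwd=repo, input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_blob_sizes(sizes, min_bytes)
```

A file that was committed in several revisions shows up once per revision, which is how obsolete notebook versions in the history become visible.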
Here is a list of the files larger than 3 MiB (except those in `feature/supplychain`) and the people who added them to the repository. Can those people please replace the files by smaller ones or give reasoning why the files have to be that large?
- `climada/hazard/test/data/Victoria_firms.csv` (bush fire hazard, originally Marine Perus, no GitHub-account)
- `data/demo/h08_gfdl-esm2m_ewembi_historical_histsoc_co2_dis_global_daily_DEMO_FR_2001_2005.nc` (@sameberenz)
- `data/demo/flddph_WaterGAP2_miroc5_historical_flopros_gev_picontrol_2000_0.1.nc` (@ingajsa)
- `data/demo/fldfrc_WaterGAP2_miroc5_historical_flopros_gev_picontrol_2000_0.1.nc` (@ingajsa)
- `climada/hazard/test/data/test_global_landslide_nowcast_20190508.tif` (@Evelyn-M)
- `climada/hazard/test/data/test_global_landslide_nowcast_20190501.tif` (@Evelyn-M)
- `climada/hazard/test/data/test_global_landslide_nowcast_20190509.tif` (@Evelyn-M)
- `data/system/ls_pr_NGI_UNEP/ls_pr.tif` (@Evelyn-M)
- `climada/hazard/test/data/nasa_global_landslide_catalog_point.dbf` (@Evelyn-M)
- `data/demo/earth_engine/landcover.tif` (@raychpistache)
- `data/demo/earth_engine/rgb_zurich.tif` (@raychpistache)

Here is a list of files where we can save some space without removing:
- `script/tutorial/5_blackmarble.ipynb` (the file might be okay, but there are several obsolete revisions in the git history)
- `script/applications/eca_san_salvador/San_Salvador_Risk.ipynb` (the file might be okay, but there are several obsolete revisions in the git history)
- `data/demo/tc_fl_1975_2011.h5` (used in several tests, could probably be reduced in size easily)
- `data/demo/atl_prob.mat` (used in many tests, could probably be reduced in size easily)
- `data/demo/WS_ERA40.mat` (never used in the code, any ideas if this is needed?)
- `data/system/GLB_NatID_grid_0360as_adv_2.mat` (has already been replaced by a smaller file of the same name)

Addressing all of the above files will bring the repository size down to less than 130 MiB which seems acceptable to me.
However, permanently removing files from a git history changes a lot of commit hashes. That's why I think we should wait for all files listed above to be discussed and replaced in all currently active branches (most important: main and develop). Then we can make one single change that removes large files once and for all (e.g. using the tool https://rtyley.github.io/bfg-repo-cleaner/). After that, we can implement a much stricter policy for large files and we should be fine in the future.
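A stricter policy could be enforced with a small check that runs in CI or before committing. The following Python sketch is only an illustration: the function name `oversized_files` is hypothetical, and the 3 MiB default is borrowed from the cut-off used for the list above, not an agreed limit.

```python
from pathlib import Path


def oversized_files(root, max_bytes=3 * 1024 * 1024):
    """Return sorted (relative_path, size) pairs for files above max_bytes.

    Anything inside a `.git` directory is ignored.
    """
    root = Path(root)
    found = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        if ".git" in relative.parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size > max_bytes:
                found.append((relative.as_posix(), size))
    return sorted(found)
```

A CI job could call `oversized_files(".")` and fail the build whenever the returned list is not empty.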
Hi Thomas, thanks for the check on the files. Should we change the files and tests using them directly on develop branch?
Hi Samuel, since we are preparing for a release, the develop branch should be handled with care. It will be frozen and merged into the main branch very soon. Double-check that your changes don't break any tests (including unit and integration tests!), preferably create a feature-branch and a PR in a first step.
Thanks a lot for the compilation and the suggestions. Sounds like a sound plan to me.
Please continue the discussion about the large files in the new issue #56. This issue here is more about changing some of the paths, maybe moving some parts of the repository to a new repository or to some other external location.
Resolution
At some point the tests should be moved out of the package directory in order to keep the installation as lightweight as possible. Apart from that, the repository structure will stay as is for the time being.
Both issue #46 (util functions) and #37 (config & constants) have referred to the topic of "where do we keep what". If we (or probably just @emanuel-schmid 😀) are going to package climada for distribution, we're going to need to reorganise the repository. According to the official docs, this is the structure a package should follow:
_I'll use `climada_python` as the root `/` from here on. Also, this is just a proposal, forgive my absolute language._
- `/data`, `/script`, and `/doc/tutorial` need to move into other GitHub repos.
- […] `/script` and `/data/demo` can be moved there as well.
- `/data/system`, as discussed in #37, should be created on the user's machine, like e.g. `$HOME/.climada`; we should probably check the licenses and host it outside a repository.
- […] `/tests`, along with all the unit tests now living in `/climada` subdirs.

I think this topic is quite involved and needs a lot of reading up on the packaging formats for pypi/wheel/setuptools and conda/conda-forge/conda-build. I only stuck my toes into it and would be glad to let @emanuel-schmid decide for all of us, because I imagine that none of us has the knowledge or the time to find best practices and implement them.
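To make the `$HOME/.climada` suggestion more concrete, here is a minimal Python sketch of creating such a directory on first use. The function name `setup_user_dir` and the idea of copying bundled default files into it are assumptions for illustration; how the system data would actually be obtained (for example from an external host) is left open.

```python
import shutil
from pathlib import Path


def setup_user_dir(default_files=(), home=None):
    """Create `<home>/.climada` and copy default files that are not there yet.

    `home` defaults to the current user's home directory. Existing files
    are never overwritten, so user modifications survive later calls.
    """
    base = Path(home) if home is not None else Path.home()
    data_dir = base / ".climada"
    data_dir.mkdir(parents=True, exist_ok=True)
    for source in default_files:
        target = data_dir / Path(source).name
        if not target.exists():
            shutil.copy2(source, target)
    return data_dir
```

Calling this once at import time or on first data access keeps the installed package itself free of system data.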