OCR-D / ocrd_all

Master repository which includes most other OCR-D repositories as submodules
MIT License

recursive clone of ocrd_fileformat bloats docker with assets #359

Closed (bertsky closed this issue 1 year ago)

bertsky commented 1 year ago

https://github.com/OCR-D/ocrd_all/blob/d8830fb2daacc5c90347091c52f9e2ed574e41c9/Makefile#L356

This causes all docker images to contain a complete checkout of the assets test data repository. (Twice even, because of the .git index.)

bertsky commented 1 year ago

Correction: 4 times:

/build/.git/modules/ocrd_fileformat/modules/ocr-fileformat/modules/vendor/page-to-alto/modules/repo/assets
/build/.git/modules/ocrd_fileformat/modules/assets
/build/ocrd_fileformat/repo/ocr-fileformat/vendor/page-to-alto/repo/assets
/build/ocrd_fileformat/repo/assets

bertsky commented 1 year ago

Note: all of these are complete checkouts. They sum up to 630 MB.
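
A measurement like this can be reproduced with a short script. The following is only a sketch: the function names `dir_size` and `find_assets_copies` are made up for illustration, and it assumes every copy lives in a directory literally named assets, as in the four paths listed above.

```python
import os
import sys
from pathlib import Path


def dir_size(path: Path) -> int:
    """Total size in bytes of all regular files below path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = Path(root) / name
            if not fp.is_symlink():
                total += fp.stat().st_size
    return total


def find_assets_copies(top: Path, name: str = "assets") -> dict[Path, int]:
    """Map every directory called `name` below top to its size in bytes."""
    copies = {}
    for root, dirs, _files in os.walk(top):
        if name in dirs:
            found = Path(root) / name
            copies[found] = dir_size(found)
            # Do not descend into a copy that was already measured.
            dirs.remove(name)
    return copies


if __name__ == "__main__":
    # Hypothetical usage: pass the build directory, e.g. /build
    top = Path(sys.argv[1] if len(sys.argv) > 1 else ".")
    copies = find_assets_copies(top)
    for path, size in sorted(copies.items()):
        print(f"{size / 1e6:8.1f} MB  {path}")
    print(f"{sum(copies.values()) / 1e6:8.1f} MB  total")
```

Because os.walk also visits hidden directories, the copies under .git/modules are counted along with the working-tree checkouts.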

kba commented 1 year ago

Good point. The size and the multiple instances of assets have been irking me for a while.

Ideally, we should reduce the size of the repo, i.e. the 20MB TIFF in dfki-testdata and the size of the git index. Perhaps we should move away from integrating assets as a submodule and use GH releases in the make assets recipes.

At the very least, we should not add the assets to the already large docker image...

bertsky commented 1 year ago

Well, for now, at least for the Docker build, adding **/assets/* to .dockerignore saves about 1 GB.
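
To see which paths that pattern is meant to catch, here is a rough approximation in Python. It is not Docker's actual matcher, and `ignored_by_assets_rule` is a hypothetical helper. It assumes that ** matches any number of leading directories (including none) and that excluding an entry inside an assets directory also excludes everything below that entry.

```python
from pathlib import PurePosixPath


def ignored_by_assets_rule(rel_path: str) -> bool:
    """Approximate the effect of the .dockerignore pattern **/assets/*.

    rel_path is relative to the build context. A path counts as ignored
    when some parent component is named "assets"; the assets directory
    itself is not matched by the pattern, only what lies inside it.
    """
    parts = PurePosixPath(rel_path).parts
    return "assets" in parts[:-1]


if __name__ == "__main__":
    # Example paths modelled on the checkouts listed earlier in this thread;
    # the file names below them are invented for illustration.
    for p in [
        "ocrd_fileformat/repo/assets/data/page.tif",
        ".git/modules/ocrd_fileformat/modules/assets/objects/pack.idx",
        "ocrd_fileformat/repo/assets",
        "ocrd_fileformat/Makefile",
    ]:
        print(ignored_by_assets_rule(p), p)
```

Under these assumptions the rule covers both the working-tree checkouts and the copies under .git/modules, since both end in a directory named assets.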

For native installation, I am not sure how we can avoid duplicating the checkout, though. Yes, using GH releases or artifacts for make test could be one way.