ContentMine / norma

Convert XML/SVG/PDF into normalised, sectioned, scholarly HTML
Apache License 2.0

Repository is unexpectedly large #71

Open ghost opened 7 years ago

ghost commented 7 years ago

The GitHub API gives the size of the Norma repository as 362425 KB and the AMI repository as 301415 KB.
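For anyone wanting to re-check these figures, here is a minimal sketch of how they could be read from the GitHub REST API. It assumes the public `GET /repos/{owner}/{repo}` endpoint, whose JSON response carries a `size` field in kilobytes. The helper names (`size_kb_from_payload`, `kb_to_mib`, `fetch_repo_size_kb`) are hypothetical and only for illustration.

```python
import json
import urllib.request

# Assumption: the public GitHub REST endpoint for repository metadata.
API_ROOT = "https://api.github.com/repos"


def size_kb_from_payload(payload: dict) -> int:
    """Return the 'size' field (in KB) from a repository JSON payload."""
    return int(payload["size"])


def kb_to_mib(size_kb: int) -> float:
    """Convert kilobytes to mebibytes, for easier reading."""
    return size_kb / 1024


def fetch_repo_size_kb(owner: str, repo: str) -> int:
    """Query the API (unauthenticated, so rate-limited) for one repository's size."""
    url = f"{API_ROOT}/{owner}/{repo}"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return size_kb_from_payload(json.load(response))
```

Calling `fetch_repo_size_kb("ContentMine", "norma")` would return the current figure, which may differ from the one quoted above. The reported size is an approximation maintained by GitHub and need not match the size of a fresh clone exactly.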

The recent experience of two new developers, both of whom needed to buy additional hardware in order to be able to clone and work with these repositories, suggests that new users or developers are unlikely to expect these repositories to be so large.

Ways of reducing the size of the repositories should be investigated. For instance, could the repositories' test corpora be factored out into a different module that can be shared, as a dependency, between Norma, AMI, and perhaps other modules in the AMI stack?
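As a starting point for that investigation, the sketch below totals file sizes per top-level directory of a local checkout, which would show how much of the space is taken by the test corpora compared with the sources and `.git`. The function names are hypothetical, and the script only measures the working tree plus whatever is on disk under `.git`; it says nothing about which files dominate the git history.

```python
import os
from pathlib import Path


def size_by_top_level(root: str) -> dict:
    """Sum file sizes (bytes) under each top-level entry of root.

    Files sitting directly in root are counted under the key ".".
    Symbolic links are skipped so nothing is counted twice.
    """
    root_path = Path(root)
    totals = {}
    for dirpath, _dirnames, filenames in os.walk(root_path):
        relative = Path(dirpath).relative_to(root_path)
        key = relative.parts[0] if relative.parts else "."
        for name in filenames:
            file_path = Path(dirpath) / name
            if file_path.is_symlink():
                continue
            totals[key] = totals.get(key, 0) + file_path.stat().st_size
    return totals


def largest_first(totals: dict) -> list:
    """Return (directory, bytes) pairs sorted from largest to smallest."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Running `largest_first(size_by_top_level("path/to/norma"))` on a clone would list the heaviest directories first, which is one way to decide what might be factored out into a shared module.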

(Corresponding AMI issue: https://github.com/ContentMine/ami/issues/70 .)

mdales commented 5 years ago

This is a problem when trying to create a docker image from these tools. As of today:

I can filter out test and git from going into the docker setup, but the build process for normami generates a Debian package (which, AFAICT, isn't used when running the tools) that relies on some example files from test. Locally I've just commented out building the .deb file for now.

Once filtered, only 350 MB is left to build the image (5% of the storage!). I suspect that could go down further, but just moving the test data out and then purging it from the git history would be a massive start.

Git really isn't the best place to store large test data. If you are going to do this, you at least want it in a submodule so that the main repository can remain lean. Git LFS may also be a solution here.

petermr commented 5 years ago

Agreed. The *.deb is not critical. Some people used it in the past. The appassembler script included it from way back. Yes, the test and git can be dropped as well. The test stuff needs purging anyway but that's a month of my time I suspect. Happy to talk about the best strategy when we meet. Docker will be critical to our future plans. I hope to demo it at Oxford in Sept.