DerwenAI / kglab

Graph Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, NetworkX, RAPIDS, RDFlib, pySHACL, PyVis, morph-kgc, pslpython, pyarrow, etc.
https://derwen.ai/docs/kgl/
MIT License
581 stars 66 forks source link

Source distribution does not observe the file/path excludes and is too big #294

Closed ceteri closed 1 year ago

ceteri commented 1 year ago

I'm submitting a

This subsumes #275

Currently when using either pyproject.toml or setup.py approaches to build source and wheel, the wheel for a release is ~400Kb while the source distribution is ~300Mb, which is too large for PyPi.

Note that the former was moved to wip/pyproject.toml since its use with Flit appeared to be bugging. Perhaps Poetry works better?

See the MANIFEST.in file, where there are both include and exclude statements getting used. For some reason, these rules are followed when building a wheel, but not when building a source distribution (sdist)

@Mec-iS @tomaarsen @jake-aft - do you have any suggestions about the library packaging here could be structured better for avoid this issue?

jake-aft commented 1 year ago

@ceteri i have pushed up a WIP of the poetry conversion. So there is a pyproject.toml set up for poetry, but have not got to the point to convert to build the package as yet.

ceteri commented 1 year ago

Many thanks @jake-aft

I have a hunch that the Python docs somehow change substantially between 3.8 and latest online (3.11) so that the reserved words here are different: https://docs.python.org/3/distutils/commandref.html#sdist-cmd

ceteri commented 1 year ago

For whatever reason the "egg" part of a source distribution ignores the MANIFEST.in rules and just decides on its own to include every file in the directory, recursively, which produces a sdist that's too large for PyPi to upload. This is with setup.py

I've tried to use pyproject.toml instead, however I cannot get it to follow any of the rules as advertised. We'll need to fix this, somehow.

jake-aft commented 1 year ago

@ceteri maybe its silly, but could we chmod the dirs we want to exclude and make them not readable, and then build the package? Not a great thing, but maybe for a workaround if we need to get the build out?

ceteri commented 1 year ago

Instead of directory permissions, tt'd probably be good if the SPARQL tests could pull down benchmark data from another source, instead of having that all be included in the repo.

That said, there must be ways to exclude paths and files from a source distribution. I've been testing, but there are changes in the Python build specs between 3.8 and 3.10-3.11, which is right where pyproject.toml begins to be required by pip v23+

Something between the docs' diffs and my brain and the Py tools impl just doesn't quite work as advertised? :)

ceteri commented 1 year ago

@jake-aft if poetry doesn't resolve this, i can check in with Python Software Foundation, where they've been having some "debates" regarding this topic, IIRC ...

ceteri commented 1 year ago

Found the issue, which is explained here: https://stackoverflow.com/questions/608855/excluding-a-top-level-directory-from-a-setuptools-package

TL;DR: rm -rf dist build *.egg-info prior to running a build for a new release

In our case, this results in 10:1 reduction of the source distribution. Entirely not documented, AFAICT