HazyResearch / pdftotree

:evergreen_tree: A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.
MIT License

PyPI v0.5.0 sdist is missing test data #114

Open jayvdb opened 3 years ago

jayvdb commented 3 years ago
pdftotree-0.5.0/tests> ls
__init__.py  test_basic.py  test_table_detection.py

As a result, running the tests from the PyPI sdist isn't currently possible.

The usual (older) solution is to create a MANIFEST.in, which can be generated using check-manifest, but there are newer approaches and tools if you do not want another metadata file in the repo.
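
To check quickly whether a given sdist ships any test data, something like the following could work. It is only a rough sketch: the helper name `non_python_test_files` is made up for illustration, and it assumes the test data would live somewhere under tests/ as non-.py files, which may not match the repo layout exactly.

```python
import tarfile


def non_python_test_files(sdist_path, tests_dir="tests"):
    """Return non-.py files found under <root>/<tests_dir>/ in an sdist tarball.

    Hypothetical helper: assumes test data lives under the tests directory.
    """
    found = []
    with tarfile.open(sdist_path, "r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            # sdist members look like "pdftotree-0.5.0/tests/test_basic.py";
            # drop the top-level "<name>-<version>" directory before matching.
            parts = member.name.split("/")[1:]
            if parts and parts[0] == tests_dir and not member.name.endswith(".py"):
                found.append("/".join(parts))
    return sorted(found)
```

An empty list for the 0.5.0 sdist would match the ls output above. With the MANIFEST.in route, a line such as `recursive-include tests *` is the usual way to pull everything under tests/ into the sdist, though the exact pattern depends on where the data actually lives.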

lukehsiao commented 3 years ago

I see your point. One concern I would have is that the test data can be quite large. I'm not sure we actually want to bundle that together. Is there a good best practice to follow here?

jayvdb commented 3 years ago

Best practice is to distribute wheels which only include the minimum needed to use the package, and an sdist which is a version snapshot of the 'source' which includes whatever is considered useful, e.g. docs, tests, test data, etc.

lukehsiao commented 3 years ago

Ah, I see. That makes sense then.

Would you happen to have the time to put together a PR for this? If not, it may be some time before I can get around to looking into this. I'll need to read up on the trade-offs of MANIFEST.in compared to the more recent methods.