DS4SD / docling-parse

Simple package to extract text with coordinates from programmatic PDFs
MIT License

missing sdist in pypi index #24

Closed: shubhbapna closed this issue 1 month ago

shubhbapna commented 2 months ago

Would it be possible to start publishing sdists for docling-parse along with the wheels on PyPI? We are trying to package Docling for InstructLab and need the sources to build it.

dolfim-ibm commented 2 months ago

As a test, the sdist for docling-parse==1.3.0 was pushed manually. Next, we will test doing it with CI.

tiran commented 2 months ago

The sdist is incomplete. You either have to create a MANIFEST.in to include additional resources or use setuptools-scm. I can help you tomorrow. Once everything is set up correctly, it should be as easy as installing build and then running python3 -m build -s to create a source distribution through the PEP 517 interface.

I recommend that you create an sdist in your CI/CD pipeline, then generate the wheels from the sdist. This process ensures that your sdist is correct and wheels can be built from the sdist. The build tool uses this approach, too.
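
The sdist-first flow described above could be scripted roughly as follows. This is a minimal sketch, not the project's actual CI: it assumes the build and pip packages are installed in the current interpreter, and the helper names (sdist_command, wheel_command, build_release) and the dist output directory are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def sdist_command(outdir: str = "dist") -> list[str]:
    """Command that builds only the source distribution (same as `-s`)."""
    return [sys.executable, "-m", "build", "--sdist", "--outdir", outdir]


def wheel_command(sdist: Path, outdir: str = "dist") -> list[str]:
    """Command that builds a wheel from the sdist archive, not the checkout."""
    return [sys.executable, "-m", "pip", "wheel", "--no-deps",
            "--wheel-dir", outdir, str(sdist)]


def build_release(outdir: str = "dist") -> None:
    """Build the sdist first, then build the wheel from that sdist."""
    subprocess.run(sdist_command(outdir), check=True)
    sdists = sorted(Path(outdir).glob("*.tar.gz"))
    if len(sdists) != 1:
        # Assumption: the output directory starts out empty.
        raise RuntimeError(f"expected exactly one sdist, found {len(sdists)}")
    # If the sdist is missing sources, this step fails in CI, before upload.
    subprocess.run(wheel_command(sdists[0], outdir), check=True)
```

Calling build_release() from the project root would leave both artifacts in dist. Running python3 -m build with no options behaves similarly: it builds the sdist and then builds the wheel from it.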

For InstructLab, I have completely automated the process of assigning the version based on the git tag, creating the sdist and wheel, signing the payload with sigstore, and finally uploading the sdist and wheels to PyPI with PyPI's trusted publishing process. The team creates a release on GitHub and the rest of the release process is done by CI/CD.

tiran commented 2 months ago

Disclaimer: I'm not familiar with poetry.

According to Poetry's docs, the presence of include disables VCS auto-detection of includes, so you have to specify all includes manually. Poetry's syntax differs from that of MANIFEST.in. This should do the trick (untested):

--- a/pyproject.toml
+++ b/pyproject.toml
@@ -29,7 +29,15 @@ readme = "README.md"
 packages = [{include = "docling_parse"}]
 include = [
     {path = "docling_parse/*.so", format = "wheel"},
-    {path = "docling_parse/pdf_resources", format = "wheel"}
+    {path = "docling_parse/pdf_resources", format = "wheel"},
+    {path = "CMakeLists.txt", format = "sdist"},
+    {path = "*.md", format = "sdist"},
+    {path = "poetry.lock", format = "sdist"},
+    {path = "app/*.cpp", format = "sdist"},
+    {path = "cmake/", format = "sdist"},
+    {path = "app/", format = "sdist"},
+    {path = "src/", format = "sdist"},
+    {path = "tests/", format = "sdist"},
 ]
 build = "build.py"

The recommended approach is to build an sdist in CI/CD first, then build the wheels from the sdist instead of from a checkout.
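
Since the diff above is untested, one way to check its effect is to list the contents of the generated sdist and compare them against the paths the build needs. The sketch below is illustrative only: the function name missing_from_sdist is hypothetical, and it assumes the usual sdist layout in which every file sits under a single top-level name-version directory.

```python
import tarfile


def missing_from_sdist(sdist_path: str, required: list[str]) -> list[str]:
    """Return the required entries that are absent from the sdist archive.

    An entry ending in "/" is treated as a directory and matches if any
    archived file lives under it; other entries must match a file exactly.
    """
    with tarfile.open(sdist_path, "r:gz") as archive:
        # Members are stored as "<name>-<version>/<relative path>"; drop the
        # top-level directory so paths are relative to the project root.
        names = [m.name.split("/", 1)[1] for m in archive.getmembers()
                 if m.isfile() and "/" in m.name]
    missing = []
    for entry in required:
        if entry.endswith("/"):
            found = any(n.startswith(entry) for n in names)
        else:
            found = entry in names
        if not found:
            missing.append(entry)
    return missing
```

For example, calling it on the built archive with ["CMakeLists.txt", "cmake/", "app/", "src/", "tests/", "poetry.lock"], the paths taken from the diff, should return an empty list once the includes are picked up correctly.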

PeterStaar-IBM commented 2 months ago

@tiran great, we will have a look! We really appreciate the pointers, and we will be able to replicate this easily for deepsearch-glm.