explosion / spacy-layout

📚 Process PDFs, Word documents and more with spaCy
MIT License

Loading a pdf results in a StopIteration error #1

Open charlescearl opened 16 hours ago

charlescearl commented 16 hours ago

Running spacy-layout on an Apple M3 Pro with 36 GB of memory, Python 3.11.7.

The following code is invoked in a Jupyter notebook:

import spacy
from spacy_layout import spaCyLayout

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)
doc = layout("a4b3a1f45daf416a950584c918f0a007.pdf")

Here a4b3a1f45daf416a950584c918f0a007.pdf is a 33-page, 1.6 MB PDF containing text, pictures, and tables.

The error:

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
Cell In[15], line 1
----> 1 doc = layout("a4b3a1f45daf416a950584c918f0a007.pdf")

File ~/github/spacy-layout-exploration/.venv/lib/python3.11/site-packages/spacy_layout/layout.py:36, in spaCyLayout.__call__(self, path)
     34 def __call__(self, path: str | Path) -> Doc:
     35     """Call parser on a path to create a spaCy Doc object."""
---> 36     result = self.converter.convert(path)
     37     inputs = []
     38     for item in result.document.texts:

File ~/github/spacy-layout-exploration/.venv/lib/python3.11/site-packages/pydantic/validate_call_decorator.py:60, in validate_call.<locals>.validate.<locals>.wrapper_function(*args, **kwargs)
     58 @functools.wraps(function)
     59 def wrapper_function(*args, **kwargs):
---> 60     return validate_call_wrapper(*args, **kwargs)

File ~/github/spacy-layout-exploration/.venv/lib/python3.11/site-packages/pydantic/_internal/_validate_call.py:96, in ValidateCallWrapper.__call__(self, *args, **kwargs)
     95 def __call__(self, *args: Any, **kwargs: Any) -> Any:
---> 96     res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
     97     if self.__return_pydantic_validator__:
     98         return self.__return_pydantic_validator__(res)

File ~/github/spacy-layout-exploration/.venv/lib/python3.11/site-packages/docling/document_converter.py:161, in DocumentConverter.convert(self, source, raises_on_error, max_num_pages, max_file_size)
    146 @validate_call(config=ConfigDict(strict=True))
    147 def convert(
    148     self,
   (...)
    152     max_file_size: int = sys.maxsize,
    153 ) -> ConversionResult:
    155     all_res = self.convert_all(
    156         source=[source],
    157         raises_on_error=raises_on_error,
    158         max_num_pages=max_num_pages,
    159         max_file_size=max_file_size,
    160     )
--> 161     return next(all_res)

StopIteration:

ines commented 4 hours ago

Thanks for trying it out! This looks like it's triggered by Docling, so if the following snippet causes the same error, maybe you could raise it on their tracker?

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("a4b3a1f45daf416a950584c918f0a007.pdf")
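For context on the traceback: it ends at `return next(all_res)`, and a bare `StopIteration` there suggests that the generator returned by `convert_all` produced no results for this file. The sketch below reproduces that failure mode without Docling. The `convert_all`, `convert_strict`, and `convert_safe` functions are hypothetical stand-ins, and the "skip anything that is not a .pdf" rule is an assumption for illustration only, not a description of how Docling decides what to skip.

```python
from typing import Iterable, Iterator, Optional


def convert_all(sources: Iterable[str]) -> Iterator[str]:
    """Hypothetical stand-in: yields a result only for inputs it accepts."""
    for source in sources:
        if source.endswith(".pdf"):
            yield f"converted:{source}"
        # Rejected inputs are skipped silently, so nothing is yielded for them.


def convert_strict(source: str) -> str:
    """Same pattern as the last traceback frame: next() with no default."""
    return next(convert_all([source]))


def convert_safe(source: str) -> Optional[str]:
    """Passing a default to next() turns an empty generator into None."""
    return next(convert_all([source]), None)


if __name__ == "__main__":
    print(convert_strict("report.pdf"))
    print(convert_safe("report.txt"))
    try:
        convert_strict("report.txt")
    except StopIteration:
        print("StopIteration: the generator yielded nothing")
```

If the Docling-only snippet above fails the same way, that would point to the input being dropped before any result is yielded, which is useful detail to include in an upstream report.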