Filimoa / open-parse

Improved file parsing for LLM’s
https://filimoa.github.io/open-parse/
MIT License
2.55k stars 100 forks

No nodes are extracted from some PDFs #85

Open faileon opened 1 week ago

faileon commented 1 week ago

Description

I've noticed that when I split my PDF via Firefox to get a smaller PDF (e.g. the first 10 pages), openparse won't extract any nodes. The original PDF is extracted fine.

When I specify table_args, the parser does return some nodes, but all of them are identified as tables.

I am attaching the PDF; perhaps someone could take a look at what's wrong: concept-vp4360-cz.pdf
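
The comparison described above could be reproduced with something like the sketch below. It is not the reporter's code. It assumes that `openparse.DocumentParser().parse(path)` returns a document with a `.nodes` list, that each node exposes a `variant` set of labels such as `"text"` or `"table"`, and that `"pymupdf"` is an accepted `parsing_algorithm` in `table_args`. The helper names `variant_counts` and `compare_parsers` are hypothetical.

```python
from collections import Counter


def variant_counts(nodes):
    """Count how many nodes carry each variant label (e.g. text, table)."""
    counts = Counter()
    for node in nodes:
        variant = getattr(node, "variant", None)
        # Assumption: `variant` is a set of labels; a bare string is accepted too.
        if isinstance(variant, str):
            labels = [variant]
        else:
            labels = sorted(variant or ["unknown"])
        counts.update(labels)
    return dict(counts)


def compare_parsers(pdf_path):
    """Parse one PDF with and without table_args and summarise the nodes."""
    # Imported lazily so variant_counts can be used without open-parse installed.
    import openparse

    default_doc = openparse.DocumentParser().parse(pdf_path)
    table_doc = openparse.DocumentParser(
        # Assumption: "pymupdf" is one of the supported table parsing algorithms.
        table_args={"parsing_algorithm": "pymupdf"}
    ).parse(pdf_path)

    return {
        "default": variant_counts(default_doc.nodes),
        "with_table_args": variant_counts(table_doc.nodes),
    }
```

Calling `compare_parsers("concept-vp4360-cz.pdf")` on a local copy of the attachment should show whether the default parser yields an empty summary while the `table_args` run reports only table nodes, which would match the behaviour described here.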

Example Code

No response

Python, open-parse & OS Version

python_version: 3.12.7
operating_system: Linux
os_version: 6.11.8-arch1-2
open-parse version: 0.7.0
python version: 3.12.7 (main, Oct  1 2024, 11:15:50) [GCC 14.2.1 20240910]
platform: Linux-6.11.8-arch1-2-x86_64-with-glibc2.40
related packages: torchvision-0.20.1 tokenizers-0.20.3 torch-2.5.1 pydantic-2.9.2 PyMuPDF-1.24.13 transformers-4.46.2