docqai / docq

Private ChatGPT/Perplexity. Securely unlocks knowledge from confidential business information.
https://docqai.github.io/docq/
GNU Affero General Public License v3.0

CORE: Sophisticated PDFReader with Image and Table extraction #127

Open janaka opened 1 year ago

janaka commented 1 year ago

Current

The LlamaIndex PDFReader (part of the SimpleDirectoryReader) currently only handles simple (naive) text extraction. It uses the pypdf package: it iterates through the pages (pypdf.PdfReader.pages) and calls page.extract_text() on each one to grab the text for the document.
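For reference, the naive approach described above can be sketched roughly as follows. This is not the LlamaIndex source, just a minimal illustration; the function names `extract_page_texts` and `read_pdf_naively` are hypothetical, and only `PdfReader`, `.pages` and `extract_text()` come from pypdf.

```python
from typing import Iterable, List, Protocol


class PageLike(Protocol):
    """Anything with an extract_text() method, e.g. a pypdf page object."""

    def extract_text(self) -> str: ...


def extract_page_texts(pages: Iterable[PageLike]) -> List[str]:
    """Return the plain text of each page, in page order.

    Only the text layer is read; embedded images and table structure
    are not captured by this approach.
    """
    return [page.extract_text() or "" for page in pages]


def read_pdf_naively(path: str) -> List[str]:
    """Hypothetical helper: open a PDF with pypdf and extract text per page."""
    # Imported lazily so extract_page_texts can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return extract_page_texts(reader.pages)
```

Calling `read_pdf_naively("report.pdf")` (file name assumed) would give one string per page, which is all the retrieval index gets to see today.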

The following are ignored:

We should be able to improve retrieval by extracting the information present in these components.

Solution

Fork the standard LlamaIndex PDFReader and customise it. Look into the various LlamaIndex image readers.
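As a starting point, a customised reader could collect embedded images alongside the text. The sketch below is an assumption about how that might look, not an existing LlamaIndex API: `extract_text_and_images` and `read_pdf_with_images` are hypothetical names, and it relies on pypdf's `page.images` property (each image exposing `name` and `data`). Tables are not handled here and would need a separate approach.

```python
from typing import Any, Dict, Iterable, List


def extract_text_and_images(pages: Iterable[Any]) -> List[Dict[str, Any]]:
    """Build one record per page holding its text and its embedded images."""
    records: List[Dict[str, Any]] = []
    for page_number, page in enumerate(pages, start=1):
        records.append(
            {
                "page": page_number,
                "text": page.extract_text() or "",
                # pypdf exposes embedded images with a name and raw bytes.
                "images": [(image.name, image.data) for image in page.images],
            }
        )
    return records


def read_pdf_with_images(path: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: open a PDF with pypdf and extract text plus images."""
    # Imported lazily so the function above can be tested without pypdf.
    from pypdf import PdfReader

    return extract_text_and_images(PdfReader(path).pages)
```

The image bytes could then be passed to one of the LlamaIndex image readers (or an OCR/captioning step) so that their content becomes searchable text.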

Alternatives

Use readers from Unstructured.io
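A rough idea of what the Unstructured route might look like, assuming the `unstructured` package with its PDF extras is installed. `partition_pdf` and the `category`/`text` attributes on elements are from that library; the helper names `group_by_category` and `partition_with_unstructured`, and the choice of `strategy="hi_res"`, are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def group_by_category(elements: Iterable[Any]) -> Dict[str, List[str]]:
    """Bucket element text by category label (e.g. "Table", "NarrativeText")."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for element in elements:
        grouped[element.category].append(element.text)
    return dict(grouped)


def partition_with_unstructured(path: str) -> Dict[str, List[str]]:
    """Hypothetical helper: partition a PDF and group the results by category."""
    # Imported lazily; requires the optional `unstructured` PDF dependencies.
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(
        filename=path,
        strategy="hi_res",  # layout-aware strategy (assumed choice)
        infer_table_structure=True,  # ask for table structure where detected
    )
    return group_by_category(elements)
```

Grouping by category would let us index tables and image-derived text separately from the narrative text.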

janaka commented 1 year ago

Also look into LayoutPDFReader by LLMSherpa