The LlamaIndex PDFReader (used by the SimpleDirectoryReader for PDF files) currently performs only simple (naive) text extraction. It uses the pypdf package: it iterates through the pages (pypdf.PdfReader.pages) and calls page.extract_text() on each one to get the text for the document.
The following are ignored:
Images
Tables
PDF Metadata
Document structure such as headings
We should be able to improve retrieval by extracting the information present in these components.
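For reference, the current behaviour amounts to roughly the loop below. This is a minimal sketch, not the actual LlamaIndex source: the function names `extract_page_texts` and `read_pdf` and the record layout are assumptions made for illustration, and a stand-in reader object is used so the example runs without a PDF on disk.

```python
from types import SimpleNamespace


def extract_page_texts(reader):
    """Return one record per page using plain text extraction only.

    `reader` is expected to look like a pypdf.PdfReader: it exposes `.pages`,
    and each page has an `extract_text()` method.
    """
    records = []
    for index, page in enumerate(reader.pages):
        # Only the text layer is read here; images, tables, PDF metadata,
        # and heading structure are not captured.
        text = page.extract_text() or ""
        records.append({"page_label": str(index + 1), "text": text})
    return records


def read_pdf(path):
    """Open a real PDF with pypdf (third-party) and extract its page texts."""
    import pypdf  # imported lazily so the helper above has no hard dependency

    return extract_page_texts(pypdf.PdfReader(path))


# Stand-in reader (hypothetical data) so the sketch runs without a PDF file.
fake_reader = SimpleNamespace(
    pages=[
        SimpleNamespace(extract_text=lambda: "Page one text"),
        SimpleNamespace(extract_text=lambda: "Page two text"),
    ]
)
print(extract_page_texts(fake_reader))
```

With a real file, `read_pdf("some.pdf")` would return the same kind of list, one entry per page.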
Solution
Fork the standard LlamaIndex PDFReader and customise it. Look into the various LlamaIndex Image readers.
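As a possible first step for the fork, the sketch below attaches PDF metadata to every page record. It assumes the reader object exposes `.metadata` with `title`, `author`, and `subject` attributes, as pypdf's `PdfReader.metadata` does (it may also be `None`). The function name `extract_with_metadata` and the output layout are hypothetical; in a real fork each record would presumably become a LlamaIndex Document. Images, tables, and headings are not handled here.

```python
from types import SimpleNamespace


def extract_with_metadata(reader):
    """Extract page text and attach document-level PDF metadata to each page.

    `reader` is expected to look like a pypdf.PdfReader (`.pages`, `.metadata`).
    """
    shared = {}
    doc_info = reader.metadata  # can be None when the PDF has no info dictionary
    if doc_info is not None:
        for key in ("title", "author", "subject"):
            value = getattr(doc_info, key, None)
            if value:  # skip missing or empty fields
                shared[key] = value

    records = []
    for index, page in enumerate(reader.pages):
        # Copy the shared metadata so each record owns its own dict.
        metadata = dict(shared, page_label=str(index + 1))
        records.append({"text": page.extract_text() or "", "metadata": metadata})
    return records


# Stand-in reader (hypothetical data) so the sketch runs without a PDF file.
fake_reader = SimpleNamespace(
    metadata=SimpleNamespace(title="Annual Report", author=None, subject=""),
    pages=[SimpleNamespace(extract_text=lambda: "Revenue grew.")],
)
print(extract_with_metadata(fake_reader))
```

Carrying the title and author on every page record means they remain available for filtering or for inclusion in the text that gets embedded, which is one way the extra information could help retrieval.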
Alternatives
Use readers from Unstructured.io
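A rough sketch of what this alternative could look like, assuming the `unstructured` package's `partition_pdf` function, which returns typed elements (for example titles, narrative text, and tables) that each carry a `category` and `text`. The helper names `group_by_category` and `partition_and_group` are hypothetical, and stand-in elements are used so the example runs without the package installed.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_category(elements):
    """Group element texts by their category (e.g. Title, NarrativeText, Table)."""
    grouped = defaultdict(list)
    for element in elements:
        grouped[element.category].append(element.text)
    return dict(grouped)


def partition_and_group(path):
    """Partition a real PDF with unstructured (third-party) and group the result."""
    from unstructured.partition.pdf import partition_pdf  # lazy import

    return group_by_category(partition_pdf(filename=path))


# Stand-in elements (hypothetical data) so the sketch runs without a PDF file.
fake_elements = [
    SimpleNamespace(category="Title", text="1. Introduction"),
    SimpleNamespace(category="NarrativeText", text="This report covers 2023."),
    SimpleNamespace(category="Title", text="2. Results"),
]
print(group_by_category(fake_elements))
```

Because headings and tables arrive as separate typed elements, this route would expose the document structure that plain page-level text extraction leaves out.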