-
## Current
The LlamaIndex PDFReader (part of the SimpleDirectoryReader) currently only handles simple (naive) text extraction. It uses the `pypdf` package. It iterates through pages (`pypdf.pdfreader.…
-
First of all, thanks for your lib. It helps me a lot in my everyday work.
I have a problem with a daily pdf report. Some days camelot works
properly and gives 'good text'. Others, it gives a good t…
-
We are producing tagged PDF 2.0 files that attach MathML and TeX files as associated files (AF) to **Formula** structure elements. When we also tried to validate these files against PDF/A-4, we got failures where…
-
In 2013, there was a _table extraction competition_ at the International Conference on Document Analysis and Recognition. Its organizers released a [comprehensive dataset](http://www.tamirhassan.com/d…
-
**Describe the bug**
Sometimes, when chunking is used, the `text_as_html` for Table elements is missing some of the content present in the `text` property.
Reasoning:
- Text for a table can only come fro…
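
One way to spot affected elements is to compare the two properties token by token. The following is only a minimal sketch using the standard library: `missing_from_html` and the sample strings are hypothetical and are not part of the library's API. It strips the tags from `text_as_html` and lists the whitespace-separated tokens of `text` that never appear in the HTML.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the character data of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def missing_from_html(text, text_as_html):
    """Return the tokens of `text` that do not occur in the HTML's visible text."""
    collector = _TextCollector()
    collector.feed(text_as_html)
    collector.close()
    # Join with spaces so that adjacent cells do not merge into one token.
    html_tokens = set(" ".join(collector.parts).split())
    return [tok for tok in text.split() if tok not in html_tokens]


# Hypothetical values standing in for one Table element's two properties.
text = "Name Qty Apple 3 Pear 5"
text_as_html = (
    "<table><tr><td>Name</td><td>Qty</td></tr>"
    "<tr><td>Apple</td><td>3</td></tr></table>"
)
print(missing_from_html(text, text_as_html))  # ['Pear', '5']
```

A non-empty result for a Table element would indicate that its HTML lost content relative to its text. The comparison is token-based, so it does not detect reordered cells.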
-
I am using the hi_res model locally and tried it both with and without chunking.
I also tried the chipper model via the API, but faced similar issues.
**Major issues faced by us while …
-
**My Problem**
I mainly use the pymupdf4llm framework, but I believe the root problem comes from how table extraction is performed in pymupdf. I have PDFs with tables that contain (horizontal and or…
-
**Simple Chat Application** currently allows users to upload documents in various formats—such as PDFs, Word documents, and images—and processes them using **Azure Document Intelligence** for text ext…
-
Hi,
I ran into this issue when using your package:
Sometimes the PDF has some invisible lines / rects, which interfere with the table extraction result.
I want to get a pure explicit line chart…
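
To illustrate the kind of filtering meant here, below is a rough sketch that drops invisible lines and rects before they reach table detection. It does not use the package's real object model: the dict keys `kind`, `linewidth` and `stroke`, the helper names, and the thresholds are all assumptions made for the example.

```python
def is_visible(obj, min_linewidth=0.1, background=(1.0, 1.0, 1.0)):
    """Heuristic: visible means a real stroke width and a stroke colour
    that differs from the (assumed white) page background."""
    if obj.get("linewidth", 0) < min_linewidth:
        return False  # zero-width or hairline strokes
    stroke = obj.get("stroke")
    if stroke is None:
        return False  # object is never stroked
    return tuple(stroke) != tuple(background)


def visible_only(objects, **kwargs):
    """Keep only the objects that pass the visibility heuristic."""
    return [obj for obj in objects if is_visible(obj, **kwargs)]


# Hypothetical vector objects as they might come out of a PDF page.
objects = [
    {"kind": "line", "linewidth": 1.0, "stroke": (0.0, 0.0, 0.0)},  # black line
    {"kind": "line", "linewidth": 0.0, "stroke": (0.0, 0.0, 0.0)},  # zero width
    {"kind": "rect", "linewidth": 1.0, "stroke": (1.0, 1.0, 1.0)},  # white on white
    {"kind": "rect", "linewidth": 0.5, "stroke": None},             # no stroke
]
print([obj["kind"] for obj in visible_only(objects)])  # ['line']
```

Only the explicit black line survives. A real implementation would also need to consider fill colours and objects hidden by clipping, which this sketch ignores.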
-
### Initial Checks
- [X] I confirm that I'm on the latest version
### Description
I'm trying to use the https://filimoa.github.io/open-parse/processing/parsing-tables/unitable/ support to ext…