-
The New York scraper, which reads from a PDF (https://doccs.ny.gov/system/files/documents/2022/01/copy-of-incarceratedindividualdailycovid_table_forpio-2022.01.27_0.pdf), starts on line 10 -- likely a cropping issue.…
-
`ObjectExtractorStreamEngine.java` contains [this code](https://github.com/tabulapdf/tabula-java/blob/adb7738c87f0019cf95519ff37b58e4d4992c51d/src/main/java/technology/tabula/ObjectExtractorStreamEngi…
-
Currently, text extraction adds roughly 10x overhead to parsing a PDF. To optimize it, we can:
1. generate a C++ parser, possibly by adding support for any primitives not yet supported;
2. optimize the …
-
While testing, I ran into a case where `table.accuracy` is a negative number.
PDF: [page-3.pdf](https://github.com/camelot-dev/camelot/files/3455388/page-3.pdf)
Code:
```
tables=camelot.read_pdf('…
-
**My Problem**
I mainly use the pymupdf4llm framework, but I believe the root problem comes from how table extraction is performed in pymupdf. I have PDFs with tables that contain (horizontal and or…
-
**Describe the bug**
Sometimes, when using chunking, the `text_as_html` for Table elements is missing some of the content compared to the `text` property.
Reasoning:
- Text for a table can only come fro…
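
One way to spot the mismatch described above is to strip the tags from `text_as_html` and check which tokens of `text` are absent. The following is a minimal sketch using only the Python standard library; `html_to_tokens` and `missing_from_html` are hypothetical helper names, not part of any library's API, and the sample strings are made up for illustration.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the character data found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_tokens(html):
    """Return the whitespace-separated tokens of the text inside `html`."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()  # flush any buffered character data
    return " ".join(collector.parts).split()


def missing_from_html(text, text_as_html):
    """Return the tokens of `text` that never appear in `text_as_html`."""
    html_tokens = set(html_to_tokens(text_as_html))
    return [tok for tok in text.split() if tok not in html_tokens]


# Illustrative (made-up) values: the HTML lost the last table row.
text = "Name Qty Apple 3 Pear 5"
html = (
    "<table><tr><td>Name</td><td>Qty</td></tr>"
    "<tr><td>Apple</td><td>3</td></tr></table>"
)
print(missing_from_html(text, html))  # ['Pear', '5']
```

The comparison is token-based and ignores order and repetition, so it only flags content that is absent from the HTML entirely; it would not catch a cell that is duplicated or misplaced.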
-
- Knowledge mapping: combination of different existing R packages / resources to visualise the link between concepts in scientific texts based on NLP neural nets
- Table extraction: R script to extra…
-
memary currently parses the agents' responses, which are stored in a .txt file, before inserting them into our knowledge graphs.
As we look to support agentic systems running real-world tasks, our…
-
**Describe the bug**
I am evaluating the UnstructuredClient for processing PDF documents and am encountering an issue with Greek-language text extraction. When I attempt to extract text from PDF …
-
## Current
The LlamaIndex PDFReader (part of the SimpleDirectoryReader) currently only handles simple (naive) text extraction. It uses the `pypdf` package. It iterates through pages (`pypdf.pdfreader.…