-
This is an amazing project, and the document extraction model works really well. I would love to propose an integration between RAGFlow and Indexify - https://getindexify.ai
Indexify is an Apache …
-
In https://github.com/freelawproject/courtlistener/issues/3469, we found that if a PDF is uploaded multiple times almost simultaneously, the PDF can be extracted and saved multiple times unnecessarily…
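
One way to guard against this kind of duplicate work is to key each extraction on a hash of the file contents. The sketch below is illustrative only and is not taken from the CourtListener codebase: `extract_once` and the `extract` callback are hypothetical names, and the in-memory dictionary with a thread lock assumes a single process (a multi-process deployment would likely need a database constraint or a shared lock instead).

```python
import hashlib
import threading

# Hypothetical in-process cache: SHA-256 hex digest -> extracted text.
_lock = threading.Lock()
_extracted = {}


def extract_once(pdf_bytes, extract):
    """Run `extract` at most once per distinct PDF content.

    `extract` is a caller-supplied function (an assumption for this
    sketch) that takes the raw PDF bytes and returns the extracted text.
    """
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    # Holding the lock while extracting serializes all extractions.
    # That keeps the sketch simple at the cost of throughput.
    with _lock:
        if digest not in _extracted:
            _extracted[digest] = extract(pdf_bytes)
        return _extracted[digest]
```

Concurrent uploads of the same bytes then trigger a single extraction, and later callers receive the stored result.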
-
**Bug report**
I'm working on a PDF parsing project.
I have created an AI model that finds and extracts all the tables in a PDF. Now I just need a way to get the raw text without layout and tables…
-
This is a continuation of a discussion posted [here](https://github.com/jsvine/pdfplumber/discussions/911); please see it for more details.
## Describe the bug
When the pdf has overlapping columns (…
-
Used Docker and Grobid 0.8.0, performing full text extraction from the following PDF:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10125888/pdf/10.1177_23328584231165919.pdf
XML fragment of the …
-
## Objective
Create a custom Tesseract model using pytesseract to improve extraction from PDF files. Compare the results with basic extraction using PyMuPDF or PyPDF2.
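
The comparison step could be scored with a simple text-similarity ratio. This is a minimal sketch using only the standard library: `similarity` is a hypothetical helper, and it assumes the OCR output and the basic extraction output are already available as plain strings (no OCR or PDF library is called here).

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two extracted texts.

    Whitespace is collapsed first so that differences in line breaks
    and spacing between extractors do not dominate the score.
    """
    a_norm = " ".join(a.split())
    b_norm = " ".join(b.split())
    return difflib.SequenceMatcher(None, a_norm, b_norm).ratio()
```

A ratio close to 1.0 means the two extractions largely agree. It does not say which one is correct, so a hand-checked reference text would still be needed to measure accuracy.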
## Key Features
- [ ] own model is t…
-
Hi Everyone,
I've been using Pdfminer for the last few months, and I really think it's a very helpful codebase.
But recently I noticed that clipping paths do not seem to be implemented, I inspected:…
-
**Issue**
A vertically oriented Chinese document returns no extracted text.
**Code to reproduce**
```
from llama_parse import LlamaParse
from llama_parse.utils import (
nest_asyncio_er…
-
I am using Camelot for table extraction in PDF documents, which generally works well for my needs. However, I've encountered a recurring issue where the first and last rows of tables cause problems du…
-
I have noticed an issue with pdfminer.
It returns different results each time for my PDF doc. This is my code:
```
import requests
from io import BytesIO
from pdfminer import high_level
d…