Future-House / paper-qa

High accuracy RAG for answering questions from scientific documents with citations
Apache License 2.0

Support for full PDF "image text" OCR in pymupdf #184

Open kvnxiao opened 1 year ago

kvnxiao commented 1 year ago

Can we add a toggle to enable full-page OCR via Tesseract when pymupdf is installed? I hacked the vendored library in my local virtualenv and changed readers.py to something like the following, which works, but an upstream solution would be better:


def parse_pdf_fitz(# ...
# ...
    for i in range(file.page_count):
        page = file.load_page(i)
        # OCR the whole rendered page with Tesseract instead of reading embedded text
        tp = page.get_textpage_ocr(dpi=300, full=True)
        page_text = page.get_text(textpage=tp, sort=True)
        split += page_text
        pages.append(str(i + 1))
# ...
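
For illustration, the requested toggle could look roughly like the sketch below. It is not the paper-qa implementation: the function names `extract_page_texts` and `read_pdf` and the `use_ocr` parameter are hypothetical, and the only PyMuPDF calls used are the ones from the snippet above (`load_page`, `get_textpage_ocr`, `get_text`) plus `fitz.open`. The OCR path assumes Tesseract and its language data are installed.

```python
from typing import Any, List, Tuple


def extract_page_texts(
    doc: Any, use_ocr: bool = False, dpi: int = 300
) -> List[Tuple[str, str]]:
    """Return (page_label, text) pairs for a PyMuPDF-style document.

    `doc` only needs `page_count` and `load_page(i)`, as fitz.Document has.
    `use_ocr` is a hypothetical toggle, not an existing paper-qa option.
    """
    results = []
    for i in range(doc.page_count):
        page = doc.load_page(i)
        if use_ocr:
            # full=True asks PyMuPDF to OCR the entire rendered page.
            textpage = page.get_textpage_ocr(dpi=dpi, full=True)
            text = page.get_text(textpage=textpage, sort=True)
        else:
            # Default path: read the text layer embedded in the PDF.
            text = page.get_text(sort=True)
        results.append((str(i + 1), text))
    return results


def read_pdf(path: str, use_ocr: bool = False, dpi: int = 300):
    """Open a PDF with PyMuPDF and extract text, optionally via OCR."""
    import fitz  # PyMuPDF; imported lazily so it remains an optional dependency

    with fitz.open(path) as doc:
        return extract_page_texts(doc, use_ocr=use_ocr, dpi=dpi)
```

Keeping OCR off by default preserves the current behavior for PDFs that already have a text layer, since rendering and recognizing every page at 300 dpi is considerably slower.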
Snikch63200 commented 1 month ago

Hello,

In my experience, EasyOCR (https://github.com/JaidedAI/EasyOCR) performs much better than Tesseract (but uses more resources...)
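
A rough sketch of how EasyOCR could be swapped in for a single page, assuming the page is rendered to PNG with PyMuPDF's `get_pixmap` and passed to EasyOCR's `Reader.readtext`. The helper names `ocr_page_with_reader` and `make_easyocr_reader` are hypothetical, and nothing here is part of paper-qa.

```python
from typing import Any, List, Sequence


def ocr_page_with_reader(page: Any, reader: Any, dpi: int = 300) -> str:
    """OCR one PyMuPDF page with an EasyOCR-style reader (hypothetical helper)."""
    # Render the page to an in-memory PNG at the requested resolution.
    pix = page.get_pixmap(dpi=dpi)
    png_bytes = pix.tobytes("png")
    # detail=0 makes readtext return plain strings instead of boxes and scores.
    lines: List[str] = reader.readtext(png_bytes, detail=0)
    return "\n".join(lines)


def make_easyocr_reader(languages: Sequence[str] = ("en",), gpu: bool = False):
    """Build an EasyOCR reader; imported lazily so EasyOCR stays optional."""
    import easyocr

    return easyocr.Reader(list(languages), gpu=gpu)
```

Creating the reader once and reusing it across pages avoids reloading the recognition models for every page.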