Can we add a toggle for enabling full-page OCR via Tesseract when pymupdf is installed? I hacked around the vendored library in my local virtualenv and changed readers.py to something like the following, which makes it work, but an upstream solution would be better:
def parse_pdf_fitz(# ...
    # ...
    for i in range(file.page_count):
        page = file.load_page(i)
        tp = page.get_textpage_ocr(dpi=300, full=True)
        page_text = page.get_text(textpage=tp, sort=True)
        # print(page_text)
        split += page_text
        pages.append(str(i + 1))
    # ...
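For what it's worth, here is a rough sketch of how the toggle could look if the OCR path were opt-in rather than hard-coded. The names `read_page_text`, `extract_pages`, `use_ocr` and `dpi` are my own placeholders, not anything that exists in the library today, and I'm assuming Tesseract is installed and discoverable by PyMuPDF whenever `use_ocr=True`:

```python
def read_page_text(page, use_ocr=False, dpi=300):
    """Return the text of one PyMuPDF page.

    `use_ocr` and `dpi` are hypothetical option names for this sketch.
    """
    if use_ocr:
        # Full-page OCR: render the page at `dpi` and run Tesseract over it.
        textpage = page.get_textpage_ocr(dpi=dpi, full=True)
        return page.get_text(textpage=textpage, sort=True)
    # Default path: plain text extraction, no OCR.
    return page.get_text(sort=True)


def extract_pages(path, use_ocr=False, dpi=300):
    """Return a list of (page_label, text) pairs for the PDF at `path`."""
    import fitz  # PyMuPDF, imported lazily so it stays an optional dependency

    with fitz.open(path) as doc:
        return [
            (str(i + 1), read_page_text(doc.load_page(i), use_ocr=use_ocr, dpi=dpi))
            for i in range(doc.page_count)
        ]
```

With something like this, `extract_pages("scan.pdf", use_ocr=True)` would take the OCR route, while the default call keeps today's behaviour. OCR at 300 dpi is noticeably slower than plain extraction, which is why I'd suggest leaving it off by default.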