emcf / thepipe

Extract clean markdown from PDFs, URLs, Word docs, slides, videos, and more, ready for any LLM. ⚡
https://thepi.pe
MIT License
1.13k stars, 70 forks

Pytesseract error when text_only is True within GitHub Action #22

Open emcf opened 4 months ago

emcf commented 4 months ago

If Tesseract OCR is not installed correctly, image extraction with text_only=True fails with the error "tesseract is not installed or it's not in your PATH. See README file for more information." This happens whenever the Tesseract binary is missing or not on PATH, as is the case in this repo's current GitHub Actions CI setup.
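
A minimal sketch of a pre-flight check that could surface this problem earlier, before any OCR call is attempted. It uses only the standard library's shutil.which to look for the binary on PATH. The helper names tesseract_available and require_tesseract are hypothetical and are not part of thepipe or pytesseract.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the named executable can be found on PATH."""
    return shutil.which(binary) is not None


def require_tesseract(binary: str = "tesseract") -> None:
    """Raise a descriptive error if the Tesseract binary is not on PATH.

    Hypothetical helper: intended to be called before running OCR so the
    failure is reported up front rather than in the middle of extraction.
    """
    if not tesseract_available(binary):
        raise RuntimeError(
            f"'{binary}' was not found on PATH. Install Tesseract OCR "
            "or add its install directory to PATH before using text_only=True."
        )


if __name__ == "__main__":
    print("tesseract on PATH:", tesseract_available())
```

On an Ubuntu-based GitHub Actions runner, the binary is typically provided by the tesseract-ocr apt package, so adding an install step for that package before the tests run would likely resolve the CI failure. This is an assumption about the runner image, not something confirmed in this thread.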

Further discussion here: https://stackoverflow.com/questions/50951955/pytesseract-tesseractnotfound-error-tesseract-is-not-installed-or-its-not-i