OCR pdf - Githubissues

Leah9 commented 2 years ago

OCR the pdf for easier searching

Leah9 commented 2 years ago

I have looked at doing this, and my current conclusion is that it is too complex to add to this project at the moment, but I will leave it open as a feature request in case anyone else would like to have a go. In the meantime, the generated PDF can be OCR'd with ocrmypdf: https://ocrmypdf.readthedocs.io/en/latest/index.html It is not a casual installation, but it is very quick once it is working.

ArielMAJ commented 2 years ago

I only skimmed the provided link. Could you explain a little what you'd like to do? From what I understood, OCR would make the screenshots in the PDFs behave like actual text (become selectable, etc.). Is that what you'd like to add to this tool?

ArielMAJ commented 2 years ago

I'm not sure how licenses work, so take what I'm about to say with a grain of salt. From the looks of it, we'd need Ghostscript to use ocrmypdf. Ghostscript is under the AGPL license, meaning it would probably force us into a GPL-style license as well (?). I don't like that license much, as it forces a lot of things onto the developer (quite annoying to use and restrictive 😢). You should give it some thought and do a little research before deciding to add it to this project.

I'll try to install and test ocrmypdf a little. If I find some easy way to use it I'll let you know.

ArielMAJ commented 2 years ago

Using Chocolatey, installing the requirements on Windows ends up being pretty easy (any other way felt really complicated and was pretty stressful for me). After installing Chocolatey (which is just one copy-paste away once you're in an administrative shell), all you need to do is run these two commands in a terminal with administrator privileges:

Then pip install ocrmypdf.

Once everything is set up, the following code should already apply the text overlay onto the pdf:

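import ocrmypdf

if __name__ == '__main__':  # guard needed on Windows, where ocrmypdf starts worker processes
    ocrmypdf.ocr('Binder.pdf', 'Binder.pdf', deskew=True)  # same input and output: OCR in place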
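
If overwriting Binder.pdf in place feels risky, a variant could write the OCR'd copy next to the original instead. This is only a rough sketch: the helper names `ocr_output_path` and `add_text_layer` and the `_ocr` suffix are made up for illustration and are not part of ocrmypdf, and it assumes ocrmypdf, Tesseract and Ghostscript are already installed.

```python
from pathlib import Path


def ocr_output_path(pdf_path):
    """Return a sibling path such as Binder_ocr.pdf (hypothetical naming scheme)."""
    pdf_path = Path(pdf_path)
    return pdf_path.with_name(pdf_path.stem + "_ocr" + pdf_path.suffix)


def add_text_layer(pdf_path):
    """OCR pdf_path into a separate file and return the path of the new file."""
    # Imported lazily so this module still loads when ocrmypdf is not installed.
    import ocrmypdf

    output = ocr_output_path(pdf_path)
    # Same call as above, but the original file is left untouched.
    ocrmypdf.ocr(str(pdf_path), str(output), deskew=True)
    return output
```

Calling add_text_layer('Binder.pdf') would then be expected to leave Binder.pdf as it is and create Binder_ocr.pdf beside it. On Windows the call should still sit under an `if __name__ == '__main__':` guard.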

Would you like to add it to this project? We could check whether the user has Tesseract and Ghostscript installed, either once on first launch (saving the result to a txt file) or every time we try to add the overlay. We could maybe also add a menu bar to activate/deactivate the text overlay option.
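
The dependency check could be sketched roughly like this, using only the standard library. The function names and the list of executable names are assumptions on my part (Ghostscript is usually `gs` on Linux/macOS and `gswin64c` or `gswin32c` on Windows), so they would need checking against a real install.

```python
import shutil

# Executable names are assumptions, not taken from this project:
# Ghostscript is typically "gs" on Linux/macOS and "gswin64c"/"gswin32c" on Windows.
REQUIRED_TOOLS = {
    "tesseract": ("tesseract",),
    "ghostscript": ("gs", "gswin64c", "gswin32c"),
}


def missing_ocr_tools(which=shutil.which):
    """Return the names of required tools for which no executable is found on PATH."""
    missing = []
    for tool, candidates in REQUIRED_TOOLS.items():
        # A tool counts as present if any of its candidate executables resolves.
        if not any(which(name) for name in candidates):
            missing.append(tool)
    return missing


def ocr_available(which=shutil.which):
    """True when every required tool was found."""
    return not missing_ocr_tools(which)
```

The menu option for the text overlay could then be enabled only when ocr_available() returns True, and the list from missing_ocr_tools() could be shown to the user otherwise. The `which` parameter is there only so the lookup can be swapped out in tests.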

Leah9 / screengrab

OCR pdf #3