UB-Mannheim / zotero-ocr

Zotero Plugin for OCR
GNU Affero General Public License v3.0

Build a front end for OCRmyPDF (development suggestion) #75

Closed: e42mercury closed this issue 2 months ago

e42mercury commented 5 months ago

I love that I can use this directly in Zotero, but the plugin has few options and sometimes creates really huge files. A lot of the feature requests and errors people are asking about here have already been addressed by another program, OCRmyPDF. It is based on the same Tesseract engine but also handles rotation, compression, languages... really an excellent program all around. Instead of rewriting a scaled-down version of the same program as a Zotero plugin, I'd suggest a plugin that is just a front end for OCRmyPDF: a window inside Zotero that would help you 1) install everything and 2) choose among the many options in OCRmyPDF, described in its excellent documentation, while it runs in the background.

The only problem is that OCRmyPDF is command line only. So it would definitely be a contribution to the user community if someone made it usable from a Zotero window, or as a standalone GUI. Just a suggestion.
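To make the idea concrete, here is a minimal sketch of what such a front end would do behind its window: turn a few user choices into an OCRmyPDF command line. The function name `build_ocrmypdf_command` and its parameters are hypothetical and not part of zotero-ocr or OCRmyPDF. The sketch assumes `ocrmypdf` is installed and on the PATH, and it uses only a handful of the documented options (`--language`, `--optimize`, `--rotate-pages`, `--deskew`).

```python
import shlex
from typing import List, Sequence


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    languages: Sequence[str] = ("eng",),
    rotate_pages: bool = False,
    deskew: bool = False,
    optimize: int = 1,
) -> List[str]:
    """Build an argument list for the ocrmypdf CLI (hypothetical helper)."""
    if not languages:
        raise ValueError("at least one language code is required")
    if not 0 <= optimize <= 3:
        raise ValueError("optimize must be between 0 and 3")

    # Several Tesseract language codes are joined with '+', e.g. "eng+deu".
    cmd = [
        "ocrmypdf",
        "--language", "+".join(languages),
        "--optimize", str(optimize),
    ]
    if rotate_pages:
        cmd.append("--rotate-pages")  # fix pages that are rotated incorrectly
    if deskew:
        cmd.append("--deskew")  # straighten slightly skewed scans
    cmd += [input_pdf, output_pdf]
    return cmd


if __name__ == "__main__":
    cmd = build_ocrmypdf_command(
        "scan.pdf", "scan_ocr.pdf", languages=["eng", "deu"], rotate_pages=True
    )
    # Show the command a GUI would run; subprocess.run(cmd, check=True)
    # would actually execute it in the background.
    print(shlex.join(cmd))
```

Returning a list of arguments rather than a shell string avoids quoting problems with file names that contain spaces, which are common for attachments in a Zotero library.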

stweil commented 5 months ago

We know OCRmyPDF very well (see for example https://ocr-bw.bib.uni-mannheim.de/anwendung/druckwerke/). OCRmyPDF is a useful tool for those who need its many options and features. But it would make the Zotero OCR plugin much larger and more complex. I like small solutions which are simple to handle for the users. Therefore I am not sure that supporting OCRmyPDF would be a good idea.

The large PDF files which the current Tesseract produces are also a well-known problem (see issue #42). I think that it must be addressed in the Tesseract code. The latest Tesseract release 5.4.0 improved the text positions in the generated PDF and makes the files a little bit smaller, but much greater size reductions are possible. https://github.com/tesseract-ocr/tesseract/pull/4171 shows an example. My favourite approach is better compression of the PDF code and using image formats like JPEG 2000 with high compression inside of the PDF file.

e42mercury commented 5 months ago

OK, that makes sense. Thanks for the links on the latest improvements to Tesseract. I've just updated, so I'll try it out. I'll keep using OCRmyPDF, but I also love having the OCR plugin inside Zotero. Thanks!