Selecting multiple languages for OCR

cyanfish / naps2

Scan documents to PDF and more, as simply as possible.

https://www.naps2.com

Selecting multiple languages for OCR #305

Closed vivadavid closed 6 months ago

vivadavid commented 8 months ago

Hi,

I wanted to suggest the possibility of selecting more than one language for the OCR engine, which would help with multilingual documents. The way it works now, you can only select one language at a time.

On a separate note, I wanted to ask a question (I apologize if this is explained somewhere else and I couldn't find the information). When you open a PDF document and then apply OCR to it, is the OCR added as a new layer on the document with no further changes made to it, or is a completely new PDF generated with an inevitable reduction in the quality of the original?

Thanks!

cyanfish commented 8 months ago

OCR works as a new layer. Image editing (e.g. rotation, crop) is what you want to avoid to keep the original quality.

vivadavid commented 8 months ago

> OCR works as a new layer. Image editing (e.g. rotation, crop) is what you want to avoid to keep the original quality.

Great to know, thanks!

Also, could I suggest adding the Tesseract version in the Releases section on GitHub (when a new version is included) and also in the About section of the programme? I'm currently not sure which Tesseract version is included. Version 5.3.4 was recently released, though the Mannheim binaries are still on 5.3.3.

Thanks again!

cyanfish commented 8 months ago

I don't update Tesseract often as changes rarely affect the functionality NAPS2 uses. You can check the version used here.

vivadavid commented 8 months ago

> I don't update Tesseract often as changes rarely affect the functionality NAPS2 uses. You can check the version used here.

Thanks for the information: it's currently on 5.2.0, as I can see. It'd be nice to have the latest version, but I understand it must take time to update it. However, I'd like to point out that 5.3.3 included a fix for an issue that can affect the quality of the OCR:

https://github.com/tesseract-ocr/tesseract/issues/4014
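One way to check whether a given Tesseract build is at least 5.3.3 (the release the comment above says contains the fix) is to parse the first line of `tesseract --version`. This is only a sketch: the function names are hypothetical, and it assumes the first line looks like `tesseract 5.3.3` or `tesseract v5.3.3...`.

```python
import re

def parse_tesseract_version(version_output):
    """Extract (major, minor, patch) from `tesseract --version` output.

    Assumes the first line looks like 'tesseract 5.3.3' or
    'tesseract v5.3.3.20231005' (an assumption about the output format).
    """
    first_line = version_output.splitlines()[0] if version_output else ""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", first_line)
    if match is None:
        raise ValueError("could not find a Tesseract version in: %r" % first_line)
    return tuple(int(part) for part in match.groups())

def includes_fix(version_output, fixed_in=(5, 3, 3)):
    """Return True if the reported version is at least `fixed_in`."""
    return parse_tesseract_version(version_output) >= fixed_in

if __name__ == "__main__":
    # Sample strings only; real output would come from running the binary.
    print(includes_fix("tesseract 5.2.0"))
    print(includes_fix("tesseract 5.3.4"))
```

The version string itself could be obtained with `subprocess.run(["tesseract", "--version"], capture_output=True, text=True)`, assuming a `tesseract` binary is on the PATH.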

cyanfish commented 8 months ago

Thanks for pointing that out, I'll update that for the next NAPS2 version.

Tarek-Hasan commented 7 months ago

Hi, you should check out OCRmyPDF and see whether it can be integrated with NAPS2 to help with OCR-related issues. This tool is built on Tesseract and specializes in OCR for PDFs. It supports multiple languages. It also doesn't change the resolution of the embedded images the way other PDF OCR tools do.
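For anyone who wants to try the suggestion, the OCRmyPDF command line takes its languages through `-l`, joined with `+`. The following is a minimal sketch that only builds such a command; the helper name and the file names are hypothetical, and it assumes `ocrmypdf` and the relevant Tesseract language packs are installed.

```python
def build_ocrmypdf_command(input_pdf, output_pdf, languages):
    """Build an `ocrmypdf` command line for a multilingual document.

    `languages` holds Tesseract language codes such as 'eng' or 'deu';
    they are joined with '+' for the -l option.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["ocrmypdf", "-l", "+".join(languages), input_pdf, output_pdf]

if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    command = build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf", ["eng", "deu"])
    print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the tool is installed.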

cyanfish commented 6 months ago

Multiple Languages can now be selected as an option (in the "OCR language" dropdown) in 7.4.0.

Also 7.4.0 has updated Tesseract to 5.3.4.
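For context, at the Tesseract level several languages are requested by joining their codes with `+` in the `-l` option. The sketch below shows that convention only; it is not NAPS2's code, the helper names and file names are hypothetical, and it assumes a `tesseract` binary with the listed language data installed.

```python
def build_lang_arg(languages):
    """Join Tesseract language codes with '+', e.g. ['eng', 'spa'] -> 'eng+spa'."""
    codes = [code.strip() for code in languages if code.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)

def build_tesseract_command(image_path, output_base, languages):
    """Build a command of the form: tesseract <image> <outputbase> -l <langs> pdf

    The trailing 'pdf' config asks Tesseract to write <outputbase>.pdf
    with a searchable text layer.
    """
    return ["tesseract", image_path, output_base, "-l", build_lang_arg(languages), "pdf"]

if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_tesseract_command("page.png", "page_ocr", ["eng", "spa"])))
```

Tesseract's documentation notes that the order of the languages can influence the result, so it may help to list the dominant language of the document first.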

vivadavid commented 6 months ago

Thank you for adding multiple language selection on the latest release. Appreciated!

vivadavid commented 6 months ago

> Multiple Languages can now be selected as an option (in the "OCR language" dropdown) in 7.4.0.
>
> Also 7.4.0 has updated Tesseract to 5.3.4.

I wanted to ask you a question, though: I can see no binaries for Tesseract 5.3.4 from Mannheim. Did you get the binaries from another source or did you compile them yourself?

cyanfish commented 6 months ago

I compile them myself. https://github.com/cyanfish/naps2-tesseract has the compiled binaries and my scripts, which include all the flags etc. to keep the compiled size under 5 MB.

vivadavid commented 6 months ago

Interesting: thanks!