Support for multiple OCR languages for parallel conversion of documents in different languages

freedomofpress / dangerzone

Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs

https://dangerzone.rocks/

GNU Affero General Public License v3.0

Support for multiple OCR languages for parallel conversion of documents in different languages #403

Open sudwhiwdh opened 1 year ago

sudwhiwdh commented 1 year ago

How can I select the appropriate OCR languages in the application interface if I have documents in different languages at the same time?

And related to the screenshot: Why is there a comma between document and language?

Evidence / Notes (added by @deeplow)

During user testing, one user raised the possibility of a single document containing multiple languages.

apyrgio commented 1 year ago

I'm afraid that neither the GUI nor the CLI offers such an option. The workaround for now would be to batch the documents for each language into separate CLI invocations (i.e., a different --ocr-lang <lang> parameter for each batch).
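The batching workaround can be sketched in a few lines of Python. This is only an illustration: the --ocr-lang flag comes from the comment above, while the executable name `dangerzone-cli`, the assumption that it accepts several filenames per invocation, the language codes, and the helper names `batch_by_language` and `build_commands` are all assumptions made for the example. The sketch only builds the argument lists and does not run anything.

```python
from collections import defaultdict


def batch_by_language(doc_languages):
    """Group document paths by their OCR language code."""
    batches = defaultdict(list)
    for path, lang in doc_languages.items():
        batches[lang].append(str(path))
    return dict(batches)


def build_commands(doc_languages, cli="dangerzone-cli"):
    """Return one argument list per language (hypothetical helper).

    Assumes the CLI accepts several filenames in a single invocation.
    """
    commands = []
    for lang, paths in sorted(batch_by_language(doc_languages).items()):
        commands.append([cli, "--ocr-lang", lang, *sorted(paths)])
    return commands


if __name__ == "__main__":
    # Placeholder file names and language codes, for illustration only.
    docs = {"report.pdf": "eng", "brief.pdf": "fra", "memo.docx": "eng"}
    for cmd in build_commands(docs):
        print(" ".join(cmd))
```

Each resulting list could then be passed to `subprocess.run()`, one invocation per language.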

As for the typo you found, that's interesting. It was there since the first release of Dangerzone. We'll fix it though, thanks.

sudwhiwdh commented 1 year ago

> As for the typo you found, that's interesting. It was there since the first release of Dangerzone. We'll fix it though, thanks.

https://github.com/freedomofpress/dangerzone/pull/404

deeplow commented 1 year ago

> How can I select the appropriate OCR languages in the application interface if I have documents in different languages at the same time?

Totally a potential issue. Furthermore, what if the document is in multiple languages? (This question surfaced during user testing two weeks ago at the International Journalism Festival.)

The whole OCR thing will need some re-thinking. One thing we realized as well is that users may not know in advance which language the document is in before opening it.

deeplow commented 1 year ago

And not everybody is familiar with the term OCR, but they are much more familiar with the idea of making a document searchable. So maybe the text should instead be: make document searchable.

As for the multi-lingual situation, there's a technical limitation. The way we do OCR, the tool (tesseract) needs to know the language. Not sure it can autodetect.

deeplow commented 1 year ago

> As for the multi-lingual situation, there's a technical limitation. The way we do OCR, the tool (tesseract) needs to know the language. Not sure it can autodetect.

After a bit of research, while it looks like Tesseract itself isn't equipped to detect document languages, there are some models (like langid -- which no longer seems to be supported) that do it. We could theoretically run this auto-detection on every page.

But I wonder if having mixed languages will be technically complicated, because even on the same page there may be multiple languages. So I'm wondering if we'll need document layout analysis first, and then apply OCR to each section separately. Food for thought. If that's the case, then OCR for documents with multiple languages may be more of a long-term goal.
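To make the per-page idea concrete, here is a rough sketch of running a detector on every page and then keeping the languages that cover a meaningful share of the document. It assumes some text is already available for each page, which is a big assumption for scanned documents. The detector is passed in as a plain function so that something like langid could be plugged in; `toy_detect` is a deliberately naive, hypothetical stand-in, and `min_share` is an arbitrary illustrative threshold. This handles one language per page only, so it does not address the mixed-language page problem described above.

```python
from collections import Counter


def detect_per_page(page_texts, detect):
    """Run a language detector on every page; returns one code per page."""
    return [detect(text) for text in page_texts]


def dominant_languages(page_langs, min_share=0.2):
    """Keep languages detected on at least `min_share` of the pages."""
    counts = Counter(page_langs)
    total = len(page_langs)
    return sorted(lang for lang, n in counts.items() if n / total >= min_share)


def toy_detect(text):
    """Hypothetical stand-in for a real detector: flags a few French words."""
    words = set(text.lower().split())
    return "fra" if words & {"le", "la", "les", "et"} else "eng"


if __name__ == "__main__":
    pages = ["the cat and the dog", "le chat et le chien", "the end"]
    langs = detect_per_page(pages, toy_detect)
    print(langs, dominant_languages(langs))
```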

deeplow commented 1 year ago

Today I learned that Tesseract OCR has multilingual support:

https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html#using-multiple-languages
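For reference, the linked Tesseract documentation describes joining language codes with a plus sign in the -l option (for example, -l eng+deu). A minimal sketch of building such a command line follows. The helper name `tesseract_command`, the file names, and the chosen languages are placeholders, and this is not how Dangerzone currently invokes Tesseract.

```python
import shlex


def tesseract_command(image, output_base, languages):
    """Build a tesseract argument list with several OCR languages.

    Languages are joined with '+', as in `-l eng+deu`.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", image, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    # Placeholder file names and language codes, for illustration only.
    cmd = tesseract_command("page.png", "page", ["eng", "deu"])
    print(shlex.join(cmd))
```

The matching traineddata files for every listed language would need to be installed for such a command to work.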