Open sudwhiwdh opened 1 year ago
I'm afraid that neither the GUI nor the CLI offers such an option. The workaround for now is to batch the documents for each language into separate CLI invocations (i.e., a different --ocr-lang <lang> parameter for each batch).
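To make the batching workaround concrete, here is a minimal sketch that groups documents by language and builds one command line per language. It only constructs the argument lists and does not run anything. The executable name `dangerzone-cli`, the Tesseract-style language codes (`eng`, `fra`), the ability to pass several filenames in one invocation, and the helper name `build_invocations` are all assumptions for illustration; only the `--ocr-lang` flag comes from the comment above.

```python
from collections import defaultdict


def build_invocations(doc_langs, cli="dangerzone-cli"):
    """Group documents by OCR language and build one command per language.

    doc_langs maps a document path to its language code. The CLI name and
    the language-code format are assumptions, not confirmed by this thread.
    """
    by_lang = defaultdict(list)
    for path, lang in doc_langs.items():
        by_lang[lang].append(str(path))

    # One argument list per language, sorted for a stable order.
    return [
        [cli, "--ocr-lang", lang, *sorted(paths)]
        for lang, paths in sorted(by_lang.items())
    ]


if __name__ == "__main__":
    docs = {"report.pdf": "eng", "lettre.pdf": "fra", "memo.pdf": "eng"}
    for cmd in build_invocations(docs):
        print(" ".join(cmd))
```

Each resulting list could be handed to `subprocess.run` once the exact command name and language codes have been checked against your installed version.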
As for the typo you found, that's interesting. It has been there since the first release of Dangerzone. We'll fix it though, thanks.
How can I select the appropriate OCR languages in the application interface if I have documents in different languages at the same time?
Totally a potential issue. Furthermore, what if the document is in multiple languages? (This question surfaced during user testing two weeks ago at the International Journalism Festival.)
The whole OCR thing will need some re-thinking. One thing we realized as well is that users may not know in advance which language the document is in before opening it.
And not everybody is familiar with the term OCR, but they are much more familiar with the idea of making a document searchable. So maybe the text should instead be: "make document searchable".
As for the multi-lingual situation, there's a technical limitation. The way we do OCR, the tool (tesseract) needs to know the language. Not sure it can autodetect.
After a bit of research, while it looks like Tesseract itself isn't equipped to detect document languages, there are some models (like langid -- which seems no longer supported) that do it. We could theoretically run this auto-detection on every page.
But I wonder if mixed languages will be technically complicated, because even a single page may contain multiple languages. So I'm wondering if we'll need document layout analysis first, and then apply OCR to each section with its own language. Food for thought. If that's the case, then OCR for documents with multiple languages may be more of a long-term goal.
Today I learned that Tesseract OCR has multilingual support:
https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html#using-multiple-languages
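As a rough illustration of the multi-language syntax described on that page, Tesseract accepts several languages joined with `+` in its `-l` option (for example `-l eng+deu`). The sketch below only builds such a command line. The helper name `tesseract_command` and the file names are hypothetical, and this does not show how Dangerzone itself invokes Tesseract.

```python
def tesseract_command(image, output_base, languages):
    """Build a tesseract command line that OCRs with several languages.

    Languages are joined with '+', as in `tesseract in.png out -l eng+deu`.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", image, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(tesseract_command("page.png", "page", ["eng", "deu"])))
```

The corresponding language data files need to be installed for each code, so any UI that exposes this would likely still have to let the user pick the set of languages.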
Related to the screenshot: why is there a comma between "document" and "language"?
Evidence / Notes (added by @deeplow)
During user testing, one user raised the possibility of having a multilingual document.