OCR language autodetection

freedomofpress / dangerzone

Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs

GNU Affero General Public License v3.0


Users may not know which language a document is in before opening it. ~~So we may want to have some way to autodetect the document language in the first container.~~ Actually, detection could be done exclusively after the document has been sanitized, which would also make it work for scanned documents or pictures.

@apyrgio pointed out to me that the Internet Archive has already done some work on this and that there are Python libraries that can autodetect the language.

Multilingual documents are probably out of scope.
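
To make the idea concrete, here is a minimal sketch of how a detected language could be turned into an OCR language choice. It is not how Dangerzone works today. It assumes that a short text sample is already available after sanitization (for example from a first OCR pass with a default language), that the third-party `langdetect` package is used for detection, and that OCR is done with Tesseract language codes. The names `ISO_TO_TESSERACT`, `detect_with_langdetect`, and `pick_ocr_language` are hypothetical, and the mapping table is deliberately tiny.

```python
from typing import Callable

# Hypothetical, deliberately small mapping from ISO 639-1 codes (as returned
# by langdetect) to Tesseract language codes. A real table would be larger.
ISO_TO_TESSERACT = {
    "en": "eng",
    "fr": "fra",
    "de": "deu",
    "es": "spa",
    "el": "ell",
}


def detect_with_langdetect(text: str) -> str:
    """Return an ISO 639-1 code using the third-party langdetect package."""
    # Imported lazily so the rest of the sketch works without the package.
    from langdetect import detect

    return detect(text)


def pick_ocr_language(
    sample_text: str,
    detector: Callable[[str], str] = detect_with_langdetect,
    fallback: str = "eng",
) -> str:
    """Pick a single OCR language for a document, or fall back to a default.

    Only one language is returned, in line with multilingual documents
    being out of scope.
    """
    if not sample_text.strip():
        # Nothing to detect from (e.g. the first OCR pass found no text).
        return fallback
    try:
        iso_code = detector(sample_text)
    except Exception:
        # Detection can fail on very short or non-linguistic input.
        return fallback
    return ISO_TO_TESSERACT.get(iso_code, fallback)
```

The detector is passed in as a parameter so that a different library could be swapped in without changing the selection logic. The user would presumably still need a way to override the guess when detection is wrong.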

Notes / Evidence

During user testing, several users asked how they could know what language a document was in if they had not already opened it.

OCR language autodetection #407