freedomofpress / dangerzone

Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs
https://dangerzone.rocks/
GNU Affero General Public License v3.0

OCR language autodetection #407

Open deeplow opened 1 year ago

deeplow commented 1 year ago

Users may not know which language a document is in before opening it. So we may want some way to autodetect the document language in the first container. Alternatively, detection could run exclusively after the document has been sanitized, which would make it work for scanned documents and pictures as well.

@apyrgio pointed out to me that the Internet Archive has already done some work on this and that there are Python libraries that can autodetect the language.

Multilingual documents are probably out of scope.

Notes / Evidence

During user testing, several users asked how they could know what language a document was in if they had not already opened it.

apyrgio commented 1 year ago

There's another avenue that we can consider:

Since the first stage of the conversion has access to the actual text (for specific document formats), we can grab it, run it through https://github.com/pemistahl/lingua-py, and return the language the model is most confident about. The output can be yet another file in the mounted directory, containing a 3-character language code.

Note that this approach will not work for image formats, but having OCR for those was a bonus in the first place.
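
For illustration, the text-based approach could look roughly like the sketch below. It is not Dangerzone code: the file name `detected_language.txt` and both function names are hypothetical, and it assumes the extracted text is already available as a string. The lingua-py calls (`LanguageDetectorBuilder.from_all_languages().build()`, `detect_language_of`, `iso_code_639_3`) are taken from that library's documented API. The detector is passed in as a parameter so the file-writing part can be exercised without lingua installed.

```python
from pathlib import Path
from typing import Callable, Optional

# Hypothetical file name: the proposal only says "yet another file
# in the mounted directory".
LANG_FILE_NAME = "detected_language.txt"


def detect_with_lingua(text: str) -> Optional[str]:
    """Return a lowercase ISO 639-3 code for the most likely language, or None."""
    # Imported lazily so the rest of the sketch works without lingua installed.
    from lingua import LanguageDetectorBuilder

    # Building a detector for all languages is expensive; a real
    # implementation would probably build it once and reuse it.
    detector = LanguageDetectorBuilder.from_all_languages().build()
    language = detector.detect_language_of(text)
    if language is None:
        # lingua could not reliably pick a language (e.g. empty text).
        return None
    return language.iso_code_639_3.name.lower()


def write_language_code(
    text: str,
    out_dir: str,
    detect: Callable[[str], Optional[str]] = detect_with_lingua,
) -> Optional[str]:
    """Detect the language of `text` and write its 3-character code to out_dir.

    Returns the code that was written, or None if nothing was written.
    """
    code = detect(text)
    if code is None or len(code) != 3:
        return None
    (Path(out_dir) / LANG_FILE_NAME).write_text(code, encoding="ascii")
    return code
```

One caveat with this sketch: lingua reports ISO 639-3 codes, while some Tesseract language names do not follow that scheme (`chi_sim`, for example), so a mapping step would likely be needed before the code is handed to OCR.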