Open deeplow opened 1 year ago
There's another avenue that we can consider:
Since the first stage of the conversion has access to the actual text (for specific document formats), we can grab it, run it through https://github.com/pemistahl/lingua-py, and return the language the model is most confident about. The output can be yet another file in the mounted directory, which must contain the three-character language code.
Note that this approach will not work for image formats, but having OCR for those was a bonus in the first place.
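A minimal sketch of what that could look like, assuming lingua-py is installed in the first container. The output file name `language.txt`, the helper names, the lowercase ISO 639-3 form, and the `und` ("undetermined") fallback are all assumptions made for illustration, not something already decided:

```python
from pathlib import Path
from typing import Callable, Optional


def detect_language_code(text: str) -> Optional[str]:
    """Return a lowercase ISO 639-3 code for `text`, or None if undetected."""
    # Imported lazily so the file-writing helper below can be used (and
    # tested) without lingua-py being installed.
    from lingua import LanguageDetectorBuilder

    detector = LanguageDetectorBuilder.from_all_languages().build()
    language = detector.detect_language_of(text)
    if language is None:
        return None
    # iso_code_639_3 is an enum member such as IsoCode639_3.ENG.
    return language.iso_code_639_3.name.lower()


def write_language_file(
    text: str,
    out_dir: str,
    detect: Callable[[str], Optional[str]] = detect_language_code,
) -> Path:
    """Write the detected three-character code to a file in `out_dir`."""
    # "und" is the ISO 639-3 code for an undetermined language (assumed fallback).
    code = detect(text) or "und"
    path = Path(out_dir) / "language.txt"  # hypothetical file name
    path.write_text(code, encoding="ascii")
    return path
```

The second stage would then read that file from the mounted directory and map the code to whatever the OCR step expects. Building a detector over all languages is slow and memory-hungry, so restricting it to the languages we ship OCR data for may be worth considering.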
Users may not know which language a document is in before opening it.
So we may want to have some way to autodetect the document language in the first container.

Actually, this could be done exclusively after we have the document sanitized, which would make it work for scanned documents or pictures.

@apyrgio pointed out to me that the Internet Archive has already done some work on this and that there are Python libraries that can autodetect the language.
Multilingual documents are probably out of scope.
Notes / Evidence
During user testing, several users asked how they could know what language a document was in if they had not already opened it.