That's an interesting point.
We are using Optimaize (https://github.com/optimaize/language-detector). For now, it does not support multilingual detection:
> When a text is written in multiple languages, the default algorithm of this software is not appropriate. You can try to split the text (by sentence or paragraph) and detect the individual parts. Running the language guesser on the whole text will just tell you the language that is most dominant, in the best case.
But it is a use case that we've already come across.
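As a rough sketch of that workaround, something like the following could split a document's text into paragraphs, run Optimaize on each part, and collect every language it sees. The blank-line paragraph split is an assumption; the detector setup follows the library's README:

```java
import com.google.common.base.Optional;
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.i18n.LdLocale;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfileReader;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObjectFactory;

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

public class PerParagraphDetection {

    public static Set<String> detectLanguages(String text) throws IOException {
        // Standard Optimaize setup, as in the library's README.
        LanguageDetector detector = LanguageDetectorBuilder
                .create(NgramExtractors.standard())
                .withProfiles(new LanguageProfileReader().readAllBuiltIn())
                .build();
        TextObjectFactory textFactory = CommonTextObjectFactories.forDetectingOnLargeText();

        Set<String> languages = new LinkedHashSet<>();
        // Assumption: paragraphs are separated by blank lines.
        for (String paragraph : text.split("\\n\\s*\\n")) {
            if (paragraph.isBlank()) continue;
            Optional<LdLocale> locale = detector.detect(textFactory.forText(paragraph));
            if (locale.isPresent()) {
                languages.add(locale.get().getLanguage()); // e.g. "ms", "en"
            }
        }
        return languages; // every language seen in the document, e.g. [ms, en]
    }
}
```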
It's not on the roadmap yet, but we will soon allow users to specify the language when extracting text (not exactly the same issue, but related). See https://github.com/ICIJ/datashare/issues/938
Malaysian government documents and reports often mix Bahasa Melayu (ms) and English (en), and less commonly Chinese (zh) and Tamil (ta).
Currently, entity extraction isn't run on a document if it is detected as Malay (ms), because none of the pipelines support ms at this time. However, entities could still be extracted from the parts that are in English if we could tag a document with multiple languages, e.g. ms,en, so that the entity extraction pipelines would still run for any supported language (sketched below, after the example document).
Example document: https://pardocs.sinarproject.org/documents/commitees/public-accounts-committee/parlimen-ke-14/dr-7-2019-dr-7_ocr.pdf
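A minimal sketch of that idea, assuming a document carries a set of detected language tags and a pipeline declares which languages it supports. The class name, the supported-language set, and the method are all hypothetical, not Datashare's actual API:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class MultiLanguageNer {

    // Hypothetical set of ISO 639-1 codes a NER pipeline supports; ms is not among them.
    private static final Set<String> SUPPORTED = Set.of("en", "fr", "es", "de", "zh");

    /**
     * Given every language tagged on a document (e.g. [ms, en]),
     * return the subset the entity extraction pipelines can actually process.
     */
    static Set<String> runnableLanguages(Set<String> documentLanguages) {
        Set<String> runnable = new LinkedHashSet<>(documentLanguages);
        runnable.retainAll(SUPPORTED);
        return runnable;
    }

    public static void main(String[] args) {
        // An ms,en document: extraction is skipped for ms but still runs for en.
        System.out.println(runnableLanguages(Set.of("ms", "en"))); // -> [en]
    }
}
```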