ICIJ / datashare

A self-hosted search engine for documents.
https://datashare.icij.org
GNU Affero General Public License v3.0

Multiple language tags for documents with mixed languages #309

Closed: kaerumy closed this issue 2 years ago

kaerumy commented 4 years ago

Malaysian government documents and reports often contain both Bahasa Melayu (ms) and English (en), and less commonly Chinese (zh) and Tamil (ta).

Currently, entity extraction isn't run on a document that is detected as (ms), because none of the pipelines support (ms) at this time. However, the parts that are in English could still have entities extracted if we could tag a document with multiple languages, e.g. ms,en, so that the entity extraction pipelines would still run for any alternate supported language.
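To make the request concrete, here is a minimal sketch of the selection logic, not Datashare's actual code. The function name `languages_to_process` and the set `SUPPORTED_NER_LANGUAGES` are hypothetical, and the supported languages listed are an assumption for illustration only.

```python
# Hypothetical sketch: none of these names come from Datashare itself.
# The supported set is assumed for illustration; "ms" is deliberately absent.
SUPPORTED_NER_LANGUAGES = {"en", "fr", "de", "es"}


def languages_to_process(document_languages, supported=SUPPORTED_NER_LANGUAGES):
    """Return the tagged languages that some pipeline can handle, in tag order."""
    return [lang for lang in document_languages if lang in supported]


# A document tagged only "ms" gets no entity extraction, as described above.
print(languages_to_process(["ms"]))        # []
# The same document tagged "ms,en" would still have its English parts processed.
print(languages_to_process(["ms", "en"]))  # ['en']
```

With a single language tag the document is all-or-nothing; with a list of tags, the unsupported language is simply skipped while the supported ones are still processed.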

Example document: https://pardocs.sinarproject.org/documents/commitees/public-accounts-committee/parlimen-ke-14/dr-7-2019-dr-7_ocr.pdf

bamthomas commented 4 years ago

That's an interesting point.

We are using Optimaize https://github.com/optimaize/language-detector

For now, it does not support multilingual detection:

"When a text is written in multiple languages, the default algorithm of this software is not appropriate. You can try to split the text (by sentence or paragraph) and detect the individual parts. Running the language guesser on the whole text will just tell you the language that is most dominant, in the best case."

But it is a use case that we've already come across.
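The split-and-detect approach suggested in the Optimaize README could look roughly like the sketch below. It is written in Python for brevity and does not call Optimaize: `guess_language` is a toy stopword counter standing in for a real detector, the stopword lists are far too small for real use, and the sample text is made up for illustration.

```python
import re
from collections import Counter

# Toy stopword lists standing in for a real detector such as Optimaize.
# They are illustrative assumptions only.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "for"},
    "ms": {"yang", "dan", "di", "itu", "dengan", "untuk", "ini", "dalam"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None if none match."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {lang: sum(w in stops for w in words) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def detect_languages(document, detector=guess_language):
    """Split on blank lines, detect each paragraph, rank languages by character count."""
    weights = Counter()
    for paragraph in re.split(r"\n\s*\n", document):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        lang = detector(paragraph)
        if lang is not None:
            weights[lang] += len(paragraph)
    return [lang for lang, _ in weights.most_common()]


# Made-up sample: one Malay paragraph followed by one English paragraph.
sample = (
    "Jawatankuasa ini telah meneliti laporan yang dibentangkan di Parlimen "
    "dan mengesyorkan tindakan susulan untuk kementerian itu.\n\n"
    "The committee is of the view that the ministry failed to comply."
)
print(",".join(detect_languages(sample)))  # ms,en
```

Detecting per paragraph keeps the minority language visible instead of letting the dominant one win outright. The `detector` parameter is there so a real language guesser could be plugged in without changing the splitting logic.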

pirhoo commented 2 years ago

This is not on the roadmap yet, but we will soon allow users to specify the language when extracting text (not exactly the same issue, but related). See https://github.com/ICIJ/datashare/issues/938