alephdata / aleph

Search and browse documents and data; find the people and companies you look for.
http://docs.aleph.occrp.org
MIT License

Khmer language is not supported #1986

Closed arky closed 2 years ago

arky commented 3 years ago

It is not possible to ingest Khmer language documents for OCR, as the language is not available in the investigation.

Rosencrantz commented 3 years ago

Hi @arky

Thanks for bringing this issue to our attention. We'll look into adding Khmer into our supported languages.

Kind regards

arky commented 3 years ago

@Rosencrantz Let me know if I can pitch in and help. I would like to help out with building better ingest for South and Southeast Asian languages.

sunu commented 3 years ago

Hi @arky, we would appreciate the help for sure. We have some documentation on how to add a new language to Aleph at https://docs.alephdata.org/developers/technical-faq#how-do-i-add-support-for-a-new-language-to-aleph. The second part of the section describes how to add a new language for the ingestion pipeline.

If you could make a PR to add Khmer language support, we would be happy to merge that in. And we would be happy to help you along the process.

arky commented 3 years ago

Thank you, @sunu. I have added support for ingestion of Khmer documents. Unfortunately, there isn't a spaCy model for the Khmer language yet.

sunu commented 3 years ago

Thanks a lot @arky! I'll make sure we merge your PRs in next week before the next Aleph release.

sunu commented 2 years ago

OCR support for the Khmer language is now available in Aleph 3.12.0.