FEATURE: Add Persian language support package

jlstro commented 1 year ago

Is your feature request related to a problem? Please describe. Aleph's OCR does not recognize Persian-language documents.

Describe the solution you'd like: Add the Tesseract Persian language pack.

Describe alternatives you've considered: The Arabic model partially works, but it is clearly not what users want.

stchris commented 1 year ago

tillprochaska commented 2 months ago

Components to consider when adding new language support:

[ ] Add relevant Tesseract models
[ ] Add relevant spaCy models (However, Aleph supports OCR for some languages even though there is no NER model available, so this is probably optional.)
[ ] Check whether this requires changes to ES index settings. (Probably not, as long as the ICU plugin supports the new language.)
[ ] Check that transliteration works.
[ ] Check if updates to xref are required (Probably not, as that works with transliterated text.)
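
For the first item on the checklist, a quick way to confirm that a Tesseract model is actually available in an environment is to inspect the output of `tesseract --list-langs`. The snippet below is only a rough sketch: it assumes the `tesseract` binary is on `PATH`, that Persian uses Tesseract's `fas` language code, and the helper names (`parse_tesseract_langs`, `installed_tesseract_langs`, `missing_langs`) are made up for illustration and are not part of Aleph.

```python
import shutil
import subprocess


def parse_tesseract_langs(output: str) -> set[str]:
    """Turn the text printed by `tesseract --list-langs` into a set of codes."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header ("List of available languages ...").
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.add(line)
    return langs


def installed_tesseract_langs() -> set[str]:
    """Return installed language codes, or an empty set if tesseract is absent."""
    if shutil.which("tesseract") is None:
        return set()
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Read both streams, since the list may not always be printed to stdout.
    return parse_tesseract_langs(result.stdout + "\n" + result.stderr)


def missing_langs(required: set[str]) -> set[str]:
    """Return the required language codes that are not installed."""
    return required - installed_tesseract_langs()


# Example (hypothetical usage): missing_langs({"fas"}) is empty once the
# Persian model is installed in the image being checked.
```

Keeping the parsing separate from the subprocess call means the check can be exercised against captured output without Tesseract being installed.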

Bonus: Document the process for adding language support as we go.

alephdata / aleph