alpheios-project / tokenizer

Alpheios Tokenizer Service

additional language dependencies #33

Open balmas opened 3 years ago

balmas commented 3 years ago

#31 identified a tokenizer error with Chinese due to a missing dependency.

The spaCy documentation lists additional dependencies for a number of languages at https://spacy.io/usage/models#languages:

Japanese: Unidic, Mecab, SudachiPy
Russian: pymorphy2
Ukrainian: pymorphy2
Thai: pythainlp
Korean: mecab-ko, mecab-ko-dic, natto-py
Vietnamese: Pyvi

@irina060981 if you can confirm the Chinese fix works (and the Dockerfile fix too), maybe you can add these dependencies too?
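One way to confirm the fix, and to check the new languages as they are added, is a quick smoke test outside the service: build a blank spaCy pipeline for each language and tokenize a short sample, so a missing backend surfaces as an ImportError instead of a 500 from the API. This is a minimal sketch; the sample sentences and the script itself are illustrative and not part of the repo:

```python
# Smoke test (illustrative): try each language's blank pipeline and report failures.
import spacy

SAMPLES = {  # language code -> short sample text (illustrative only)
    "zh": "这是一个测试。",
    "ja": "これはテストです。",
    "ru": "Это тест.",
    "uk": "Це тест.",
    "th": "นี่คือการทดสอบ",
    "ko": "이것은 테스트입니다.",
    "vi": "Đây là một bài kiểm tra.",
}

for code, text in SAMPLES.items():
    try:
        nlp = spacy.blank(code)  # fails here if the language-specific backend is missing
        tokens = [t.text for t in nlp(text)]
        print(f"{code}: OK ({len(tokens)} tokens)")
    except Exception as err:  # e.g. ImportError for sudachipy, pythainlp, ...
        print(f"{code}: FAILED - {err}")
```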

irina060981 commented 3 years ago

@balmas - I spent all day adding these libraries, and here are my results: I was able to add most of them, but ran into problems with Japanese and Korean.

Japanese needs Unidic, Mecab, SudachiPy

I was able to find versions of Unidic and Mecab that work in our environment.

But I didn't find a version of SudachiPy that works with Cython in our build, and I was not able to install all of its requirements: flake8, flake8-import-order, flake8-builtins.

There is a compiled library combining SudachiPy and Cython - https://github.com/polm/fugashi - but spaCy requires the sudachipy module (according to the error).
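For reference (based on fugashi's own description): fugashi is a Cython wrapper around MeCab, not around SudachiPy, which would explain the error above - spaCy's Japanese tokenizer imports the sudachipy package by name, so a differently named wrapper cannot satisfy it. A minimal reproduction sketch, not part of the service:

```python
# Minimal sketch: spaCy imports sudachipy by name for Japanese, so installing
# fugashi does not help. This raises ImportError until sudachipy and its
# dictionary (sudachidict_core) are available.
import spacy

nlp = spacy.blank("ja")
print([token.text for token in nlp("これはテストです。")])
```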

Korean needs mecab-ko, mecab-ko-dic, natto-py

I was able to install natto-py but failed with mecab-ko and mecab-ko-dic; they failed with specific errors.
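A likely reason (hedged): natto-py is a pip-installable binding, while mecab-ko is a C program and mecab-ko-dic is its dictionary - both are built with configure/make rather than pip, so they have to be compiled into the Docker image before natto-py can find them. A minimal sketch of how natto-py would be exercised once that is done; the library path below is hypothetical:

```python
# Minimal sketch (assumption: mecab-ko and mecab-ko-dic were compiled into the
# image beforehand; the path below is a hypothetical install location).
import os
from natto import MeCab

# natto-py is only a binding: it needs the compiled mecab library at runtime.
os.environ.setdefault("MECAB_PATH", "/usr/local/lib/libmecab.so")

with MeCab() as nm:
    print(nm.parse("이것은 테스트입니다"))
```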

I can continue with it tomorrow - it is really difficult to build the container during my evening/night, as it takes much more time. I hope Docker resource traffic will be lower in my morning.

@balmas, how much time do you think it is worth spending on Korean and Japanese support?

balmas commented 3 years ago

@balmas, how much time do you think it is worth spending on Korean and Japanese support?

@irina060981 let's not worry about those for the moment. Thanks.

monzug commented 3 years ago

Telugu and Sanskrit also give a 500 error; see the attachment.

[Screenshot attachment: Screen Shot 2021-03-15 at 2 43 03 PM]