aboSamoor / polyglot

Multilingual text (NLP) processing toolkit
http://polyglot-nlp.com

Tokenizer via pyspark #84

Open bkj opened 7 years ago

bkj commented 7 years ago

Hi All --

I'm trying to use the tokenizer code from polyglot on a very large corpus of text, distributing the computation via pyspark. However, for various reasons, it would (AFAIK) be much easier to distribute a pure Python version of the tokenizer across the cluster. Does anyone know of such a pure Python multilingual tokenizer? ATM I'm looking into uniseg, but was wondering if anyone here had any input.

Thanks
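
For reference, here is a rough sketch of the kind of thing I mean: a tokenizer that depends only on the standard library, so it can be shipped to the workers without native dependencies. This is not polyglot's tokenizer and not uniseg. The regex rule and the names `tokenize`, `tokenize_partition`, and `tokenize_rdd` are my own assumptions for illustration. It only splits on whitespace and punctuation, so it will not segment scripts written without spaces (e.g. Chinese or Thai), and it may split words that contain combining marks.

```python
import re

# Assumption: a crude Unicode-aware rule, not polyglot's segmentation.
# \w+ matches a run of letters/digits/underscore in any script;
# [^\w\s] matches one punctuation or symbol character.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split a string into word and punctuation tokens (stdlib only)."""
    return _TOKEN_RE.findall(text)


def tokenize_partition(lines):
    """Tokenize an iterator of lines, yielding one token list per line.

    Shaped so it can be passed directly to RDD.mapPartitions.
    """
    for line in lines:
        yield tokenize(line)


def tokenize_rdd(rdd):
    """Hypothetical helper: tokenize every line of an RDD of strings."""
    return rdd.mapPartitions(tokenize_partition)
```

With a SparkContext `sc`, usage would be something like `tokenize_rdd(sc.textFile("corpus.txt"))`, where the path is a placeholder. Since the functions only use `re`, nothing extra should need to be installed on the executors.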