Hi All --

I'm trying to use the tokenizer code from polyglot on a very large corpus of text, distributing the computation via pyspark. However, for various reasons (AFAIK), it would be much easier to distribute a pure Python version of the tokenizer across the cluster. Does anyone know of such a pure Python multilingual tokenizer? ATM I'm looking into uniseg, but was wondering if anyone here had any input.

Thanks
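For context, here is a minimal sketch of the kind of dependency-free callable that is easy to ship to Spark executors. This is not polyglot's tokenizer and not a full UAX #29 word segmenter (that is what libraries like uniseg implement); it just illustrates a pure-Python, stdlib-only fallback, assuming Unicode-aware `\w` matching in Python 3's `re` module.

```python
import re

# Runs of Unicode word characters become tokens; any single
# non-word, non-space character (punctuation) is its own token.
# Python 3's re matches \w against Unicode letters/digits by default.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens (rough sketch,
    not a replacement for a real word-boundary segmenter)."""
    return TOKEN_RE.findall(text)
```

Because it is a plain top-level function with no C extensions behind it, it pickles cleanly and can be handed straight to something like `rdd.map(tokenize)` without installing anything on the workers.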