Hi All --

I'm trying to use the tokenizer code from polyglot on a very large corpus of text, distributing the computation via pyspark. However, for various reasons (AFAIK), it would be much easier to distribute a pure Python version of the tokenizer across the cluster. Does anyone know of such a pure Python multilingual tokenizer? ATM I'm looking into uniseg, but was wondering if anyone here had any input.

Thanks
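For context, here is a minimal sketch of the kind of dependency-free callable that is easy to ship to Spark executors. This is not polyglot's tokenizer and not a full UAX #29 word segmenter (that is what libraries like uniseg implement); it just illustrates a pure-Python, stdlib-only fallback, assuming Unicode-aware `\w` matching in Python 3's `re` module.

```python
import re

# Runs of Unicode word characters become tokens; any single
# non-word, non-space character (punctuation) is its own token.
# Python 3's re matches \w against Unicode letters/digits by default.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens (rough sketch,
    not a replacement for a real word-boundary segmenter)."""
    return TOKEN_RE.findall(text)
```

Because it is a plain top-level function with no C extensions behind it, it pickles cleanly and can be handed straight to something like `rdd.map(tokenize)` without installing anything on the workers.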