facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

Tokenizer class that replicates Wikipedia preprocessing tokenization #779

Open · davidalbertonogueira opened 5 years ago

davidalbertonogueira commented 5 years ago

Following the request in #224 to document the preprocessing steps applied to the Wikipedia dumps, I would like to go further and suggest the creation of a Tokenizer class that would wrap those libraries and make the appropriate calls.

I think it is of the utmost relevance to allow users of the pre-trained embeddings to apply the exact same tokenization before mapping their word tokens to the fastText embeddings. It is unreasonable to expect every interested user to go through the cumbersome journey of gathering all these libraries and writing Python wrappers where necessary, just to replicate your processing.
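To make the pain point concrete, here is a minimal sketch (not from the fastText codebase) of the lookup step this is all about, using the official `fasttext` Python bindings; the model file name is illustrative, and the tokens would ideally come from the requested Tokenizer class:

```python
# Minimal sketch: once text is tokenized the same way the Wikipedia dumps
# were, each token maps to its vector. "wiki.en.bin" is an illustrative
# file name for a downloaded pre-trained model.
import fasttext

model = fasttext.load_model("wiki.en.bin")

tokens = ["Hello", "world"]  # ideally produced by the requested Tokenizer
for token in tokens:
    vec = model.get_word_vector(token)  # 300-dimensional embedding
    print(token, vec[:3])
```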

For example, from what I gathered, this is the collection of libraries used:

- the Europarl preprocessing tools for languages written in the Latin, Cyrillic, Hebrew, or Greek scripts,
- the Stanford word segmenter for Chinese,
- Mecab for Japanese,
- UETsegmenter for Vietnamese,
- the ICU tokenizer for the remaining languages.

Moreover, it is not clear whether additional abbreviation lists or similar resources were used.
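For illustration, here is a rough sketch of what such a class could look like; the class name is hypothetical, only the ICU fallback branch is fleshed out (via PyICU), and the dedicated segmenters for Chinese, Japanese, and Vietnamese would plug into the same interface:

```python
# Hypothetical sketch of the requested Tokenizer class; only the ICU
# fallback branch is implemented (via PyICU). The dedicated segmenters
# (Stanford segmenter, Mecab, UETsegmenter) are left as stubs.
from icu import BreakIterator, Locale


class WikipediaTokenizer:
    def __init__(self, lang="en"):
        if lang in ("zh", "ja", "vi"):
            raise NotImplementedError(
                "needs Stanford segmenter / Mecab / UETsegmenter")
        self._breaker = BreakIterator.createWordInstance(Locale(lang))

    def tokenize(self, text):
        self._breaker.setText(text)
        tokens, start = [], self._breaker.first()
        for end in self._breaker:  # yields successive word boundaries
            token = text[start:end].strip()
            if token:
                tokens.append(token)
            start = end
        return tokens


print(WikipediaTokenizer("en").tokenize("Can't we ship this class?"))
```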

davidalbertonogueira commented 5 years ago

@loretoparisi mentioned that fastBPE could be an interesting pointer for people who want to create their own fastText embeddings from scratch in a language-agnostic way, freeing them from tokenization concerns (#725). However, for people wanting to use the existing pre-trained embeddings, such a library is of no avail.
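Still, for the from-scratch case, here is a hedged sketch of that route using fastBPE's Python bindings (https://github.com/glample/fastBPE); the file names are illustrative, and the codes/vocabulary are assumed to have been learned beforehand with the fastBPE CLI:

```python
# Sketch only: applying pre-learned BPE codes with fastBPE's Python
# bindings. "codes" and "vocab.txt" are illustrative file names; they
# must be produced first with the fastBPE CLI (learnbpe / getvocab).
import fastBPE

bpe = fastBPE.fastBPE("codes", "vocab.txt")
# Splits words into subword units, so new embeddings can be trained
# without language-specific tokenization rules.
print(bpe.apply(["Tokenization is cumbersome"]))
```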