facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

Tokenizer class that replicates Wikipedia preprocessing tokenization #779

Open · davidalbertonogueira opened 5 years ago

davidalbertonogueira commented 5 years ago

Following the request in #224 to document the preprocessing steps applied to the Wikipedia dumps, I would like to go further and suggest the creation of a Tokenizer class that would wrap those libraries and make the appropriate calls.

I think it is of the utmost relevance to allow users of the pre-trained embeddings to apply the exact same tokenization before mapping their word tokens to the fastText embeddings. It is unreasonable to expect every interested user to go through the cumbersome journey of gathering all these libraries and writing Python wrappers where necessary, just to replicate your processing.
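To make the pain point concrete, here is a minimal sketch (not from the fastText codebase) of the lookup step this is all about, using the official `fasttext` Python bindings; the model file name is illustrative, and the tokens would ideally come from the requested Tokenizer class:

```python
# Minimal sketch: once text is tokenized the same way the Wikipedia dumps
# were, each token maps to its vector. "wiki.en.bin" is an illustrative
# file name for a downloaded pre-trained model.
import fasttext

model = fasttext.load_model("wiki.en.bin")

tokens = ["Hello", "world"]  # ideally produced by the requested Tokenizer
for token in tokens:
    vec = model.get_word_vector(token)  # 300-dimensional embedding
    print(token, vec[:3])
```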

For example, from what I gathered, this is the collection of libraries used:

- the Europarl preprocessing tools for languages written in the Latin, Cyrillic, Hebrew, or Greek scripts,
- the Stanford word segmenter for Chinese,
- Mecab for Japanese,
- UETsegmenter for Vietnamese,
- the ICU tokenizer for the remaining languages.

Moreover, it is not clear whether additional abbreviation lists or similar resources were used.
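For illustration, here is a rough sketch of what such a class could look like; the class name is hypothetical, only the ICU fallback branch is fleshed out (via PyICU), and the dedicated segmenters for Chinese, Japanese, and Vietnamese would plug into the same interface:

```python
# Hypothetical sketch of the requested Tokenizer class; only the ICU
# fallback branch is implemented (via PyICU). The dedicated segmenters
# (Stanford segmenter, Mecab, UETsegmenter) are left as stubs.
from icu import BreakIterator, Locale


class WikipediaTokenizer:
    def __init__(self, lang="en"):
        if lang in ("zh", "ja", "vi"):
            raise NotImplementedError(
                "needs Stanford segmenter / Mecab / UETsegmenter")
        self._breaker = BreakIterator.createWordInstance(Locale(lang))

    def tokenize(self, text):
        self._breaker.setText(text)
        tokens, start = [], self._breaker.first()
        for end in self._breaker:  # yields successive word boundaries
            token = text[start:end].strip()
            if token:
                tokens.append(token)
            start = end
        return tokens


print(WikipediaTokenizer("en").tokenize("Can't we ship this class?"))
```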

davidalbertonogueira commented 5 years ago

@loretoparisi mentioned that fastBPE could be an interesting pointer for people who want to create their own fastText embeddings from scratch in a language-agnostic way, freeing them from tokenization concerns (#725). However, for people wanting to use the existing pre-trained embeddings, such a library is of no avail.
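Still, for the from-scratch case, here is a hedged sketch of that route using fastBPE's Python bindings (https://github.com/glample/fastBPE); the file names are illustrative, and the codes/vocabulary are assumed to have been learned beforehand with the fastBPE CLI:

```python
# Sketch only: applying pre-learned BPE codes with fastBPE's Python
# bindings. "codes" and "vocab.txt" are illustrative file names; they
# must be produced first with the fastBPE CLI (learnbpe / getvocab).
import fastBPE

bpe = fastBPE.fastBPE("codes", "vocab.txt")
# Splits words into subword units, so new embeddings can be trained
# without language-specific tokenization rules.
print(bpe.apply(["Tokenization is cumbersome"]))
```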