IndicoDataSolutions / Passage

A little library for text analysis with RNNs.
MIT License
530 stars 134 forks

Tokeniser issue #43

Closed bottydim closed 8 years ago

bottydim commented 8 years ago

It seems like the Tokenizer is broken, since the following code snippet:

```python
train_text = ['hello world', 'foo bar']

tokenizer = Tokenizer()
train_tokens = tokenizer.fit_transform(train_text)
```

results in:

```python
[[2, 2], [2, 2]]
```

Newmu commented 8 years ago

By default the tokenizer replaces all words with frequency less than 10 with UNKNOWN tokens, which are represented by the symbol 2. You can change this by calling `Tokenizer(min_df=1)` instead.

If you use `tokenizer.inverse_transform([[2, 2], [2, 2]])` it will return `["UNK UNK", "UNK UNK"]`.

bottydim commented 8 years ago

Makes perfect sense, thank you very much for your help and time :)