IndicoDataSolutions / Passage

A little library for text analysis with RNNs.

MIT License

530 stars 134 forks

tokenizer confusion #23

Closed MathieuCliche closed 9 years ago

MathieuCliche commented 9 years ago

I was a little confused by the tokenizer: if I follow the example in the README, I get something like:

    from passage.preprocessing import Tokenizer
    train_text = ['hello world', 'foo bar']
    tokenizer = Tokenizer()
    train_tokens = tokenizer.fit_transform(train_text)

and then train_tokens is just lists of 2s: train_tokens = [[2,2],[2,2]]. It was explained to me that the default minimum frequency is 10, and that one can change this with tokenizer = Tokenizer(min_df=1). Maybe this should be made a bit clearer in the README example... Thanks!

gwulfs commented 9 years ago

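By default, Passage only keeps tokens that meet a minimum document frequency (min_df, which defaults to 10) and replaces all others with the unknown token, which we have set to 2. With only two short training texts, no token reaches that threshold, so every token comes back as 2. To change this behavior, set the min_df argument to something other than 10 when initializing the tokenizer, e.g. Tokenizer(min_df=1).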
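
To make the effect of the threshold concrete, here is a minimal stand-alone sketch of document-frequency filtering. It is not Passage's actual implementation: the helper names fit_vocab and transform, the whitespace tokenization, and the choice to start real token ids at 3 are assumptions made for illustration. Only the unknown id of 2 and the default of 10 come from this thread.

```python
from collections import Counter

UNK = 2       # id used for unknown tokens, as stated in this thread
FIRST_ID = 3  # assumption: ids below 3 are reserved for special tokens


def fit_vocab(texts, min_df=10):
    """Map every token seen in at least min_df documents to an integer id."""
    df = Counter()
    for text in texts:
        # A set counts each token once per document (document frequency).
        df.update(set(text.lower().split()))
    kept = sorted(tok for tok, n in df.items() if n >= min_df)
    return {tok: i for i, tok in enumerate(kept, start=FIRST_ID)}


def transform(texts, vocab):
    """Replace each token by its id, or by UNK if it was filtered out."""
    return [[vocab.get(tok, UNK) for tok in text.lower().split()]
            for text in texts]


train_text = ['hello world', 'foo bar']

# Default threshold: no token appears in 10 documents, so all become UNK.
default_tokens = transform(train_text, fit_vocab(train_text))
print(default_tokens)  # [[2, 2], [2, 2]]

# Lowered threshold: every token is kept and gets its own id.
relaxed_tokens = transform(train_text, fit_vocab(train_text, min_df=1))
print(relaxed_tokens)  # [[5, 6], [4, 3]]
```

The specific ids in the second output come from the alphabetical ordering used in this sketch, so the real library may assign different ones. The point is only that tokens stop collapsing to 2 once min_df is at or below their document frequency.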