diasks2 / pragmatic_tokenizer

A multilingual tokenizer to split a string into tokens
MIT License
90 stars 11 forks

stop words not replaceable #34

Closed maia closed 8 years ago

maia commented 8 years ago

Currently, stop words are treated differently than contractions and abbreviations: as soon as `filter_languages` is specified, the stop words of those languages will always be used, even if custom stop words are provided. I believe this should not be the case; a fix would only require removing the latter condition in tokenizer.rb#L129. (Some tests would probably need updating too.)
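To make the reported behavior concrete, here is a minimal stand-alone sketch (not the gem's actual code; the constant and method names are hypothetical) contrasting the current merge behavior with the replacement behavior the issue asks for:

```ruby
# Hypothetical per-language stop word sets, standing in for the gem's
# bundled language files.
LANGUAGE_STOP_WORDS = {
  en: %w[the a an and],
  de: %w[der die das und]
}.freeze

# Current behavior (as reported): once filter_languages is set, the
# language defaults are always unioned in, even when custom stop words
# are supplied.
def stop_words_current(filter_languages:, custom: nil)
  words = custom ? custom.dup : []
  filter_languages.each { |lang| words |= LANGUAGE_STOP_WORDS.fetch(lang, []) }
  words
end

# Proposed behavior: custom stop words, when provided, replace the
# language defaults instead of being merged with them.
def stop_words_proposed(filter_languages:, custom: nil)
  return custom if custom
  filter_languages.flat_map { |lang| LANGUAGE_STOP_WORDS.fetch(lang, []) }.uniq
end

p stop_words_current(filter_languages: [:en], custom: %w[foo])
# custom list merged with the English defaults
p stop_words_proposed(filter_languages: [:en], custom: %w[foo])
# custom list used on its own
```

Under this reading, removing the condition at tokenizer.rb#L129 would switch the gem from the first behavior to the second, matching how custom contractions and abbreviations already work.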

maia commented 8 years ago

Closing this as it's a duplicate of #31.