diasks2 / pragmatic_tokenizer

A multilingual tokenizer to split a string into tokens
MIT License

stop words and filter languages #31

Closed maia closed 8 years ago

maia commented 8 years ago

The current logic for contractions and abbreviations is: use whatever was passed via the language option unless custom contractions/abbreviations were passed, in which case use those instead.
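
A rough sketch of that behaviour in plain Ruby (names here are illustrative, not the gem's internals):

# Illustrative only: prefer a custom list if one was passed,
# otherwise fall back to the list for the configured language.
contractions =
  if opts[:contractions] && !opts[:contractions].empty?
    opts[:contractions]
  else
    language_module::CONTRACTIONS
  end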

Stop words follow a different logic, and I find that inconsistent: if filter_languages has been assigned, the language defined via language is ignored, even without custom stop words.

https://github.com/diasks2/pragmatic_tokenizer/blob/master/lib/pragmatic_tokenizer/tokenizer.rb#L125

I suggest removing this additional check from the stop words logic and adjusting the tests. Or was there a specific reason to treat stop words differently?

maia commented 8 years ago

The relevant code is currently at: https://github.com/diasks2/pragmatic_tokenizer/blob/master/lib/pragmatic_tokenizer/tokenizer.rb#L129

@stop_words += @language_module::STOP_WORDS if @stop_words.empty? && @filter_languages.empty?

…and the latter condition needs to be deleted.
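
With that condition dropped, the line would simply read (sketch of the suggested change):

@stop_words += @language_module::STOP_WORDS if @stop_words.empty?

…so the language's stop words apply whenever no custom stop words were passed, matching the contractions/abbreviations behaviour.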

diasks2 commented 8 years ago

Thanks, this should be fixed now: https://github.com/diasks2/pragmatic_tokenizer/commit/38da52d79913efd47f7fc018e8d23da1c93dddd0#diff-e341bb577cd9fb5a1bb0bbe4e7238338R130