maia closed this issue 8 years ago
The relevant code currently is at: https://github.com/diasks2/pragmatic_tokenizer/blob/master/lib/pragmatic_tokenizer/tokenizer.rb#L129
@stop_words += @language_module::STOP_WORDS if @stop_words.empty? && @filter_languages.empty?
…and the latter condition needs to be deleted.
Thanks, this should be fixed now: https://github.com/diasks2/pragmatic_tokenizer/commit/38da52d79913efd47f7fc018e8d23da1c93dddd0#diff-e341bb577cd9fb5a1bb0bbe4e7238338R130
The current logic for contractions and abbreviations is: use whatever was passed as the `language` option, unless custom contractions/abbreviations were passed, in which case use those instead. Stop words follow a different logic, which I find inconsistent: if `filter_languages` has been assigned, the language defined via `language` is ignored, even without custom stop words.
https://github.com/diasks2/pragmatic_tokenizer/blob/master/lib/pragmatic_tokenizer/tokenizer.rb#L125
I suggest to remove this additional check from stop words and adjust the tests. Or was there a specific reason to treat stop words differently?
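To make the difference concrete, here is a minimal standalone sketch of the two behaviours. It is not the library's code: `LANGUAGE_STOP_WORDS` and both method names are hypothetical stand-ins, and the word lists are made up for illustration.

```ruby
# Hypothetical stand-in for the per-language stop word lists
# (not the library's real data).
LANGUAGE_STOP_WORDS = {
  en: %w[the a an],
  de: %w[der die das]
}.freeze

# Behaviour described above: the language defaults are skipped as soon as
# filter_languages is set, even when no custom stop words were passed.
def stop_words_with_extra_check(language:, stop_words: [], filter_languages: [])
  words = stop_words.dup
  words += LANGUAGE_STOP_WORDS.fetch(language) if words.empty? && filter_languages.empty?
  words
end

# Suggested behaviour: same rule as contractions and abbreviations.
# Custom stop words win; otherwise fall back to the language option.
# filter_languages is accepted but deliberately has no effect on this fallback.
def stop_words_consistent(language:, stop_words: [], filter_languages: [])
  words = stop_words.dup
  words += LANGUAGE_STOP_WORDS.fetch(language) if words.empty?
  words
end

stop_words_with_extra_check(language: :en, filter_languages: [:de]) # => []
stop_words_consistent(language: :en, filter_languages: [:de])       # => ["the", "a", "an"]
```

With the extra check, setting `filter_languages` silently drops the defaults for `language`. Without it, the fallback depends only on whether custom stop words were given, which matches how contractions and abbreviations are handled.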