maia closed this issue 8 years ago
OK, I see. This is conflicting with these specs:
it 'handles hashtags 2' do
  text = "This is the #upper-#limit"
  pt = PragmaticTokenizer::Tokenizer.new(text,
    punctuation: 'none'
  )
  expect(pt.tokenize).to eq(["this", "is", "the", "#upper", "#limit"])
end

it 'handles hashtags 3' do
  text = "The #2016-fun has just begun."
  pt = PragmaticTokenizer::Tokenizer.new(text,
    punctuation: 'none'
  )
  expect(pt.tokenize).to eq(["the", "#2016", "fun", "has", "just", "begun"])
end
I think the two specs above should be changed to include hashtags: :keep_and_clean.
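To make the behavior under discussion concrete, here is a minimal standalone sketch of the splitting that the two specs above expect. It does not use pragmatic_tokenizer and is not how the gem implements it; the helper name split_hashtag_tokens is hypothetical, and the rules (lowercase, split on whitespace, split only #-prefixed tokens at hyphens, strip trailing punctuation) are assumptions inferred from the expected arrays in those specs.

```ruby
# Hypothetical helper for illustration only; NOT part of pragmatic_tokenizer.
# Assumed rules, inferred from the two specs:
#   1. lowercase the text and split it on whitespace
#   2. split a token at hyphens only when it starts with '#'
#   3. strip trailing punctuation and drop tokens that end up empty
def split_hashtag_tokens(text)
  text.downcase.split(/\s+/).flat_map do |token|
    token.start_with?('#') ? token.split('-') : [token]
  end.map { |t| t.sub(/[[:punct:]]+\z/, '') }.reject(&:empty?)
end

split_hashtag_tokens("This is the #upper-#limit")
# => ["this", "is", "the", "#upper", "#limit"]
split_hashtag_tokens("The #2016-fun has just begun.")
# => ["the", "#2016", "fun", "has", "just", "begun"]
split_hashtag_tokens("A well-known #tag")
# => ["a", "well-known", "#tag"]
```

The last call shows the distinction at issue: in this sketch a hyphenated word without a # prefix is left intact, so only hashtag tokens are split at the hyphen.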
Ok, I've made some updates: https://github.com/diasks2/pragmatic_tokenizer/commit/d6c4ac72bdd934a9bdebe7b823a6a09996be656a
Please check these two specs. I'm still not exactly clear on what your desired output would be: https://github.com/diasks2/pragmatic_tokenizer/blob/d6c4ac72bdd934a9bdebe7b823a6a09996be656a/spec/languages/english_spec.rb#L748-L764
I'm not sure. :) Personally I like that hashtags are treated differently, and your suggestion of only doing so with hashtags: :keep_and_clean sounds like the best way to go.
OK, I'll keep it like that for now, but feel free to reopen this issue if new examples pop up that we might want to treat differently.
Currently, strings with a # prefix and a hyphen are split at the hyphen. I like this, but it might not be intended: I thought this behavior should be defined by the value of :long_word_split, see here: