diasks2 / pragmatic_tokenizer

A multilingual tokenizer to split a string into tokens
MIT License

splitting of words with # prefix at hyphen #16

Closed: maia closed this issue 8 years ago

maia commented 8 years ago

Currently, strings with a # prefix that contain a hyphen are split at the hyphen. I like this, but it might not be intended, as I thought this behavior should be controlled by the value of :long_word_split. See here:

> PragmaticTokenizer::Tokenizer.new("#ab-cd").tokenize
=> ["#ab", "cd"]
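To make the report concrete, here is a standalone sketch (not the gem's internal implementation; both helpers are hypothetical): a tokenizer that always splits at hyphens reproduces the output above, while one that defers to a long_word_split threshold leaves short hashtags intact, which is the behavior I expected.

```ruby
# Illustration only, NOT pragmatic_tokenizer internals.
# Splitting at every hyphen reproduces the reported output,
# even for short #-prefixed tokens.
def naive_tokenize(text)
  text.split(/\s+/).flat_map { |w| w.split("-") }.map(&:downcase)
end

# Expected behavior: only split hyphenated words longer than a
# long_word_split threshold, leaving short hashtags whole.
def tokenize_with_long_word_split(text, long_word_split: 10)
  text.split(/\s+/).flat_map do |word|
    word.length > long_word_split ? word.split("-") : [word]
  end.map(&:downcase)
end

naive_tokenize("#ab-cd")                # => ["#ab", "cd"]
tokenize_with_long_word_split("#ab-cd") # => ["#ab-cd"]
```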
diasks2 commented 8 years ago

OK, I see. This conflicts with these specs:

it 'handles hashtags 2' do
  text = "This is the #upper-#limit"
  pt = PragmaticTokenizer::Tokenizer.new(text,
    punctuation: 'none'
  )
  expect(pt.tokenize).to eq(["this", "is", "the", "#upper", "#limit"])
end

it 'handles hashtags 3' do
  text = "The #2016-fun has just begun."
  pt = PragmaticTokenizer::Tokenizer.new(text,
    punctuation: 'none'
  )
  expect(pt.tokenize).to eq(["the", "#2016", "fun", "has", "just", "begun"])
end

I think the two specs above should be changed to include hashtags: :keep_and_clean.

diasks2 commented 8 years ago

OK, I've made some updates: https://github.com/diasks2/pragmatic_tokenizer/commit/d6c4ac72bdd934a9bdebe7b823a6a09996be656a

Please check these two specs. I'm still not exactly clear on what your desired output would be: https://github.com/diasks2/pragmatic_tokenizer/blob/d6c4ac72bdd934a9bdebe7b823a6a09996be656a/spec/languages/english_spec.rb#L748-L764

maia commented 8 years ago

I'm not sure. :) Personally I like that hashtags are treated differently, and your suggestion of only doing so with hashtags: :keep_and_clean sounds like the best way to go.
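Roughly, what I have in mind for :keep_and_clean can be sketched as a post-processing step (assumption: :keep_and_clean keeps hashtag tokens but strips the leading "#"; the helper below is illustrative and not the gem's actual code):

```ruby
# Illustrative post-processing sketch, not pragmatic_tokenizer code.
# Assumes :keep_and_clean means: keep each hashtag token in place,
# but drop its leading "#" character.
def clean_hashtags(tokens)
  tokens.map { |t| t.start_with?("#") ? t.delete_prefix("#") : t }
end

clean_hashtags(["the", "#2016-fun", "has", "just", "begun"])
# => ["the", "2016-fun", "has", "just", "begun"]
```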

diasks2 commented 8 years ago

OK, I'll keep it like that for now, but feel free to reopen this issue if new examples pop up that we might want to treat differently.