diasks2 / pragmatic_tokenizer

A multilingual tokenizer to split a string into tokens
MIT License

Contractions don't remove dots #41

Open sheerun opened 5 years ago

sheerun commented 5 years ago
require 'pragmatic_tokenizer'

tokenizer = PragmaticTokenizer::Tokenizer.new({
  language: :pl,
  numbers: :all,
  downcase: false,
  # "os"/"os." is the Polish abbreviation for "osiedle" (housing estate)
  contractions: { "os" => "osiedle", "os." => "osiedle" },
  expand_contractions: true
})

puts tokenizer.tokenize("Na os.Piłsudskiego")

The proper tokenization should be