diasks2 / pragmatic_tokenizer

A multilingual tokenizer to split a string into tokens
MIT License
90 stars, 11 forks

urls should not be downcased #33

Open maia opened 8 years ago

maia commented 8 years ago

URLs with uppercase letters are rare, but they do exist. Because the path of a URL can be case-sensitive, URLs should not be transformed when the option downcase: true is used. The following result is therefore not desired:

> PragmaticTokenizer::Tokenizer.new(downcase: true).tokenize("http://test.com/UPPERCASE")
=> ["http://test.com/uppercase"]

diasks2 commented 8 years ago

I added a spec for this (https://github.com/diasks2/pragmatic_tokenizer/commit/f2198d669a9954eae3e2150f9ac4e962072472ef). This is a tough one: I can't think of a good way to do it without taking a big performance hit. If you have any ideas, let me know.
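
One possible direction, offered only as a rough sketch and not as how pragmatic_tokenizer works internally: split the input on a URL pattern and downcase only the pieces between URLs. The helper name downcase_except_urls and the simplified URL_PATTERN below are assumptions made for illustration, and the performance cost on large inputs has not been measured here.

```ruby
# Hypothetical helper, not part of pragmatic_tokenizer.
# Assumption: a URL is "http://" or "https://" followed by non-whitespace
# characters. Real-world URL detection is considerably more involved.
URL_PATTERN = %r{(https?://\S+)}i

def downcase_except_urls(text)
  # String#split with a capture group keeps the matched URLs in the result,
  # so the array alternates: non-URL text (even index), URL (odd index).
  parts = text.split(URL_PATTERN)
  parts.each_with_index.map do |part, index|
    index.odd? ? part : part.downcase
  end.join
end

downcase_except_urls("Visit http://test.com/UPPERCASE Now")
# => "visit http://test.com/UPPERCASE now"
```

The sketch leaves the whole URL untouched, including the scheme and host, which keeps the logic simple at the cost of not normalizing the case-insensitive parts of the URL.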