JuliaText / WordTokenizers.jl

High performance tokenizers for natural language processing and other related tasks

Tokenize begins with full stop. #28

Closed haampie closed 5 years ago

haampie commented 5 years ago
julia> tokenize("hello world.")
3-element Array{String,1}:
 "."
 "hello"
 "world"

Shouldn't this return ["hello", "world", "."]?

oxinabox commented 5 years ago

Hmm, how did the tests miss this? The bug is at https://github.com/JuliaText/WordTokenizers.jl/blob/master/src/words/TokTok.jl#L118 (and again later in that function). That function should not flush. It should return the final character(s), which are then flushed at the end, after all other tokens.

A PR would be appreciated.
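
To illustrate the suggested fix, here is a minimal sketch of the "return the final character(s), flush at the end" pattern. It is not the TokTok.jl code: `split_final_period` and `sketch_tokenize` are hypothetical names made up for this example, and the sketch only handles a single trailing full stop with whitespace-separated words.

```julia
# Hypothetical sketch; NOT the actual TokTok.jl implementation.

# Separate a trailing full stop from the text.
# Returns (body, final): `final` is "." if the text ended with one, else "".
# Nothing is emitted (flushed) here; the caller decides when to emit `final`.
function split_final_period(text::AbstractString)
    stripped = rstrip(text)
    if endswith(stripped, ".")
        return chop(stripped), "."
    end
    return stripped, ""
end

# Tokenize on whitespace, emitting the final full stop last.
function sketch_tokenize(text::AbstractString)
    body, final = split_final_period(text)
    tokens = String[]
    for word in split(body)
        push!(tokens, String(word))
    end
    # Flush the final character(s) only after every other token is out.
    isempty(final) || push!(tokens, final)
    return tokens
end

println(sketch_tokenize("hello world."))
```

Because the helper hands the trailing "." back to its caller instead of emitting it immediately, the full stop cannot end up ahead of "hello" and "world" in the result.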