JuliaText / WordTokenizers.jl

High performance tokenizers for natural language processing and other related tasks

Handle final periods #33

Closed. Ayushk4 closed this 5 years ago

Ayushk4 commented 5 years ago

Before:

julia> toktok_tokenize("This is a sentence. ")
4-element Array{String,1}:
 "This"     
 "is"       
 "a"        
 "sentence."

Now:

julia> toktok_tokenize("This is a sentence. ")
5-element Array{String,1}:
 "This"    
 "is"      
 "a"       
 "sentence"
 "."

This also makes minor changes to the handle_final_periods function so that it no longer re-traverses the trailing spaces at the end of the string.
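To illustrate the idea, here is a minimal sketch of separating a final period while walking over the trailing whitespace only once. It is not the package's actual handle_final_periods implementation: the name separate_final_period and the choice to leave runs such as "..." untouched are assumptions made for this example.

```julia
# Hypothetical helper for illustration only; not part of WordTokenizers.jl.
function separate_final_period(text::AbstractString)
    # Walk backwards once to find the last non-whitespace character.
    i = lastindex(text)
    while i >= firstindex(text) && isspace(text[i])
        i = prevind(text, i)
    end

    # Empty or all-whitespace input, or no final period: return the trimmed text.
    if i < firstindex(text) || text[i] != '.'
        return String(text[firstindex(text):i])
    end

    # Assumption of this sketch: a lone "." or a run such as "..." is left as-is.
    j = prevind(text, i)
    if j < firstindex(text) || text[j] == '.'
        return String(text[firstindex(text):i])
    end

    # Insert a space before the final period so it becomes its own token.
    return string(text[firstindex(text):j], " .")
end

# Whitespace splitting stands in for the rest of the tokenizer here.
tokens = split(separate_final_period("This is a sentence. "))
```

Because the index i is reused for both the whitespace trim and the period check, the trailing spaces are scanned a single time.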

oxinabox commented 5 years ago

Can you check this against the original toktok, or against the NLTK toktok? Either way I am in favor, but if we are deviating we should document that this is an enhanced version of the toktok tokenizer.

Ayushk4 commented 5 years ago

NLTK's toktok gives the following output for toktok.tokenize("This is a sentence. "):

 ['This', 'is', 'a', 'sentence.']

oxinabox commented 5 years ago

OK, cool. Let's add to the docstring that this is an enhanced version of the original toktok tokenizer.