JuliaText / WordTokenizers.jl

High performance tokenizers for natural language processing and other related tasks

Lowercasing each token in tokenize function #57

Closed shikhargoswami closed 3 years ago

shikhargoswami commented 3 years ago

The tokenize function returns a vector of words (strings) when an input string is passed. It doesn't lowercase the words by default. For example:

julia> text = "This is a this sentence"
"This is a this sentence"

julia> tokenize(text)
5-element Array{String,1}:
 "This"
 "is"
 "a"
 "this"
 "sentence"

The problem is that later stages of a program will treat "This" and "this" as two separate words unless the tokens are preprocessed separately. This can affect, for example, computing word frequencies over this vector. I would like to add lowercasing as a small feature of this function. Please correct me if I'm wrong or if this is already implemented elsewhere.
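To make the concern concrete, here is a minimal sketch of a case-sensitive frequency count over the tokens shown above. The token vector is written out by hand so the snippet runs with Julia Base only, and the name counts is just illustrative.

```julia
# Tokens exactly as printed by tokenize(text) in the example above.
tokens = ["This", "is", "a", "this", "sentence"]

# Case-sensitive frequency count: each distinct string gets its own key.
counts = Dict{String,Int}()
for t in tokens
    counts[t] = get(counts, t, 0) + 1
end

# "This" and "this" end up under separate keys, each counted once.
println(counts["This"], " ", counts["this"])
```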

Ayushk4 commented 3 years ago

If I understand your query correctly, you want to lowercase your string after tokenizing.

Our tokenizers do not perform any preprocessing.

You can use the lowercase function from Julia Base. If you want to preprocess your text more generally, you can use TextAnalysis.jl.

Please let me know if I understood your query correctly.
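As a rough sketch of that suggestion, Base's lowercase can be broadcast over the token vector with dot syntax. The tokens are written out by hand here so the snippet does not depend on any package, and the names lowered and counts are only illustrative.

```julia
# Tokens as printed by tokenize(text) in the original report.
tokens = ["This", "is", "a", "this", "sentence"]

# Broadcast Base.lowercase over every token.
lowered = lowercase.(tokens)

# Frequency count on the lowercased tokens: "This" and "this" now share one key.
counts = Dict{String,Int}()
for t in lowered
    counts[t] = get(counts, t, 0) + 1
end

println(counts["this"])
```

With WordTokenizers loaded, the same idea should read lowercase.(tokenize(text)). Lowercasing the whole string before tokenizing is another option, though that may change the result for any tokenizer rule that depends on case.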

shikhargoswami commented 3 years ago

Yes, that’s exactly my query. On a related note, Keras/TensorFlow lowercases by default when tokenising, to avoid this kind of error later. That’s why I had this doubt.

Ayushk4 commented 3 years ago

Okay.

I am closing this issue since your query has been answered.