Closed shikhargoswami closed 3 years ago
If I understand your query correctly, you want to lowercase your string after tokenizing.
Our tokenizers do not perform any preprocessing.
You can use lowercasing from Julia Base. If you want to process your text, you can use TextAnalysis.jl.
Please let me know if I got your query correct.
Yes, that’s exactly my query. Related to it, there is lowercasing done by default in Keras/tensorflow when tokenising to avoid any errors later. That’s why I had this doubt.
Okay.
I am closing this issue since your query has been answered.
The tokenize function returns a vector of words(strings) when input string is passed. It doesn't lowercase each word by default. For example:
The problem here is, in further stages, the program will treat "This" and "this" as two separate words(if not preprocessed separately). This might affect in, let's say, computing frequency of words in this vector. I want to add the small functionality of lowercasing in this function. Please correct me if i'm wrong or it is implemented elsewhere.