Open amitkdutta opened 2 months ago
CC: @yhwang @kgpai @kagamiori
I looked into this problem. The C++ stemmer library does not handle capitalized input well. For example, stem(Generally) ==> Gener, but stem(generally) ==> general. To avoid this, Velox converts the input to lowercase before stemming: https://github.com/facebookincubator/velox/blob/main/velox/functions/prestosql/WordStem.h#L106
In Java we do not convert to lowercase: https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/WordStemFunction.java
We could try converting the stemmed word back to its original capitalization, but a Unicode-aware implementation would be somewhat complex. Alternatively, we could restore capitalization only for ASCII strings. Happy to hear more thoughts.
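To make the ASCII-only option concrete, here is a minimal sketch of what restoring capitalization after stemming could look like. `restoreAsciiCase` and the small helpers around it are hypothetical names for illustration, not part of Velox or the stemmer library. The sketch assumes the caller has already lowercased the input and stemmed it, and it leaves the stem untouched whenever either string contains non-ASCII bytes.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only; none of these exist in Velox.

inline bool isAsciiUpper(char c) {
  return c >= 'A' && c <= 'Z';
}

inline char asciiToLower(char c) {
  return isAsciiUpper(c) ? static_cast<char>(c - 'A' + 'a') : c;
}

inline char asciiToUpper(char c) {
  return (c >= 'a' && c <= 'z') ? static_cast<char>(c - 'a' + 'A') : c;
}

inline bool isAscii(const std::string& s) {
  return std::all_of(s.begin(), s.end(), [](char c) {
    return static_cast<unsigned char>(c) < 0x80;
  });
}

// Copies upper-case letters from `original` onto `stemmed`, position by
// position. `stemmed` is assumed to be the stem of the lowercased input.
// A position is only upper-cased when the stem still holds the same letter
// as the original, so letters rewritten by the stemmer are left alone.
std::string restoreAsciiCase(
    const std::string& original,
    const std::string& stemmed) {
  // Assumption: non-ASCII input is out of scope, so return the stem as-is.
  if (!isAscii(original) || !isAscii(stemmed)) {
    return stemmed;
  }
  std::string result = stemmed;
  const std::size_t n = std::min(original.size(), result.size());
  for (std::size_t i = 0; i < n; ++i) {
    if (isAsciiUpper(original[i]) && result[i] == asciiToLower(original[i])) {
      result[i] = asciiToUpper(result[i]);
    }
  }
  return result;
}
```

With this sketch, `restoreAsciiCase("Generally", "general")` yields `General`, while a non-ASCII input returns the lowercase stem unchanged. Whether that matches Presto Java's output for every input would still need to be checked against the Java implementation.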
Bug description
@spershin found that word_stem in Velox/Presto C++ does not preserve the capitalization of words, unlike Presto Java.
In Presto Java:
In Velox/Presto C++, the same query returns:
System information
Any platform
Relevant logs