facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.53k stars 1.16k forks source link

word_stem() decapitalizes words in Velox, but Java preserves capitalization. #11012

Open amitkdutta opened 2 months ago

amitkdutta commented 2 months ago

Bug description

@spershin found that word_stem library does not preserve the capitalization of words like Presto Java

In Presto Java:

SELECT word_stem(c0) from (values ('Reston'), ('Earnings'), ('Q2'), ('PID'), ('RuntimeError'), ('PRIZE')) t(c0);
    _col0             
--------------
 Reston       
 Earn         
 Q2           
 PID          
 RuntimeError 
 PRIZE        
(6 rows)

In Velox/Presto C++, same query returns

SELECT word_stem(c0) from (values ('Reston'), ('Earnings'), ('Q2'), ('PID'), ('RuntimeError'), ('PRIZE')) t(c0);
    _col0             
--------------
 reston       
 earn         
 q2           
 pid          
 runtimeerror 
 prize        
(6 rows)

System information

Any platform

Relevant logs

No specific logs
amitkdutta commented 2 months ago

CC: @yhwang @kgpai @kagamiori

pansatadru commented 1 month ago

I looked into this problem: the c++ stemmer library is not working well if letters are capitalized for example: stem(Generally) ==> Gener but stem(generally) ==> general To avoid this problem, velox is converting to lowercase https://github.com/facebookincubator/velox/blob/main/velox/functions/prestosql/WordStem.h#L106

In java we are not converting to lowercase: https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/WordStemFunction.java


We can try to convert the stemmed word back to capitalized form, but the unicode implementation of it will be somewhat complex. Alternatively, we can only try to convert back the ascii strings to their capitalized form. Happy to know more thoughts.