facebookincubator / velox

A C++ vectorized database acceleration library aimed to optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0
3.2k stars 1.06k forks source link

Restrict CAST of string to boolean #9833

Open kgpai opened 2 weeks ago

kgpai commented 2 weeks ago

Currently Velox is much more permissive than Presto java in casting string's to booleans. Presto only allows the strings - 1, 0, t, f, true, false and their uppercase equivalents. This PR restricts the strings that can be converted to boolean to these strings. Strings apart from these will throw.

Also a new policy SparkCastPolicy has been added so that cast behavior can still confirm to current spark behavior.

This PR breaks current spark behavior. cc: @PHILO-HE , @rui-mo .

netlify[bot] commented 2 weeks ago

Deploy Preview for meta-velox canceled.

Name Link
Latest commit 0c0d1913b873adcfea3a402470d7960e47479917
Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/664cdd32b847c500088cd872
facebook-github-bot commented 2 weeks ago

@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

kgpai commented 1 week ago

@mbasmanova It affects Spark as well. Updating documentation shortly.

mbasmanova commented 1 week ago

@mbasmanova It affects Spark as well. Updating documentation shortly.

I see. Does the new behavior match Spark?

kgpai commented 1 week ago

@mbasmanova No the behavior doesnt match Spark now .

spark-sql:di> select cast('No' as boolean);
Executing query on Spark...
 CAST(No AS BOOLEAN) 
---------------------
 false               
Query Succeeded. Execution time :00:00:30
spark-sql:di> select cast('Yes' as boolean);
Executing query on Spark...
 CAST(Yes AS BOOLEAN) 
----------------------
 true                 
Query Succeeded. Execution time :00:00:01

In presto both of these will throw.

kgpai commented 1 week ago

True on latest spark too:


scala> spark.sql("select cast('Yes' as boolean)");
res0: org.apache.spark.sql.DataFrame = [CAST(Yes AS BOOLEAN): boolean]

scala> spark.sql("select cast('Yes' as boolean)").show()
+--------------------+                                                          
|CAST(Yes AS BOOLEAN)|
+--------------------+
|                true|
+--------------------+
rui-mo commented 1 week ago

@mbasmanova No the behavior doesnt match Spark now .

spark-sql:di> select cast('No' as boolean);
Executing query on Spark...
 CAST(No AS BOOLEAN) 
---------------------
 false               
Query Succeeded. Execution time :00:00:30
spark-sql:di> select cast('Yes' as boolean);
Executing query on Spark...
 CAST(Yes AS BOOLEAN) 
----------------------
 true                 
Query Succeeded. Execution time :00:00:01

In presto both of these will throw.

@kgpai Thanks for confirming the behavior of Spark. Does this imply that following this PR, we need to customize for Spark? Perhaps I could first create an issue for Spark.

kgpai commented 1 week ago

@rui-mo I can add a new policy (say SparkCastPolicy ) that preserves spark behavior as a part of this PR that would make it easier to call from the spark end. Will that work for you ?

mbasmanova commented 1 week ago

@rui-mo Do we have documenation for Spark's CAST behavior? I see docs for Presto, but can't find similar for Spark: https://facebookincubator.github.io/velox/functions/presto/conversion.html

@kgpai What is Spark's behavior for cast(varchas as boolean)? What strings are allowed and which are not?

rui-mo commented 1 week ago

@kgpai That would be great. Thank you.

rui-mo commented 1 week ago

@mbasmanova We have documentation for Spark in https://github.com/facebookincubator/velox/blob/main/velox/docs/functions/spark/conversion.rst, which only covers the conversions being discovered to have different semantics with Presto.

What is Spark's behavior for cast(varchas as boolean)? What strings are allowed and which are not?

Below is what I find in Spark used by the conversion from string to boolean.

   private[this] val trueStrings =
    Set("t", "true", "y", "yes", "1").map(UTF8String.fromString)

  private[this] val falseStrings =
    Set("f", "false", "n", "no", "0").map(UTF8String.fromString)

link: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala#L75-L79

mbasmanova commented 1 week ago

@rui-mo Looks like spark/conversion.rst is missing in velox/docs/spark_functions.rst and therefore it doesn't appear in https://facebookincubator.github.io/velox/spark_functions.html

mbasmanova commented 1 week ago

@rui-mo

Below is what I find in Spark used by the conversion from string to boolean.

Looks like the difference with Presto is that Spark allows y/yes/n/no strings, while Presto doesn't.

mbasmanova commented 1 week ago

def isTrueString(s: UTF8String): Boolean = trueStrings.contains(s.trimAll().toLowerCase) def isFalseString(s: UTF8String): Boolean = falseStrings.contains(s.trimAll().toLowerCase)

Also, looks like Spark allows spaces at the start and end of the string, but Presto doesn't.

rui-mo commented 1 week ago

Also, looks like Spark allows spaces at the start and end of the string, but Presto doesn't.

@mbasmanova Thanks for noticing. This hook is to handle the whitespace issue for Spark when casting from string. https://github.com/facebookincubator/velox/blob/main/velox/functions/sparksql/specialforms/SparkCastHooks.h#L36-L39

rui-mo commented 1 week ago

@rui-mo Looks like spark/conversion.rst is missing in velox/docs/spark_functions.rst and therefore it doesn't appear in https://facebookincubator.github.io/velox/spark_functions.html

@mbasmanova I will follow up to fix it. Thanks for the pointer.

kgpai commented 1 week ago

Folly (which is what we use currently) allows a little more strings than spark does,

presto:default> select cast(x as boolean) from (values 'No', 'y', 'Yes', 'off',  'on') as t(x);
 _col0                
-------
 false 
 true  
 true  
 false 
 true  

so the old implementation is more permissive for Spark too.

kgpai commented 1 week ago

@rui-mo I have added a new policy that confirms to current spark behavior - please review. You will still need to make changes on spark end though.

facebook-github-bot commented 1 week ago

@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

rui-mo commented 1 week ago

@kgpai Thank you. It looks good, and after this PR I will follow up to fix the Spark behavior.