Open kgpai opened 2 weeks ago
Name | Link |
---|---|
Latest commit | 0c0d1913b873adcfea3a402470d7960e47479917 |
Latest deploy log | https://app.netlify.com/sites/meta-velox/deploys/664cdd32b847c500088cd872 |
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@mbasmanova It affects Spark as well. Updating documentation shortly.
@mbasmanova It affects Spark as well. Updating documentation shortly.
I see. Does the new behavior match Spark?
@mbasmanova No the behavior doesnt match Spark now .
spark-sql:di> select cast('No' as boolean);
Executing query on Spark...
CAST(No AS BOOLEAN)
---------------------
false
Query Succeeded. Execution time :00:00:30
spark-sql:di> select cast('Yes' as boolean);
Executing query on Spark...
CAST(Yes AS BOOLEAN)
----------------------
true
Query Succeeded. Execution time :00:00:01
In presto both of these will throw.
True on latest spark too:
scala> spark.sql("select cast('Yes' as boolean)");
res0: org.apache.spark.sql.DataFrame = [CAST(Yes AS BOOLEAN): boolean]
scala> spark.sql("select cast('Yes' as boolean)").show()
+--------------------+
|CAST(Yes AS BOOLEAN)|
+--------------------+
| true|
+--------------------+
@mbasmanova No the behavior doesnt match Spark now .
spark-sql:di> select cast('No' as boolean); Executing query on Spark... CAST(No AS BOOLEAN) --------------------- false Query Succeeded. Execution time :00:00:30 spark-sql:di> select cast('Yes' as boolean); Executing query on Spark... CAST(Yes AS BOOLEAN) ---------------------- true Query Succeeded. Execution time :00:00:01
In presto both of these will throw.
@kgpai Thanks for confirming the behavior of Spark. Does this imply that following this PR, we need to customize for Spark? Perhaps I could first create an issue for Spark.
@rui-mo I can add a new policy (say SparkCastPolicy ) that preserves spark behavior as a part of this PR that would make it easier to call from the spark end. Will that work for you ?
@rui-mo Do we have documenation for Spark's CAST behavior? I see docs for Presto, but can't find similar for Spark: https://facebookincubator.github.io/velox/functions/presto/conversion.html
@kgpai What is Spark's behavior for cast(varchas as boolean)? What strings are allowed and which are not?
@kgpai That would be great. Thank you.
@mbasmanova We have documentation for Spark in https://github.com/facebookincubator/velox/blob/main/velox/docs/functions/spark/conversion.rst, which only covers the conversions being discovered to have different semantics with Presto.
What is Spark's behavior for cast(varchas as boolean)? What strings are allowed and which are not?
Below is what I find in Spark used by the conversion from string to boolean.
private[this] val trueStrings =
Set("t", "true", "y", "yes", "1").map(UTF8String.fromString)
private[this] val falseStrings =
Set("f", "false", "n", "no", "0").map(UTF8String.fromString)
@rui-mo Looks like spark/conversion.rst is missing in velox/docs/spark_functions.rst and therefore it doesn't appear in https://facebookincubator.github.io/velox/spark_functions.html
@rui-mo
Below is what I find in Spark used by the conversion from string to boolean.
Looks like the difference with Presto is that Spark allows y/yes/n/no strings, while Presto doesn't.
def isTrueString(s: UTF8String): Boolean = trueStrings.contains(s.trimAll().toLowerCase) def isFalseString(s: UTF8String): Boolean = falseStrings.contains(s.trimAll().toLowerCase)
Also, looks like Spark allows spaces at the start and end of the string, but Presto doesn't.
Also, looks like Spark allows spaces at the start and end of the string, but Presto doesn't.
@mbasmanova Thanks for noticing. This hook is to handle the whitespace issue for Spark when casting from string. https://github.com/facebookincubator/velox/blob/main/velox/functions/sparksql/specialforms/SparkCastHooks.h#L36-L39
@rui-mo Looks like spark/conversion.rst is missing in velox/docs/spark_functions.rst and therefore it doesn't appear in https://facebookincubator.github.io/velox/spark_functions.html
@mbasmanova I will follow up to fix it. Thanks for the pointer.
Folly (which is what we use currently) allows a little more strings than spark does,
presto:default> select cast(x as boolean) from (values 'No', 'y', 'Yes', 'off', 'on') as t(x);
_col0
-------
false
true
true
false
true
so the old implementation is more permissive for Spark too.
@rui-mo I have added a new policy that confirms to current spark behavior - please review. You will still need to make changes on spark end though.
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@kgpai Thank you. It looks good, and after this PR I will follow up to fix the Spark behavior.
Currently Velox is much more permissive than Presto java in casting string's to booleans. Presto only allows the strings - 1, 0, t, f, true, false and their uppercase equivalents. This PR restricts the strings that can be converted to boolean to these strings. Strings apart from these will throw.
Also a new policy SparkCastPolicy has been added so that cast behavior can still confirm to current spark behavior.
This PR breaks current spark behavior. cc: @PHILO-HE , @rui-mo .