facebookincubator / velox

A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0

Thoughts on fixing semantic conflict issues for common functions #4876

Open PHILO-HE opened 1 year ago

PHILO-HE commented 1 year ago

Description

As we know, for Presto functions under the prestosql folder, if there are semantic conflicts with Spark, we generally fix them by re-implementing the function under the sparksql folder.

Recently, we found semantic conflicts in some common functions, like cast. For example, Spark allows a decimal string, like "123.5", to be cast to int type, but Velox currently produces null for such a case. This issue can be fixed by introducing a config to let the user control the expected behavior. I note Velox has a config called kCastIntByTruncate to fix a similar issue. But if multiple configs are introduced, the code becomes complex and hard to maintain. Another thought is that we can separate the implementations, like what we did under the sparksql & prestosql folders, but such an approach produces some duplicated code.
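For illustration, here is a minimal sketch of the Spark-like behavior described above (a hypothetical standalone helper, not Velox's actual cast code): parse the string as a floating-point value and truncate toward zero, rather than producing null.

#include <cstdint>
#include <limits>
#include <optional>
#include <string>

// Hypothetical helper, only for illustration; this is not Velox's cast code.
// Spark-like semantics: a decimal string such as "123.5" is parsed as a
// floating-point value and truncated toward zero; Presto-like semantics
// would reject the input instead.
std::optional<int32_t> castStringToIntSparkLike(const std::string& input) {
  try {
    size_t pos = 0;
    double value = std::stod(input, &pos);
    if (pos != input.size()) {
      return std::nullopt; // Trailing garbage, e.g. "123abc".
    }
    if (value < std::numeric_limits<int32_t>::min() ||
        value > std::numeric_limits<int32_t>::max()) {
      return std::nullopt; // Out of range for int32.
    }
    return static_cast<int32_t>(value); // "123.5" -> 123.
  } catch (const std::exception&) {
    return std::nullopt; // Not a number at all.
  }
}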

@mbasmanova @pedroerp, do you have any idea? Thanks!

mbasmanova commented 1 year ago

@PHILO-HE I'm seeing that Presto supports casting decimal to integer:

presto:di> select cast(decimal '123.4' as integer);
 _col0
-------
   123
(1 row)

Hence, we should just extend cast to add such support. CC: @majetideepak @karteekmurthys

Overall, I'd suggest continuing to introduce configuration properties to alter the behavior of special forms like CAST as needed.

PHILO-HE commented 1 year ago

@mbasmanova, thanks for your comment! Let me clarify a bit. The mentioned example is about casting a string that represents a decimal number, not about the decimal type.

If we continue using configuration properties, do we need to cover every if/else branch? See the possible example below, where P1 & P2 are two config properties to be introduced. Or should we just cover the if/else branches we really need in Spark or Presto?

if (P1) {
  ...
  if (P2) {
    ...
  } else {
    ...
  }
} else {
  ...
  if (P2) {
    ...
  } else {
    ...
  }
}
mbasmanova commented 1 year ago

The mentioned example is to cast a string which represents a decimal number, not about decimal type.

Thank you for the clarification. Presto supports casting such a string to decimal. Does Spark support casting it to integer?

presto:di> select cast('123.4' as decimal);
 _col0
-------
 123
(1 row)

presto:di> select cast('123.4' as integer);
Query 20230509_151255_34922_tkqte failed: Cannot cast '123.4' to INT

If we continue using configuration properties, do we need to cover every if/else branch?

I think the answer would depend on specific properties. We'll need to look into this on a case by case basis.

PHILO-HE commented 1 year ago

Thank you for clarification. Presto supports casting such string to decimal. Does Spark support casting to integer?

@mbasmanova, yes, Spark supports casting it to integer; the result is 123 for your example.

PHILO-HE commented 1 year ago

Here is a summary of our findings so far regarding semantic issues in cast functions.

  1. Cast from a string representing a decimal value to int, e.g., cast("1.5" as int), as described above.
  2. Cast from a negative float/double value to boolean, e.g., cast(-0.5f as bool): Spark expects true, Velox produces false.
  3. Cast from string to other types: Spark trims all leading/trailing UTF-8 whitespace (including dozens of characters) before the cast. This applies to almost all target types, but Velox doesn't have such behavior (see the sketch below).
  4. Cast from string to date, e.g., cast("2015/03/18" as date): Spark treats it as a wrong format and expects null as the result, but Velox doesn't and produces a valid result.

I believe more conflicting cases will be uncovered. If more and more config properties are introduced to fix them, it would definitely make the code for the common cast functions too complex.
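To illustrate case 3, a rough sketch of the trimming step (hypothetical helper; it only handles ASCII whitespace/control characters <= 0x20, while Spark's actual rule also covers multi-byte UTF-8 whitespace code points):

#include <string_view>

// Hypothetical helper sketching Spark's behavior of trimming leading/trailing
// whitespace and control characters before casting a string to another type.
std::string_view trimForCastSparkLike(std::string_view input) {
  auto isTrimmable = [](unsigned char c) { return c <= 0x20; };
  size_t begin = 0;
  size_t end = input.size();
  while (begin < end && isTrimmable(input[begin])) {
    ++begin;
  }
  while (end > begin && isTrimmable(input[end - 1])) {
    --end;
  }
  return input.substr(begin, end - begin);
}

// Usage sketch: trimForCastSparkLike("\t\n 123\x0B") returns "123",
// which is then handed to the numeric cast.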

mbasmanova commented 1 year ago

@PHILO-HE How do you propose to move forward? Is there existing documentation that explains all supported casts (with detailed semantics) for Spark? If not, consider creating one.

PHILO-HE commented 1 year ago

Hi @mbasmanova, it seems there is no Spark doc describing the semantics of its cast functions. I will try to summarize one. Thanks!

mbasmanova commented 1 year ago

@PHILO-HE Thanks. Having a doc describing supported CASTs and their semantics would allow us to decide whether to create a Spark-specific CAST implementation or adjust the existing implementation with more flags.

JkSelf commented 1 year ago

@mbasmanova @PHILO-HE When performing cast(2.56 decimal(6, 2) as int) in Velox, the result would be 3, while Spark would return 2. Filed https://github.com/facebookincubator/velox/pull/6208 to support truncating the fraction part of a decimal number when casting decimal to an integer.
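For reference, a tiny sketch of the difference, using a hypothetical unscaled-integer representation of a decimal (2.56 at scale 2 stored as unscaled value 256 with scale factor 100); this is not Velox's decimal implementation:

#include <cstdint>

// Spark-like: drop the fractional digits (truncate toward zero).
int64_t decimalToIntTruncate(int64_t unscaled, int64_t scaleFactor) {
  return unscaled / scaleFactor; // 256 / 100 = 2
}

// Presto-like: round half away from zero.
int64_t decimalToIntRound(int64_t unscaled, int64_t scaleFactor) {
  int64_t quotient = unscaled / scaleFactor;
  int64_t remainder = unscaled % scaleFactor;
  if (remainder < 0) {
    remainder = -remainder;
  }
  if (2 * remainder >= scaleFactor) {
    quotient += (unscaled >= 0) ? 1 : -1;
  }
  return quotient; // For 256/100: remainder 56, 112 >= 100, so 2 + 1 = 3.
}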

pedroerp commented 1 year ago

@mbasmanova @PHILO-HE Given the many corner-case differences in cast behavior between these engines, I wonder if we should provide a nicer way for engines to specialize the cast code instead of relying on flags. Like scalar functions, maybe we could provide a baseline cast implementation (the base CastExpr class) following, say, Presto semantics, and let engines override its specific conversion methods?
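A rough sketch of that idea follows; the class and method names are hypothetical and do not match Velox's actual CastExpr API. The base class implements Presto-like defaults and exposes per-conversion virtual hooks that an engine-specific subclass can override.

#include <cstdint>
#include <optional>
#include <string>

// Hypothetical sketch of engine-specialized cast hooks, for illustration only.
class BaseCast {
 public:
  virtual ~BaseCast() = default;

  // Presto-like default: reject non-integral strings such as "123.4".
  virtual std::optional<int32_t> castStringToInt(const std::string& input) const {
    try {
      size_t pos = 0;
      int32_t result = std::stoi(input, &pos);
      return pos == input.size() ? std::optional<int32_t>(result) : std::nullopt;
    } catch (...) {
      return std::nullopt;
    }
  }
};

class SparkCast : public BaseCast {
 public:
  // Spark-like override: accept a decimal string and truncate it.
  std::optional<int32_t> castStringToInt(const std::string& input) const override {
    try {
      size_t pos = 0;
      double value = std::stod(input, &pos);
      return pos == input.size()
          ? std::optional<int32_t>(static_cast<int32_t>(value))
          : std::nullopt;
    } catch (...) {
      return std::nullopt;
    }
  }
};

With this shape, Spark-specific semantics would live entirely in the subclass, without adding cross-engine config flags for that conversion.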

mbasmanova commented 1 year ago

@pedroerp This might be a good idea. Maybe someone can prototype this to see what it looks like.

jinchengchenghh commented 1 year ago

Found a new semantic conflict: https://github.com/facebookincubator/velox/pull/5307#discussion_r1307094037

PHILO-HE commented 11 months ago

There are a lot of cases listed in the link below demonstrating Presto's cast behavior. Will do a check for Spark. https://github.com/facebookincubator/velox/blob/main/velox/docs/functions/presto/conversion.rst

rui-mo commented 10 months ago

@mbasmanova Opened https://github.com/facebookincubator/velox/pull/7377 for the base layout of Spark cast expr. Could you take a look? Thanks.

rui-mo commented 8 months ago

Below is a table summarizing the differences between the Spark and Presto CAST implementations we have noticed. Some of them have been solved via configs, like cast_to_int_by_truncate and cast_string_to_date_is_iso_8601.

| Conversion | Case | Presto result | Spark result |
|---|---|---|---|
| floating-point -> integral | SELECT cast(12345.67 as bigint); | 12346 | 12345 |
| floating-point -> integral | SELECT cast(127.8 as tinyint); | out of range | 127 |
| floating-point -> integral | SELECT cast(129.9 as tinyint); | out of range | -127 |
| floating-point -> integral | SELECT cast(infinity() as bigint); | out of range | 9223372036854775807 |
| decimal -> integral | SELECT cast(2.56 decimal(6, 2) as integer); | 3 | 2 |
| decimal -> integral | SELECT cast(decimal '300.001' as tinyint); | out of range | 44 |
| string -> integral | SELECT cast('12345.67' as int); | invalid | 12345 |
| string -> decimal | SELECT cast('  1.23  ' as decimal(38, 0)); | invalid | 123 |
| string -> date | SELECT cast('2015' as date); (and some other string patterns supported by Spark) | invalid | 2015-01-01 |
| timestamp -> string | SELECT cast(cast('2015-03-18 12:03:17' as timestamp) as varchar); | 2015-03-18 12:03:17.000 | 2015-03-18 12:03:17 |
| string -> timestamp | SELECT cast('2015-03-18T12:03:17' as timestamp); | invalid | 2015-03-18 12:03:17 |
| string -> timestamp | SELECT cast('0001' as timestamp); (and some other string patterns supported by Spark) | invalid | 0001-01-01 00:00:00 |
| string -> integral/floating-point/decimal/date/timestamp | SELECT cast('\t\n\u001F 123\u000B' as int); (casting string to these types also requires removing leading/trailing UTF-8 whitespace before the cast) | invalid | 123 |