facebookincubator / velox

A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0

Any plan to enhance ANSI compliance of Spark functions? #3869

Open lviiii opened 1 year ago

lviiii commented 1 year ago

Description

We are working on Gluten and found that the ANSI compliance of Spark functions is weak. When we enable ANSI mode and run Spark UTs with the Velox backend, some cases produce results that mismatch vanilla Spark. For example, dividing by zero in vanilla Spark results in an exception, but the result offloaded to Velox depends on the input types; refer to the Divide expression.
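To make the gap concrete, here is a minimal C++ sketch of the two behaviors for integral division, assuming the usual legacy-vs-ANSI semantics; `divideLegacy` and `divideAnsi` are illustrative names, not Velox APIs:

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>

// Legacy (ANSI off): division by zero yields NULL for integral inputs.
std::optional<int64_t> divideLegacy(int64_t a, int64_t b) {
  if (b == 0) {
    return std::nullopt;
  }
  return a / b;
}

// ANSI on: vanilla Spark raises a runtime error instead.
int64_t divideAnsi(int64_t a, int64_t b) {
  if (b == 0) {
    throw std::runtime_error("[DIVIDE_BY_ZERO] Division by zero");
  }
  return a / b;
}
```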

Background: Spark provides an ANSI compliance mode for better data quality and easier migration from traditional DBMSs. Full details of this dialect are documented in Spark ANSI Compliance. When ANSI mode is enabled in Spark,

  1. Spark throws an exception at runtime instead of returning a null result if the inputs to a SQL operator/function are invalid.
  2. Spark follows definite implicit-cast syntax rules in ANSI mode and disallows castings that are invalid in the ANSI SQL standard (see the sketch below).
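For the cast rule, here is a hedged sketch assuming the commonly documented behavior: a failed string-to-int cast returns NULL with ANSI off and raises an error with ANSI on. The helper names are hypothetical, not Velox's actual cast code.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>

// Legacy (ANSI off): an unparseable string-to-int cast returns NULL.
std::optional<int32_t> castToIntLegacy(std::string_view s) {
  int32_t out;
  auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), out);
  if (ec != std::errc{} || ptr != s.data() + s.size()) {
    return std::nullopt;
  }
  return out;
}

// ANSI on: the same invalid input raises an error instead.
int32_t castToIntAnsi(std::string_view s) {
  if (auto v = castToIntLegacy(s)) {
    return *v;
  }
  throw std::runtime_error("[CAST_INVALID_INPUT] '" + std::string(s) +
                           "' cannot be cast to INT");
}
```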

We hope to see some ANSI enhancements in Velox in the future. Is there any plan for this?

mbasmanova commented 1 year ago

CC: @rui-mo Rui, is this something you can comment on?

rui-mo commented 1 year ago

@mbasmanova Hi Masha, if I remember correctly, we once discussed a similar issue in a weekly meeting, and below are some thoughts. The concern is that if Velox supports different configs and behaviors, it can become too complicated. Another option is to tell customers about the limitations, for example, that ANSI ON is not supported. In Gluten, we are adopting the fallback method for now. That is, if ANSI is ON, the computation is not offloaded to Velox and is executed by vanilla Spark instead. Maybe we could discuss here if there is any new idea. For introduction, @lviiii Jiao is from our Gluten team and she is working on expression support.

mbasmanova commented 1 year ago

@rui-mo @lviiii I assume the ANSI compliance question applies to Spark functions only. It should be possible to provide two sets of Spark functions: legacy and ANSI-compliant. If we had to choose between these and provide only one version, I would think providing the ANSI-compliant set of functions is better. What do you think?
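A minimal sketch of how an engine could select between the two packages, assuming a simple registry keyed by function name; this is illustrative only, not Velox's actual registration API:

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

using DivideFn = std::function<std::optional<int64_t>(int64_t, int64_t)>;

// Build the function registry for one package. The caller (e.g. Gluten)
// would pick the package based on the session's spark.sql.ansi.enabled.
std::map<std::string, DivideFn> makeSparkFunctionRegistry(bool ansiEnabled) {
  DivideFn divide;
  if (ansiEnabled) {
    divide = [](int64_t a, int64_t b) -> std::optional<int64_t> {
      if (b == 0) {
        throw std::runtime_error("[DIVIDE_BY_ZERO] Division by zero");
      }
      return a / b;
    };
  } else {
    divide = [](int64_t a, int64_t b) -> std::optional<int64_t> {
      if (b == 0) {
        return std::nullopt;  // legacy: NULL on division by zero
      }
      return a / b;
    };
  }
  return {{"divide", divide}};
}
```

Under this scheme the registry would be built once per session according to the ANSI config, so each query sees a consistent set of semantics.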

rui-mo commented 1 year ago

> @rui-mo @lviiii I assume the ANSI compliance question applies to Spark functions only. It should be possible to provide two sets of Spark functions: legacy and ANSI-compliant. If we had to choose between these and provide only one version, I would think providing the ANSI-compliant set of functions is better. What do you think?

@mbasmanova Masha, thanks for your suggestion. It looks good to me to have legacy and ANSI-compliant Spark functions, and we can call the needed one from Gluten according to the Spark ANSI config. BTW, ANSI is OFF by default in Spark, so I guess legacy behaviors are also important. Another related concern is how to handle ANSI in operators. For example, the behavior when sum overflows in aggregation can be affected. In the long term, maybe we need to implement compliant aggregate functions under Spark SQL. What do you think?

mbasmanova commented 1 year ago

@rui-mo My impression is that ANSI compliance is limited to the function implementations: scalar, aggregate and window.

> For example, the behavior when sum overflows in aggregation can be affected.

Here, the problem is with the SUM aggregate function, not the aggregation operator itself. As you mentioned, we'll need to provide ANSI-compliant packages of Spark functions, which include scalar, aggregate, and possibly window functions.
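As a hedged sketch of the SUM difference, assuming Spark's documented behavior (ANSI mode errors on long overflow; legacy mode wraps like Java long arithmetic) and not actual Velox aggregate code:

```cpp
#include <cstdint>
#include <stdexcept>

// ANSI on: raise an error when the running sum overflows int64.
// __builtin_add_overflow is a GCC/Clang builtin.
int64_t ansiSumStep(int64_t acc, int64_t value) {
  int64_t out;
  if (__builtin_add_overflow(acc, value, &out)) {
    throw std::runtime_error("[ARITHMETIC_OVERFLOW] long overflow");
  }
  return out;
}

// ANSI off: wrap around silently (two's complement), like Java's long.
// Unsigned addition avoids signed-overflow UB in C++.
int64_t legacySumStep(int64_t acc, int64_t value) {
  return static_cast<int64_t>(
      static_cast<uint64_t>(acc) + static_cast<uint64_t>(value));
}
```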

FelixYBW commented 1 year ago

@mbasmanova does Presto have any ANSI compliance issues or plans?

mbasmanova commented 1 year ago

@FelixYBW Binwei, PrestoSQL was designed to be ANSI-compliant out of the box. It follows the SQL specification closely.

FelixYBW commented 1 year ago

In that case we should just use PrestoSQL functions when Spark's ANSI configuration is set.

lviiii commented 1 year ago

I tested the dividing-by-zero case in functions/prestosql. The Presto function returns an "Infinity" value instead of the expected exception, which doesn't meet ANSI compliance. @mbasmanova @FelixYBW
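This is consistent with IEEE 754 semantics: floating-point division by zero is well-defined, so a DOUBLE implementation that maps to native division yields Infinity rather than raising an error. A minimal C++ illustration:

```cpp
#include <cmath>
#include <iostream>

int main() {
  double zero = 0.0;
  double x = 1.0 / zero;  // well-defined under IEEE 754: +Infinity
  std::cout << x << " isinf=" << std::isinf(x) << "\n";  // prints: inf isinf=1
  return 0;
}
```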

mbasmanova commented 1 year ago

> In that case we should just use PrestoSQL functions when Spark's ANSI configuration is set.

@FelixYBW @lviiii I'm afraid the issue is more complicated. Somebody needs to go over each function, read its documentation carefully to understand the semantics completely, play with the function to confirm the semantics, then implement those semantics and add unit tests to ensure the desired behavior. The SQL specification is not 100% complete. Folks use the term "underspecified"; hence, different ANSI-compliant implementations are possible.

lviiii commented 1 year ago

> In that case we should just use PrestoSQL functions when Spark's ANSI configuration is set.

> @FelixYBW @lviiii I'm afraid the issue is more complicated. Somebody needs to go over each function, read its documentation carefully to understand the semantics completely, play with the function to confirm the semantics, then implement those semantics and add unit tests to ensure the desired behavior. The SQL specification is not 100% complete. Folks use the term "underspecified"; hence, different ANSI-compliant implementations are possible.

We are also concerned that complicated ANSI compliance checks may cause performance issues. I think it would be great to have two sets of Spark functions, legacy and ANSI-compliant, with the former as the default, consistent with Spark's legacy behavior.

mbasmanova commented 6 months ago

@FelixYBW @lviiii @rui-mo Folks, wondering if anyone on the Gluten team is working on an ANSI-compliant set of Spark functions. BTW, wondering how this works for Gluten with the ClickHouse backend?

FelixYBW commented 6 months ago

> how does this work for Gluten with the ClickHouse backend?

Not currently. We fall back to vanilla Spark once ANSI is enabled. We haven't planned the work yet because no customer has asked for the feature.