apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

[EPIC] Add Spark expression coverage #240

Open viirya opened 1 month ago

viirya commented 1 month ago

What is the problem the feature request solves?

This is an umbrella ticket for the list of unsupported Spark expressions. This is not necessarily a comprehensive list of all Spark expressions because there are too many. We can start with frequently used expressions.

Hash expressions #205

Datetime expressions

Conditional expressions

Arithmetic expressions

Bitwise expressions

String expressions

Math expressions

Predicates

Null expressions

Aggregate expressions

Others

...

Describe the potential solution

No response

Additional context

No response

comphead commented 1 month ago

The list of Spark expressions can be found at https://spark.apache.org/docs/latest/api/sql/index.html

advancedxy commented 3 weeks ago

As an umbrella issue, if you are going to make a list of frequently used expressions, maybe you can add hash expressions (which I created earlier in #205) as one of the categories.

viirya commented 3 weeks ago

I added hash expressions. Feel free to edit the expression list in the issue description to add more expressions.

comphead commented 3 weeks ago

I was thinking of adding support for Spark's OneRowRelation scan in addition to the Parquet scan. This would allow Comet to be enabled when running queries like

select sqrt(2) from (select 1 union all select 2)

Once that's done, we can just download all the queries from https://spark.apache.org/docs/latest/api/sql/index.html, run them automatically, and see the coverage. How does that sound?
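
For reference, a minimal sketch of what such an automated check could look like. It assumes a SparkSession with Comet already enabled; the `Comet` class-name prefix used to detect Comet operators in the executed plan is an assumption for illustration, not a confirmed Comet API:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of an automated coverage check. Assumes a SparkSession with
// Comet already enabled; detecting Comet operators by a "Comet" class-name
// prefix in the physical plan is an assumption, not a confirmed Comet API.
object ExpressionCoverageCheck {
  def isCometTriggered(spark: SparkSession, sql: String): Boolean = {
    val plan = spark.sql(sql).queryExecution.executedPlan
    // Walk the executed plan and look for any node whose class name starts with "Comet".
    plan.collect {
      case node if node.getClass.getSimpleName.startsWith("Comet") => node
    }.nonEmpty
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    // Example queries scraped from the Spark SQL function docs would go here.
    val queries = Seq("select sqrt(2) from (select 1 union all select 2)")
    queries.foreach { q =>
      println(s"$q -> Comet triggered: ${isCometTriggered(spark, q)}")
    }
    spark.stop()
  }
}
```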

advancedxy commented 3 weeks ago

I was thinking of adding support for Spark's OneRowRelation scan in addition to the Parquet scan.

I am adding RowToColumnar support in #206. Once it's done, I think it's trivial to add support for RDDScanExec (which is the physical plan OneRowRelation is translated to).

we can just download all the queries from https://spark.apache.org/docs/latest/api/sql/index.html, run them automatically, and see the coverage. How does that sound?

That sounds like a great idea.

comphead commented 3 weeks ago

Another potential solution is to transform OneRowRelation to use DataFusion's PlaceholderRowExec. I'll check if it's doable.

advancedxy commented 3 weeks ago

Another potential solution is to transform OneRowRelation to use DataFusion's PlaceholderRowExec

Of course, that would be more performant and straightforward.

viirya commented 3 weeks ago

It sounds good if we can automatically test expression coverage, although I'm not sure if it is easy to do.

comphead commented 3 weeks ago

I have an idea to do that. Planning to create a draft soon. The basic idea is to grab queries from https://spark.apache.org/docs/latest/api/sql/index.html and create a separate task in Comet which will check whether Comet is being triggered.

It will be easier to do if Comet supported OneRowRelation, but even without it there is a workaround. Once all the built-in function queries are done, there should be some HTML with the total results.

advancedxy commented 3 weeks ago

The basic idea is to grab queries from https://spark.apache.org/docs/latest/api/sql/index.html and create a separate task in Comet which will check whether Comet is being triggered.

Hmmm, I think you can get the expression example usage directly from its annotated class. See 'org.apache.spark.sql.expressions.ExpressionInfoSuite' for how to get the examples directly.

comphead commented 3 weeks ago

The basic idea is to grab queries from https://spark.apache.org/docs/latest/api/sql/index.html and create a separate task in Comet which will check whether Comet is being triggered.

Hmmm, I think you can get the expression example usage directly from its annotated class. See 'org.apache.spark.sql.expressions.ExpressionInfoSuite' for how to get the examples directly.

Great idea, I think I'm able to fetch it as is done here: https://github.com/apache/spark/blob/6fdf9c9df545ed50acbce1ec874625baf03d4d2e/sql/core/src/test/scala/org/apache/spark/sql/expressions/ExpressionInfoSuite.scala#L166
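
For illustration, a minimal sketch of pulling the example queries straight from the built-in function registry, along the lines of ExpressionInfoSuite. The object is placed in the org.apache.spark.sql package because sessionState is package-private there; the parsing of the examples text (lines starting with "> ") is an assumption based on the ExpressionDescription docs format:

```scala
package org.apache.spark.sql

// Minimal sketch: enumerate built-in functions and extract the example queries
// from their ExpressionInfo, similar to what ExpressionInfoSuite does.
// Placed in org.apache.spark.sql because SparkSession.sessionState is private[sql].
object BuiltinFunctionExamples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()

    val exampleQueries = spark.sessionState.functionRegistry.listFunction().flatMap { funcId =>
      val info = spark.sessionState.catalog.lookupFunctionInfo(funcId)
      // info.getExamples holds the "Examples:" section of the function's docs,
      // where each example query appears on a line starting with "> " (assumption
      // about the exact format).
      info.getExamples
        .split("\n")
        .map(_.trim)
        .filter(_.startsWith("> "))
        .map(_.stripPrefix("> "))
    }

    exampleQueries.foreach(println)
    spark.stop()
  }
}
```

Each extracted query could then be fed through the coverage check sketched earlier to produce the per-function report.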