apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.02k stars 1.14k forks source link

Integrate with the substrait integration test #10710

Open alamb opened 4 months ago

alamb commented 4 months ago

Is your feature request related to a problem or challenge?

In https://github.com/apache/datafusion/pull/10653, @Blizzara added the beginnings of testing for substrait plans that came from other systems.

@richtia noted https://github.com/apache/datafusion/pull/10653#issuecomment-2136031347

FYI...The substrait project also has a repo that aims to do integration testing between different substrait producers/consumers. https://github.com/substrait-io/consumer-testing

Describe the solution you'd like

I would like to get DataFusion working with that substrait integration test

Describe alternatives you've considered

No response

Additional context

No response

Lordworms commented 4 months ago

take

Lordworms commented 3 months ago

While I was adding this feature, I encountered lots of issues while trying to generate LogicalPlan from substrait https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans

Writing this to list all these issues

alamb commented 3 months ago

Thank you @Lordworms -- indeed it is a great job to find issues when testing

I wonder if there is some way to get the integration test partly working and checked in somewhere (maybe with the failing tests commented out). If you were able to do so, I think we would be able to get many more people to help with fixing the actual gaps

Lordworms commented 3 months ago

Thank you @Lordworms -- indeed it is a great job to find issues when testing

I wonder if there is some way to get the integration test partly working and checked in somewhere (maybe with the failing tests commented out). If you were able to do so, I think we would be able to get many more people to help with fixing the actual gaps

actually, I have made tpch_1 work just now, I'll draft a PR, and hope to get a review from you, Thanks a lot.

alamb commented 3 months ago

I added to the substrait support epic: https://github.com/apache/datafusion/issues/5173

alamb commented 3 months ago

actually, I have made tpch_1 work just now, I'll draft a PR, and hope to get a review from you, Thanks a lot.

Wow -- nice work!

richtia commented 3 months ago

Thanks for looking into these issues @Lordworms! If there are any changes you want to make in the substrait integration test repo itself, I'll gladly do those reviews for you as well!

Lordworms commented 3 months ago

Thanks for looking into these issues @Lordworms! If there are any changes you want to make in the substrait integration test repo itself, I'll gladly do those reviews for you as well!

Hi @richtia , I just find in some confusion in tpch_1_json

in general tpch1, the filter looks like

image

but in this repo, the actual tpch_1 you use is like

image

Is there any purpose of changing the expression?

Thanks a lot

richtia commented 3 months ago

Thanks for looking into these issues @Lordworms! If there are any changes you want to make in the substrait integration test repo itself, I'll gladly do those reviews for you as well!

Hi @richtia , I just find in some confusion in tpch_1_json

in general tpch1, the filter looks like image but in this repo, the actual tpch_1 you use is like image

Is there any purpose of changing the expression?

Thanks a lot

Ahh...these TPCH queries were pulled from somewhere else. If there's a mistake in any, they should definitely be updated! Feel free to file an issue or make the change yourself!

Thanks for the catch.

Blizzara commented 3 months ago

The - interval part does seem aligned with TPH-H 3.0.1 spec (page 29). The 120 should in theory be a random number betwen 60 and 120 but that's unrelevant to this usage.

Lordworms commented 2 months ago

@richtia hi! when I was trying to do plan 7,8 and 9, I find the substrait json file is empty, any reason for this? https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans

richtia commented 2 months ago

@richtia hi! when I was trying to do plan 7,8 and 9, I find the substrait json file is empty, any reason for this? https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans

The plans were generated by Isthmus (https://github.com/substrait-io/substrait-java/tree/main/isthmus) with some minor modifications. I think at the time, Isthmus had trouble generating those 3 plans. So I had just left them blank. I'll create an issue in the repo. Feel free to add them. Otherwise I'll try to get to it some time.

Lordworms commented 2 months ago

@richtia hi! when I was trying to do plan 7,8 and 9, I find the substrait json file is empty, any reason for this? https://github.com/substrait-io/consumer-testing/tree/main/substrait_consumer/tests/integration/queries/tpch_substrait_plans

The plans were generated by Isthmus (https://github.com/substrait-io/substrait-java/tree/main/isthmus) with some minor modifications. I think at the time, Isthmus had trouble generating those 3 plans. So I had just left them blank. I'll create an issue in the repo. Feel free to add them. Otherwise I'll try to get to it some time.

sure, I'll add that later.

Lordworms commented 2 months ago

Hi, @richtia I just wonder how you get the original substrait plan, since I use duckdb to be the producer and generate the plan, it didn't have some structure like

"local_files": {
                                                  "items": [
                                                    {
                                                      "uri_file": "file://FILENAME_PLACEHOLDER_0",
                                                      "parquet": {}
                                                    }
                                                  ]
                                                }
}

I know it is similar to use namedTable here as what duckDB did, but I still want to keep the consistency with previous plan. Thanks a lot.