apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Distribute TPC-H tooling used by DataFusion as Docker Images and GitHub Actions #12730

Open edmondop opened 1 month ago

edmondop commented 1 month ago

Is your feature request related to a problem or challenge?

While working on https://github.com/apache/datafusion-ray/issues/9, I looked at how tests are run in DataFusion and discovered that the TPC-H tooling is used to generate TBL files, which are then converted via Rust code here. Similar code exists under https://github.com/apache/datafusion-benchmarks/tree/main/tpch, but it can't easily be consumed.
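For context on the conversion step: the TPC-H `dbgen` tool writes `.tbl` files as pipe-delimited text with a trailing `|` on every row, so any consumer has to strip that last delimiter before loading the data. The snippet below is only a minimal sketch of that step using the Python standard library. It is not the Rust code referenced above; the function name `tbl_to_csv` and the sample comment text are made up for illustration.

```python
import csv
import io
from typing import List

# Hypothetical sample rows in dbgen's .tbl layout (note the trailing '|').
SAMPLE_REGION_TBL = "0|AFRICA|sample comment|\n1|AMERICA|another comment|\n"


def tbl_to_csv(tbl_text: str, column_names: List[str]) -> str:
    """Convert dbgen-style .tbl text to CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(column_names)
    for line_no, line in enumerate(tbl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("|")
        # The trailing '|' produces one empty field at the end; drop it.
        if fields[-1] == "":
            fields.pop()
        if len(fields) != len(column_names):
            raise ValueError(
                f"line {line_no}: expected {len(column_names)} fields, got {len(fields)}"
            )
        writer.writerow(fields)
    return out.getvalue()
```

Calling `tbl_to_csv(SAMPLE_REGION_TBL, ["r_regionkey", "r_name", "r_comment"])` returns CSV text that most engines can read directly. A shared tool would more likely write Parquet, but the delimiter handling is the same.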

Since there is a growing number of repositories in the DataFusion ecosystem, such as https://github.com/apache/datafusion-python, https://github.com/apache/datafusion-ray, https://github.com/apache/datafusion and https://github.com/apache/datafusion-comet, it would be ideal if this tooling were easier for the different projects to consume.

Describe the solution you'd like

Ideally, this tooling should be maintained in only one place, and the other repositories should consume it from there. I have no clear view on where we should maintain it (maybe within core DataFusion itself), but I think we should publish it and make it available downstream as Docker images and GitHub Actions.

Describe alternatives you've considered

No response

Additional context

No response

austin362667 commented 1 month ago

Thank you @edmondop, I like this idea!

andygrove commented 1 month ago

I created the datafusion-benchmarks repository with the intent of having common benchmarking tooling that can be shared across all of the DataFusion projects.

I am now getting started on the next phase of this project, which is to have automated benchmarks running against larger datasets on a nightly basis (and also on PRs) so that we can catch regressions early. Setting up k8s YAML and Docker images will be part of this effort.

edmondop commented 1 month ago

Thank you. I guess one option is to rename the effort to datafusion-test-utils? As we have seen in the Ray code, this can be useful not just for benchmarks but also for general testing.