apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Add planning benchmarks with parquet and sortedness #13098

Open alamb opened 5 hours ago

alamb commented 5 hours ago

Is your feature request related to a problem or challenge?

@mnorfolk03 added a planning benchmark for more sophisticated queries in https://github.com/apache/datafusion/pull/13085 ❤️

The benchmarks are in https://github.com/apache/datafusion/blob/main/datafusion/core/benches/sql_planner.rs

However, the planning benchmarks we have now don't reflect querying an actual data source such as parquet (they query an empty in-memory table).

One way to improve them further would be to add benchmarks that plan against a ParquetExec, as well as queries over sorted inputs, to reflect more real-world cases.

Describe the solution you'd like

I would like some planning benchmarks equivalent to planning against tables like these (docs here): https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table

CREATE EXTERNAL TABLE foo STORED AS PARQUET LOCATION '..';

CREATE EXTERNAL TABLE test (
    c1  VARCHAR NOT NULL,
    c2  INT NOT NULL,
    c3  SMALLINT NOT NULL,
    c4  SMALLINT NOT NULL,
    c5  INT NOT NULL,
    c6  BIGINT NOT NULL,
    c7  SMALLINT NOT NULL,
    c8  INT NOT NULL,
    c9  BIGINT NOT NULL,
    c10 VARCHAR NOT NULL,
    c11 FLOAT NOT NULL,
    c12 DOUBLE NOT NULL,
    c13 VARCHAR NOT NULL
)
STORED AS CSV
WITH ORDER (c2 ASC, c5 + c8 DESC NULLS FIRST)
LOCATION '/path/to/aggregate_test_100.csv'
OPTIONS ('has_header' 'true');
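
The two statements above could be combined into a single sorted parquet table, roughly as sketched below. The table name, the column subset, and the file path are hypothetical placeholders, and the sketch assumes the DataFusion version in use accepts explicit column definitions together with WITH ORDER for parquet files.

```sql
-- Hypothetical table: the name, columns, and path are placeholders.
CREATE EXTERNAL TABLE sorted_parquet (
    c2 INT NOT NULL,
    c5 INT NOT NULL,
    c8 INT NOT NULL
)
STORED AS PARQUET
WITH ORDER (c2 ASC)
LOCATION '/path/to/sorted.parquet';

-- EXPLAIN plans the query without executing it, which is the part a
-- planning benchmark would measure.
EXPLAIN SELECT c2, c5 FROM sorted_parquet ORDER BY c2 ASC LIMIT 10;
```

A query whose ORDER BY matches the declared order is the kind of case where the planner may be able to take advantage of sortedness, so it seems like a reasonable candidate for the benchmark.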

Describe alternatives you've considered

One possibility could be to add a benchmark for planning the clickbench queries: https://github.com/apache/datafusion/tree/main/benchmarks/queries/clickbench

We could use the smaller hits.parquet file here: https://github.com/apache/datafusion/blob/main/datafusion/core/tests/data/clickbench_hits_10.parquet
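
As a rough sketch of that alternative, the small file could be registered under the table name the ClickBench queries expect and a query planned with EXPLAIN. This assumes the statements are run from the repository root, so the relative path is an assumption taken from the URL above.

```sql
-- Assumes the working directory is the root of the datafusion repository.
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'datafusion/core/tests/data/clickbench_hits_10.parquet';

-- The first ClickBench query, planned but not executed.
EXPLAIN SELECT COUNT(*) FROM hits;
```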

Additional context

No response

Omega359 commented 3 hours ago

take