ClickHouse / spark-clickhouse-connector

Spark ClickHouse Connector built on the DataSourceV2 API
https://clickhouse.com/docs/en/integrations/apache-spark
Apache License 2.0

too many tasks when I read a distributed partitioned table using partition key filter #225

Open ScalaFirst opened 1 year ago

ScalaFirst commented 1 year ago

dependency:

com.github.housepower:clickhouse-spark-runtime-3.2_2.12:0.5.0
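
For context, a minimal sketch of how this dependency is typically wired into a Spark session as a catalog. The catalog class name and connection option keys below are assumptions based on the connector's documentation for the 0.5.x line, not details taken from this issue:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session setup; catalog class and option keys are assumptions.
val spark = SparkSession.builder()
  .appName("clickhouse-partition-filter-repro")
  .config("spark.sql.catalog.clickhouse", "xenon.clickhouse.ClickHouseCatalog")
  .config("spark.sql.catalog.clickhouse.host", "127.0.0.1") // placeholder host
  .config("spark.sql.catalog.clickhouse.user", "default")
  .config("spark.sql.catalog.clickhouse.password", "")
  .getOrCreate()
```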

My SQL is: select * from xxx where dt = '2023-03-09' (dt is my partition key). The table has no data for that date, so I expected the query to complete quickly, but it does not. The pushed-filter message is:

Pushing operators to label_platform.ch_label_crowd_export Pushed Filters: EqualTo(dt,2023-03-09) Post-Scan Filters:

and the total task count is 1124. For best performance I think it should be a single task (or none at all), because there is no data for dt = '2023-03-09'.
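
A minimal way to reproduce this and inspect both the pushed filter and the task count, assuming the connector is registered under a catalog named clickhouse as in the sketch above (the table and date mirror the ones in this report):

```scala
// Query the distributed partitioned table through the connector catalog.
val df = spark.sql(
  """SELECT * FROM clickhouse.label_platform.ch_label_crowd_export
    |WHERE dt = '2023-03-09'""".stripMargin)

// The physical plan shows the pushed-down filter, e.g. EqualTo(dt,2023-03-09).
df.explain("formatted")

// The number of RDD partitions equals the number of tasks Spark will launch;
// the report observes 1124 here even though the dt partition is empty.
println(df.rdd.getNumPartitions)
```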

pan3793 commented 1 year ago

That's a good point. We can collect more metrics during the planning phase and eliminate task assignments for partitions that do not contain any data.
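
A rough sketch of what that could look like on the planning side: before assigning input partitions, ask the ClickHouse server which table partitions actually contain rows and skip the rest. Everything below (the helper name, the JDBC access path) is hypothetical and only illustrates the idea, not the connector's actual planner code:

```scala
import java.sql.DriverManager

// Hypothetical helper: return the set of non-empty partition IDs for a table
// by consulting system.parts on the ClickHouse server (illustrative only;
// requires a ClickHouse JDBC driver on the classpath).
def nonEmptyPartitions(jdbcUrl: String, database: String, table: String): Set[String] = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    val stmt = conn.prepareStatement(
      """SELECT partition
        |FROM system.parts
        |WHERE database = ? AND table = ? AND active
        |GROUP BY partition
        |HAVING sum(rows) > 0""".stripMargin)
    stmt.setString(1, database)
    stmt.setString(2, table)
    val rs = stmt.executeQuery()
    val builder = Set.newBuilder[String]
    while (rs.next()) builder += rs.getString(1)
    builder.result()
  } finally conn.close()
}

// During planning, the scan could then drop any partition whose value
// (e.g. dt = '2023-03-09') is absent from this set, so an empty partition
// would yield zero tasks instead of 1124.
```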