apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.6k stars 1.05k forks source link

Return TableProviderFilterPushDown::Exact when Parquet Pushdown Enabled #4028

Open tustvold opened 1 year ago

tustvold commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] (This section helps Arrow developers understand the context and why for this feature, in addition to the what)

Currently even when parquet predicate pushdown is enabled, and the predicate can be fully pushed down, the physical plan still contains a FilterExec when using ListingTable

| physical_plan | ProjectionExec: expr=[service@0 as service, host@1 as host, pod@2 as pod, container@3 as container, image@4 as image, time@5 as time, client_addr@6 as client_addr, request_duration_ns@7 as request_duration_ns, request_user_agent@8 as request_user_agent, request_method@9 as request_method, request_host@10 as request_host, request_bytes@11 as request_bytes, response_bytes@12 as response_bytes, response_status@13 as response_status]                            |
|               |   CoalesceBatchesExec: target_batch_size=4096                                                                                                                                                                                                                                                                                                                                                                                                                                |
|               |     FilterExec: container@3 = backend_container_0 OR pod@2 = aqcathnxqsphdhgjtgvxsfyiwbmhlmg                                                                                                                                                                                                                                                                                                                                                                                 |
|               |       RepartitionExec: partitioning=RoundRobinBatch(8)                                                                                                                                                                                                                                                                                                                                                                                                                       |
|               |         ParquetExec: limit=None, partitions=[home/raphael/Downloads/data.parquet], predicate=container_min@0 <= backend_container_0 AND backend_container_0 <= container_max@1 OR pod_min@2 <= aqcathnxqsphdhgjtgvxsfyiwbmhlmg AND aqcathnxqsphdhgjtgvxsfyiwbmhlmg <= pod_max@3, projection=[service, host, pod, container, image, time, client_addr, request_duration_ns, request_user_agent, request_method, request_host, request_bytes, response_bytes, response_status] |
|               |        

Describe the solution you'd like

ListingTable::supports_filter_pushdown should return TableProviderFilterPushDown::Exact when

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.

alamb commented 1 year ago

This is a good proposal I think -- it would skip unecessary filtering and likely make plans faster. It will become more useful (maybe even necessary) when predicate pushdown is enabled by default -- #3463

jychen7 commented 1 year ago

I think it is a good idea. I confirm it would help improve performance by 30x in https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1501221831.


I am thinking to pick this up but seems the work is not as trivial as I thought. Following are my two cents' thoughts.

Related code to improve is following https://github.com/apache/arrow-datafusion/blob/7545177001b9dc04951f0c1c2008509f3895de8e/datafusion/core/src/datasource/listing/table.rs#L736-L756

  1. The FileFormat is parquet

This is trivial, we can match self.options.format with ParquetFormat

  1. Parquet predicate pushdown is enabled

This setting could come from ParquetExec.pushdown_filters or ConfigOptions (more specifically, ConfigOptions.execution.parquet.pushdown_filters)

https://github.com/apache/arrow-datafusion/blob/7545177001b9dc04951f0c1c2008509f3895de8e/datafusion/core/src/physical_plan/file_format/parquet.rs#L393

However, when ListingTable.supports_filter_pushdown is called, ParquetExec is not created yet, nor does SessionState being passed as input.

Idea: shall we add state: &SessionState as input for ListingTable.supports_filter_pushdown, the same as ListingTable.scan

https://github.com/apache/arrow-datafusion/blob/7545177001b9dc04951f0c1c2008509f3895de8e/datafusion/core/src/datasource/listing/table.rs#L668-L670

  1. The predicate is fully pushed down by ParquetExec (not all predicates are supported)

is it making sure the row_filter is built from the predicate without error? If so, seems it requires converting a logical Expr to PhysicalExpr first.

https://github.com/apache/arrow-datafusion/blob/7545177001b9dc04951f0c1c2008509f3895de8e/datafusion/core/src/physical_plan/file_format/parquet.rs#L527-L547

tustvold commented 1 year ago

I wonder if it might be possible to always return exact for Parquet files, and to just manually insert a FilterExec for any predicates that can't be pushed down

Tagging @alamb and @crepererum who are more familiar with this part of the codebase

alamb commented 1 year ago

I'll try and look into this in more detail tomorrow

crepererum commented 1 year ago

I wonder if it might be possible to always return exact for Parquet files, and to just manually insert a FilterExec for any predicates that can't be pushed down

That's what we want to do for InfluxDB IOx as well: https://github.com/influxdata/influxdb_iox/issues/7408

alamb commented 3 weeks ago

I am going to find time sometime this week