DatasourceV2 does not prune columns after V2ScanRelationPushDown

apache / iceberg

Apache Iceberg

https://iceberg.apache.org/

Apache License 2.0

6.16k stars 2.14k forks source link

DatasourceV2 does not prune columns after V2ScanRelationPushDown #9268

Open akshayakp97 opened 9 months ago

akshayakp97 commented 9 months ago

Query engine

Query Engine: Spark 3.5.0 Apache Iceberg: 1.4.2

Question

Hi,

My understanding is that Spark Optimizer can add new Project operator even after V2 Relation was created. For example, it looks like ColumnPruning optimizer rule triggers after V2ScanRelationPushDown here.

If that's the case, then it would be expected that the columns projected by the newly added Project operator would prune the schema (for ex, ,like how V2ScanRelationPushDown#pruneColumns does). But, I don't see schema pruning happening after V2ScanRelationPushDown for DatasourceV2. However, for DatasourceV1, I can see schema being pruned in FileSourceStrategy#apply method before FileSourceScanExec physical node is created.

I don't see a similar logic in DataSourceV2Strategy to prune the relation's schema with the latest Attribute's from Project's and Filter's before BatchScanExec is created.

Is there a known gap with DataSourceV2?

Thanks in advance!

rdblue commented 9 months ago

I don't think I'm following the logic here. Is there a case where you're not seeing columns being properly pruned?

akshayakp97 commented 9 months ago

Thanks for your response.

I am looking at TPCDS q16 physical plan for Iceberg on EMR.

Link to q16 - https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q16.sql

The physical plan looks like - https://gist.github.com/akshayakp97/102715c66eee44bc6f72493f427528f8

Line 46 projects only two columns from Project [cs_warehouse_sk#54840, cs_order_number#54843L], however it looks like Iceberg is scanning all columns for the catalog_sales table in Line 47.

Upon further digging, I found out that ColumnPruning rule adds the new Project [cs_warehouse_sk#54840, cs_order_number#54843L] operator, but we still see all columns read by the corresponding BatchScanExec.

akshayakp97 commented 9 months ago

After ColumnPruning adds the new Project [cs_warehouse_sk#54840, cs_order_number#54843L], when V2ScanRelationPushDown rule triggers, it doesn't match the ScanOperation pattern, because, the scan operator for catalog_sales in the logical plan seems to have been updated in the previous iterations of V2ScanRelationPushDown rule - which resulted in the conversion of ScanBuilderHolder to DataSourceV2ScanRelation. To summarize, in this case, it looks like a new Project was added after the creation of DataSourceV2ScanRelation in the logical plan, causing it not prune columns.

akshayakp97 commented 9 months ago

In general, if a Project is added after the execution of V2ScanRelationPushDown rule - how do the columns get pruned? Or, do we not expect any new Project's?

rdblue commented 9 months ago

@aokolnychyi are you aware of this issue? It looks like some additional pruning may be done after pushdown happens?