akshayakp97 opened 9 months ago
I don't think I'm following the logic here. Is there a case where you're not seeing columns being properly pruned?
Thanks for your response.

I am looking at the TPC-DS q16 physical plan for Iceberg on EMR.

Link to q16: https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q16.sql
The physical plan: https://gist.github.com/akshayakp97/102715c66eee44bc6f72493f427528f8

Line 46 projects only two columns, `Project [cs_warehouse_sk#54840, cs_order_number#54843L]`; however, it looks like Iceberg is scanning all columns of the `catalog_sales` table on line 47.
Upon further digging, I found that the `ColumnPruning` rule adds the new `Project [cs_warehouse_sk#54840, cs_order_number#54843L]` operator, but we still see all columns read by the corresponding `BatchScanExec`. After `ColumnPruning` adds the new `Project`, when the `V2ScanRelationPushDown` rule triggers it doesn't match the `ScanOperation` pattern, because the scan operator for `catalog_sales` in the logical plan was already rewritten by an earlier run of `V2ScanRelationPushDown`, which converted the `ScanBuilderHolder` into a `DataSourceV2ScanRelation`.
To summarize: in this case a new `Project` was added after the `DataSourceV2ScanRelation` had already been created in the logical plan, so its columns were never pruned from the scan. In general, if a `Project` is added after `V2ScanRelationPushDown` has run, how do the columns get pruned? Or do we not expect any new `Project`s at that point?
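To make the sequence above concrete, here is a minimal sketch in plain Python. These are toy stand-ins, not Spark's actual classes or rule logic: the pushdown rule only prunes while it still sees a `ScanBuilderHolder`, so a `Project` added on top of an already-converted `DataSourceV2ScanRelation` never reaches the scan.

```python
from dataclasses import dataclass


@dataclass
class ScanBuilderHolder:
    """Toy stand-in for the pre-pushdown holder that still accepts pruning."""
    columns: list


@dataclass
class DataSourceV2ScanRelation:
    """Toy stand-in for the relation V2ScanRelationPushDown leaves behind."""
    columns: list


@dataclass
class Project:
    columns: list
    child: object


def v2_scan_relation_push_down(plan):
    # Case 1: Project over a holder -> push the projection into the scan.
    if isinstance(plan, Project) and isinstance(plan.child, ScanBuilderHolder):
        return DataSourceV2ScanRelation(columns=plan.columns)
    # Case 2: a bare holder -> convert it, reading every column.
    if isinstance(plan, ScanBuilderHolder):
        return DataSourceV2ScanRelation(columns=plan.columns)
    # Anything else (e.g. Project over an already-converted relation)
    # does not match, so no further pruning happens.
    return plan


all_columns = ["cs_warehouse_sk", "cs_order_number",
               "cs_ship_date_sk", "cs_net_profit"]

# First optimizer pass: no Project above the scan yet, so the holder is
# converted while still reading the full column list.
scan = v2_scan_relation_push_down(ScanBuilderHolder(columns=all_columns))

# Later, ColumnPruning adds a Project on top of the converted relation.
plan = Project(columns=["cs_warehouse_sk", "cs_order_number"], child=scan)

# Re-running the pushdown rule is now a no-op: the scan still reads
# all four columns even though only two are projected.
plan = v2_scan_relation_push_down(plan)
print(type(plan.child).__name__, plan.child.columns)
```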
@aokolnychyi are you aware of this issue? It looks like some additional pruning may be done after pushdown happens?
Query engine

Spark: 3.5.0
Apache Iceberg: 1.4.2
Question
Hi,
My understanding is that the Spark optimizer can add a new `Project` operator even after the V2 relation has been created. For example, it looks like the `ColumnPruning` optimizer rule triggers after `V2ScanRelationPushDown` here. If that's the case, I would expect the columns projected by the newly added `Project` operator to prune the scan's schema (for example, like `V2ScanRelationPushDown#pruneColumns` does). But I don't see schema pruning happening after `V2ScanRelationPushDown` for DataSource V2. For DataSource V1, however, I can see the schema being pruned in the `FileSourceStrategy#apply` method before the `FileSourceScanExec` physical node is created. I don't see similar logic in `DataSourceV2Strategy` to prune the relation's schema against the latest `Attribute`s from the `Project`s and `Filter`s above it before `BatchScanExec` is created.

Is there a known gap with `DataSourceV2`?

Thanks in advance!
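For contrast, here is a rough, hypothetical sketch in plain Python (not Spark's code) of the DSv1 behavior mentioned above: at physical planning time the strategy recomputes the required columns from the `Project`s and `Filter`s sitting above the relation, in the spirit of `FileSourceStrategy#apply`, so even a late-added `Project` still prunes the read schema.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    columns: list


@dataclass
class Filter:
    references: list  # attributes the filter condition needs
    child: object


@dataclass
class Project:
    columns: list
    child: object


def file_source_strategy(plan):
    """Walk down to the relation, collecting every attribute referenced by
    the Projects and Filters above it, and read only those columns."""
    required, node = set(), plan
    while not isinstance(node, Relation):
        if isinstance(node, Project):
            required |= set(node.columns)
        elif isinstance(node, Filter):
            required |= set(node.references)
        node = node.child
    # The read schema is pruned at physical planning time, no matter how
    # late the Project above the relation was added.
    return sorted(required & set(node.columns))


plan = Project(
    ["cs_warehouse_sk", "cs_order_number"],
    Filter(["cs_ship_date_sk"],
           Relation(["cs_warehouse_sk", "cs_order_number",
                     "cs_ship_date_sk", "cs_net_profit"])),
)
print(file_source_strategy(plan))
# ['cs_order_number', 'cs_ship_date_sk', 'cs_warehouse_sk']
```

Only three of the relation's four columns survive; `cs_net_profit` is never referenced above the relation, so it is dropped from the scan.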