cccs-jc opened this issue 3 months ago
Thanks for the repro steps. I did some debugging with them, and it seems this has to do with vectorized reads not being supported for nested fields. https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatch.java#L154 If you trace from here, you'll see that if there's a nested type, we don't perform a vectorized read, and we end up surfacing to Spark that columnar execution is not supported.
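To make the described decision concrete, here is a hypothetical plain-Python sketch (this is not the actual Iceberg Java code, and `supports_vectorized_read` is an invented name) of the kind of check being traced from `SparkBatch`: if any projected field is a nested type, the vectorized (columnar) read path is disabled for the whole scan.

```python
# Hypothetical sketch of the eligibility check described above.
# Schemas are modeled as (name, type_name) pairs for illustration only.

def supports_vectorized_read(schema):
    """Return True only if no projected field is a nested type."""
    NESTED = ("struct", "list", "map")
    return all(field_type not in NESTED for _, field_type in schema)

# A flat schema qualifies for columnar execution...
flat = [("id", "long"), ("name", "string")]
# ...but a single nested struct disables it for the whole scan.
nested = [("id", "long"), ("point", "struct")]

print(supports_vectorized_read(flat))    # True
print(supports_vectorized_read(nested))  # False
```

The all-or-nothing shape of the check is what makes the behavior surprising: one nested column is enough to push the entire read onto the non-columnar path.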
It certainly seems like this is an area for improvement, although I don't recall the context as to why it's not supported today.
It isn't supported because we basically copied the Spark vectorized read code, and at the time we copied it, nested structs weren't supported.
Ha okay, so there is probably no show-stopper to replicating what has been done in the Spark vectorized read since then.
Are there plans to update the Iceberg SparkBatch soon?
@RussellSpitzer Is this something scheduled to be implemented in the next version of Iceberg?
Feature Request / Improvement
Prior to the introduction of the ColumnarToRow rule in Spark 3.0.0, columnar data was converted into Spark's internal rows using generated code that copied data from ColumnarBatch vectors into an internal row.
In Spark 3.0.0, an optimization was introduced that provides a row iterator which retrieves values directly from the ColumnarBatch vectors, thus avoiding the copy.
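The difference between the two conversion strategies can be sketched in plain Python (the class and method names here are illustrative, not Spark's actual API): the old path materializes a fresh row object per batch row, while the optimized path hands out lightweight views that read values straight from the column vectors on access.

```python
# Illustrative sketch of copy-based vs. view-based row access over a
# columnar batch. Names (ColumnarBatch, RowView, copy_rows, row_iterator)
# are assumptions for this example, not Spark classes.

class RowView:
    """Lazy row: values are fetched from the column vectors only on access."""
    __slots__ = ("_columns", "_i")

    def __init__(self, columns, i):
        self._columns, self._i = columns, i

    def get(self, ordinal):
        return self._columns[ordinal][self._i]

class ColumnarBatch:
    def __init__(self, columns):
        self.columns = columns              # one list ("vector") per column
        self.num_rows = len(columns[0])

    def copy_rows(self):
        # Pre-3.0 style: copy every value into a new row object.
        return [tuple(col[i] for col in self.columns)
                for i in range(self.num_rows)]

    def row_iterator(self):
        # Post-3.0 ColumnarToRow style: yield views, no per-value copy.
        for i in range(self.num_rows):
            yield RowView(self.columns, i)

batch = ColumnarBatch([[1, 2, 3], ["a", "b", "c"]])
print(batch.copy_rows())                    # [(1, 'a'), (2, 'b'), (3, 'c')]
views = list(batch.row_iterator())
print(views[1].get(0), views[1].get(1))     # 2 b
```

Skipping the per-row copy is where the speedup reported below comes from, which is why losing the optimization on nested-field filters is so costly.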
This optimization is not always used when reading from Iceberg tables, specifically when filters are applied to nested structures. When reading a plain Spark table, the optimization is applied. When using Iceberg, the optimization is also applied; however, it is not applied if you put a filter on a sub-field of a nested struct. The code below reproduces the issue.
The query "iceberg sub filter" does not use the ColumnarToRow optimization and it takes much longer to execute.
took: 75.11750912666321 seconds
== Physical Plan ==
I'm using Spark 3.5.0 and Iceberg 1.5
Query engine
Spark