apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0
2.45k stars 961 forks

[spark] Support nested col pruning #4269

Closed Zouxxyy closed 1 month ago

Zouxxyy commented 2 months ago

Purpose

Related to #4209. Support nested column pruning, e.g.

CREATE TABLE students (
    name STRING,
    age INT,
    course STRUCT<course_name: STRING, grade: DOUBLE>
) USING paimon;
SELECT course.grade FROM students;

will read only course.grade from the columnar storage format (Parquet, ORC), instead of the whole course struct.
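One way to check the effect, as a rough sketch: ask Spark for the query plan and look at the schema the scan node reports. This assumes the students table above already exists in a Spark session with the Paimon catalog configured; the exact plan text depends on the Spark and Paimon versions, so no expected output is shown.

```sql
-- Print the plans for the nested-field query.
-- With nested column pruning in effect, the scan should request only
-- course.grade rather than the full course struct (the exact wording of
-- the plan output varies by version).
EXPLAIN EXTENDED
SELECT course.grade FROM students;
```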

Tests

API and Format

Documentation

Zouxxyy commented 1 month ago

Is there a configuration to disable nested projection? I am concerned about bugs in nested projection; at the least, we should have an option to disable it.

Yes, Spark has a conf that enables or disables nested schema pruning, spark.sql.optimizer.nestedSchemaPruning.enabled:

  val NESTED_SCHEMA_PRUNING_ENABLED =
    buildConf("spark.sql.optimizer.nestedSchemaPruning.enabled")
      .internal()
      .doc("Prune nested fields from a logical relation's output which are unnecessary in " +
        "satisfying a query. This optimization allows columnar file format readers to avoid " +
        "reading unnecessary nested column data. Currently Parquet and ORC are the " +
        "data sources that implement this optimization.")
      .version("2.4.1")
      .booleanConf
      .createWithDefault(true)
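For example, a session could turn the optimization off like this. This is a minimal sketch using the conf key quoted above; it assumes the Paimon Spark reader honors this Spark setting, as the reply suggests, and that the students table from the PR description exists.

```sql
-- Disable nested schema pruning for the current session only.
SET spark.sql.optimizer.nestedSchemaPruning.enabled=false;

-- With pruning disabled, this query is expected to read the whole
-- course struct and extract grade afterwards.
SELECT course.grade FROM students;

-- Restore the default.
SET spark.sql.optimizer.nestedSchemaPruning.enabled=true;
```

The conf is marked internal in Spark, so it does not appear in the public configuration docs, but it can still be set per session or through --conf at submit time.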