Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Can't select table if the corresponding column is dropped after replacing or dropping a partition spec field #11314

Open bknbkn opened 1 month ago

bknbkn commented 1 month ago

Apache Iceberg version

Master branch

Query engine

None

Please describe the bug 🐞

If we replace or drop a partition spec field and then drop the corresponding column, we can no longer select from the table:

Caused by: java.lang.NullPointerException: Type cannot be null
at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:921)
at org.apache.iceberg.types.Types$NestedField.<init>(Types.java:446)
at org.apache.iceberg.types.Types$NestedField.optional(Types.java:415)
at org.apache.iceberg.PartitionSpec.partitionType(PartitionSpec.java:141)
at org.apache.iceberg.Partitioning.buildPartitionProjectionType(Partitioning.java:273)
at org.apache.iceberg.Partitioning.partitionType(Partitioning.java:241)
at org.apache.iceberg.Partitioning.partitionType(Partitioning.java:237)
at org.apache.iceberg.spark.source.SparkTable.metadataColumns(SparkTable.java:249)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation.metadataOutput$lzycompute(DataSourceV2Relation.scala:61)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation.metadataOutput(DataSourceV2Relation.scala:51)
at org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias.metadataOutput(basicLogicalOperators.scala:1339)
at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$3(Analyzer.scala:960)
at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$3$adapted(Analyzer.scala:960)
at scala.collection.Iterator.exists(Iterator.scala:969)

It can easily be reproduced by adding sql("SELECT * FROM %s", tableName); in TestAlterTablePartitionFields.testDropColumnOfOldPartitionFieldV1 or TestAlterTablePartitionFields.testDropColumnOfOldPartitionFieldV2.
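For reference, a standalone reproduction via Spark SQL might look like the following sketch (the table and column names are illustrative, not taken from the tests, and assume a Spark session with the Iceberg SQL extensions enabled):

```sql
-- Hypothetical repro: partition by a column, drop the partition field,
-- then drop the source column itself
CREATE TABLE db.sample (id BIGINT, ts TIMESTAMP) USING iceberg
PARTITIONED BY (days(ts));

ALTER TABLE db.sample DROP PARTITION FIELD days(ts);
ALTER TABLE db.sample DROP COLUMN ts;

-- The historical spec still references ts, which is gone from the latest
-- schema, so reading the table fails with "Type cannot be null"
SELECT * FROM db.sample;
```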


bknbkn commented 1 month ago

The cause seems to be that every spec is bound against the latest schema, so historical specs may reference fields that no longer exist in that schema.

I think it is necessary to persist the schema id of each spec in metadata.json. Based on that, each PartitionSpec can be bound against its own schema when it is rebuilt.
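As a sketch of the proposal, each entry in partition-specs could carry the id of the schema it was created against. Note the per-spec "schema-id" key below is hypothetical (today specs persist only "spec-id" and "fields"), and the schema entries are abbreviated:

```json
{
  "current-schema-id": 1,
  "schemas": [
    { "schema-id": 0, "fields": [ { "id": 1, "name": "id" }, { "id": 2, "name": "ts" } ] },
    { "schema-id": 1, "fields": [ { "id": 1, "name": "id" } ] }
  ],
  "partition-specs": [
    {
      "spec-id": 0,
      "schema-id": 0,
      "fields": [
        { "name": "ts_day", "transform": "day", "source-id": 2, "field-id": 1000 }
      ]
    }
  ]
}
```

With this, resolving spec 0 would look up schema 0 (where source-id 2 still exists) instead of the current schema, avoiding the NullPointerException above.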