apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

Forward incompatible types introduced when writing Iceberg data #887

Open syun64 opened 6 days ago

syun64 commented 6 days ago

Apache Iceberg version

None

Please describe the bug 🐞

https://github.com/apache/iceberg-python/pull/807 introduced large_* types into the Parquet files that PyIceberg writes. Earlier versions of PyIceberg cannot read these files and fail with: TypeError: Unsupported type: large_string

Although the Parquet types are the same, there must be an encoding detail in the file that instructs pyarrow to read these columns back as large_* types.

Therefore, instead of defaulting to large_* types, we should default the types to small types on write.

kevinjqliu commented 6 days ago

So the current version of PyIceberg can write Parquet files with the large_string data type, but older versions of PyIceberg cannot read those files.

I feel like this is a library versioning problem, and it's OK to not be backwards compatible, especially before the 1.0 version.

My opinion is that we should support both the string and large_string data types. If supporting large_string means the library won't be backwards compatible, that is OK.