Open syun64 opened 6 days ago
So the current version of pyiceberg can write parquet files with the large_string data type, but older versions of pyiceberg cannot read parquet files that use the large_string data type.
I feel like this is a library versioning problem, and it's OK to not be backwards compatible, especially before the 1.0 version.
My opinion is that we should be able to support both string and large_string data types. And if supporting large_string type means the library won't be backwards compatible, that is ok.
Apache Iceberg version
None
Please describe the bug 🐞
With https://github.com/apache/iceberg-python/pull/807 we have introduced large_* types in the parquet files we write, and these files cannot be read using an earlier version of PyIceberg:
TypeError: Unsupported type: large_string
Although the parquet types are the same, there must be an encoding detail that instructs pyarrow to read these columns back as large_* types.
Therefore, instead of defaulting to large_* types, we should default to the small types (for example string instead of large_string) on write.