apache / incubator-xtable

Apache XTable (incubating) is a cross-table converter for lakehouse table formats that facilitates interoperability across data processing systems and query engines.
https://xtable.apache.org/
Apache License 2.0

About the support for Schema Evolution (column rename) #282

Open huan233usc opened 11 months ago

huan233usc commented 11 months ago

Hi there, I tested the onetable tool: I created an Iceberg table and ran a column rename.

spark-sql> CREATE TABLE hadoop_prod.repro_rename ( id bigint, data string) using iceberg;
Response code
Time taken: 2.269 seconds
spark-sql> insert into hadoop_prod.repro_rename values (1, "abc");
Response code
Time taken: 8.093 seconds
spark-sql> select * from hadoop_prod.repro_rename;
id      data
1       abc
spark-sql> alter table hadoop_prod.repro_rename rename column id to new_id ;
Time taken: 2.255 seconds
spark-sql> select * from hadoop_prod.repro_rename;
new_id  data
1       abc

I then ran onetable to convert the Iceberg table to Hudi and Delta, but the column rename doesn't seem to be captured in the converted metadata.

Issue 1: the converted tables still expose the old schema (the column is still named id instead of new_id).

HUDI:

>>> df = spark.read.format("hudi").options(**hudi_options).load("MY_PATH/repro_rename")
>>> df.show(truncate=False)
+---+----+                                                                      
|id |data|
+---+----+
|1  |abc |
+---+----+

DELTA:

spark-sql> select * FROM delta.`MY_PATH/repro_rename` ;
id      data
1       abc

Issue 2: Iceberg's column rename is built on top of field IDs, but I don't see any Delta/Hudi equivalent included in the converted metadata.
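To make the dependency on field IDs concrete, here is a toy Python model. It is not Iceberg's or onetable's actual code; the `Field` class and the `rename` and `read_by_id` helpers are hypothetical names used only for illustration. It shows that a rename changes only the ID-to-name mapping, so a target format that does not carry the IDs has nothing to tie the new name back to the existing data.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Field:
    field_id: int  # stable identifier, never changes on rename
    name: str      # user-visible column name


def rename(schema, old, new):
    """Return a new schema where the field named `old` is named `new`; IDs are untouched."""
    return [replace(f, name=new) if f.name == old else f for f in schema]


def read_by_id(schema, row):
    """Resolve each column through its field ID, then label it with the current name."""
    return {f.name: row[f.field_id] for f in schema}


# Schema at write time, and one stored row keyed by field ID (toy data).
schema_v1 = [Field(1, "id"), Field(2, "data")]
data_file_row = {1: 1, 2: "abc"}

# The rename touches metadata only; the stored row is not rewritten.
schema_v2 = rename(schema_v1, "id", "new_id")

print(read_by_id(schema_v2, data_file_row))  # {'new_id': 1, 'data': 'abc'}
```

A reader that resolves columns by name instead would need either the old name or an explicit ID mapping in the target table's metadata, which is what the items below are about.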

For Delta: "delta.columnMapping.mode":"id" and "delta.columnMapping.maxColumnId":"4" are missing from the Delta log. See https://docs.databricks.com/en/delta/delta-column-mapping.html for the implementation of 'delta.columnMapping.mode' = 'id'.
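One way to check for these properties is to parse a commit file from the table's `_delta_log` directory, which is newline-delimited JSON where the `metaData` action carries a `configuration` map. The following is a minimal sketch under that assumption; the helper name `missing_column_mapping_keys` and the trimmed sample commit are hypothetical and not taken from the converted table.

```python
import json

# Table properties that the issue expects to find when column mapping is enabled.
REQUIRED_KEYS = ("delta.columnMapping.mode", "delta.columnMapping.maxColumnId")


def missing_column_mapping_keys(commit_text):
    """Return the required keys absent from the last metaData action in a commit file."""
    configuration = {}
    for line in commit_text.splitlines():
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)  # one JSON action per line
        if "metaData" in action:
            configuration = action["metaData"].get("configuration", {})
    return [key for key in REQUIRED_KEYS if key not in configuration]


# Hypothetical, heavily trimmed commit content used only for illustration.
sample_commit = "\n".join([
    json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}),
    json.dumps({"metaData": {"id": "example", "configuration": {}}}),
])

print(missing_column_mapping_keys(sample_commit))
# ['delta.columnMapping.mode', 'delta.columnMapping.maxColumnId']
```

Against a real table, the text of a commit file under MY_PATH/repro_rename/_delta_log/ would be passed in instead of `sample_commit`; an empty list would mean both properties are present.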

For Hudi: the Hudi commit log doesn't have id and max_column_id populated (fields in https://github.com/apache/hudi/pull/4910/files).

vamshigv commented 11 months ago

That would be a great feature to have. Since Iceberg and Delta both support column renames, we can start by supporting those. @huan233usc Do you want to pick that up?

taher-cldcvr commented 11 months ago

@vamshigv If someone can guide me a bit, I am happy to pick it up.