apache / incubator-xtable

Apache XTable (incubating) is a cross-table converter for lakehouse table formats that facilitates interoperability across data processing systems and query engines.
https://xtable.apache.org/
Apache License 2.0
921 stars 147 forks

Unable to read the Iceberg table in Athena that was converted from Hudi to Iceberg format using XTable #581

Open rangareddy opened 4 days ago

rangareddy commented 4 days ago

Please describe the bug 🐞

Team, I converted a Hudi table to an Iceberg table using XTable. When I query the table from Athena, I get the following error:

ICEBERG_BAD_DATA: Field last_modified_time's type INT64 in parquet file s3a:////.parquet is incompatible with type timestamp(6) with time zone defined in table schema This query ran against the "" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 1f0401d0-584e-4eec-8a2d-9f719a85973c

Hudi Table Schema:

CREATE EXTERNAL TABLE `default.my_table`(
  `_hoodie_commit_time` string, 
  `_hoodie_commit_seqno` string, 
  `_hoodie_record_key` string, 
  `_hoodie_partition_path` string, 
  `_hoodie_file_name` string, 
  `my_col` double, 
  `last_modified_time` bigint)
PARTITIONED BY ( 
  `partiton_id` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'hoodie.query.as.ro.table'='false', 
  'path'='s3a://<bucket_name>/my_table') 
STORED AS INPUTFORMAT 
  'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a://<bucket_name>/my_table'
TBLPROPERTIES (
  'bucketing_version'='2', 
  'hudi.metadata-listing-enabled'='FALSE', 
  'isRegisteredWithLakeFormation'='false', 
  'last_commit_completion_time_sync'='20241121011339000', 
  'last_commit_time_sync'='20241121011254282', 
  'last_modified_by'='hadoop', 
  'last_modified_time'='1732162935', 
  'spark.sql.create.version'='3.5.2-amzn-1', 
  'spark.sql.sources.provider'='hudi', 
  'spark.sql.sources.schema.numPartCols'='1', 
  'spark.sql.sources.schema.numParts'='1', 
  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"_hoodie_commit_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},
  {\"name\":\"_hoodie_commit_seqno\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_record_key\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}, {\"name\":\"_hoodie_partition_path\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_file_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},
  {\"name\":\"my_col\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"last_modified_time\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}},
  {\"name\":\"partiton_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}', 
  'spark.sql.sources.schema.partCol.0'='partiton_id', 
  'transient_lastDdlTime'='1732162935')


the-other-tim-brown commented 4 days ago

@rangareddy what is the data type of the field in the parquet file? I see that last_modified_time is listed as bigint in the DDL but as timestamp in the spark.sql.sources.schema.part.0 property. In Hudi, you'd need to use a logical type for a timestamp field.

xushiyan commented 4 days ago

@rangareddy since you're testing with Athena, you can ignore the spark.sql.* table properties. The problem is that the Parquet files store the column as a plain INT64, but the Iceberg table schema defines it as timestamp(6) with time zone, which violates Iceberg's type checks. See if there is any config in Iceberg to bypass this validation. Also, have you tried creating the source table with a timestamp type for last_modified_time? That should work.
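For that last suggestion, a sketch of what recreating the source table could look like in Spark SQL, reusing the column names from the DDL above. This is an assumption about the intended fix, not a verified resolution, and it assumes the existing bigint values are epoch milliseconds:

```sql
-- Recreate the Hudi table with a real timestamp column instead of bigint,
-- so Hudi writes a Parquet timestamp logical type that XTable can map to
-- Iceberg's timestamp(6) with time zone. Table name is hypothetical.
CREATE TABLE default.my_table_fixed (
  my_col double,
  last_modified_time timestamp,
  partiton_id string
) USING hudi
PARTITIONED BY (partiton_id)
LOCATION 's3a://<bucket_name>/my_table_fixed';

-- Backfill by converting the existing epoch-millis values
INSERT INTO default.my_table_fixed
SELECT my_col,
       timestamp_millis(last_modified_time) AS last_modified_time,
       partiton_id
FROM default.my_table;
```

After re-running the XTable sync against the new table, the Iceberg metadata and the Parquet footers should agree on the timestamp type.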