apache / incubator-xtable

Apache XTable (incubating) is a cross-table converter for lakehouse table formats that facilitates interoperability across data processing systems and query engines.
https://xtable.apache.org/
Apache License 2.0
921 stars 147 forks

Unable to read the Iceberg table in Athena that was converted from Hudi to Iceberg format using XTable #581

Open rangareddy opened 4 days ago

rangareddy commented 4 days ago

Please describe the bug 🐞

Team, I converted a Hudi table to an Iceberg table using XTable. When I query the table from Athena, I get the following error:

ICEBERG_BAD_DATA: Field last_modified_time's type INT64 in parquet file s3a:////.parquet is incompatible with type timestamp(6) with time zone defined in table schema This query ran against the "" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 1f0401d0-584e-4eec-8a2d-9f719a85973c

Hudi Table Schema:

CREATE EXTERNAL TABLE `default.my_table`(
  `_hoodie_commit_time` string, 
  `_hoodie_commit_seqno` string, 
  `_hoodie_record_key` string, 
  `_hoodie_partition_path` string, 
  `_hoodie_file_name` string, 
  `my_col` double, 
  `last_modified_time` bigint)
PARTITIONED BY ( 
  `partiton_id` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'hoodie.query.as.ro.table'='false', 
  'path'='s3a://<bucket_name>/my_table') 
STORED AS INPUTFORMAT 
  'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a://<bucket_name>/my_table'
TBLPROPERTIES (
  'bucketing_version'='2', 
  'hudi.metadata-listing-enabled'='FALSE', 
  'isRegisteredWithLakeFormation'='false', 
  'last_commit_completion_time_sync'='20241121011339000', 
  'last_commit_time_sync'='20241121011254282', 
  'last_modified_by'='hadoop', 
  'last_modified_time'='1732162935', 
  'spark.sql.create.version'='3.5.2-amzn-1', 
  'spark.sql.sources.provider'='hudi', 
  'spark.sql.sources.schema.numPartCols'='1', 
  'spark.sql.sources.schema.numParts'='1', 
  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"_hoodie_commit_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},
  {\"name\":\"_hoodie_commit_seqno\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_record_key\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}, {\"name\":\"_hoodie_partition_path\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_file_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},
  {\"name\":\"my_col\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"last_modified_time\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}},
  {\"name\":\"partiton_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}', 
  'spark.sql.sources.schema.partCol.0'='partiton_id', 
  'transient_lastDdlTime'='1732162935')


the-other-tim-brown commented 4 days ago

@rangareddy what is the data type of the field in the parquet file? I see that last_modified_time is listed as bigint in the DDL but as timestamp in the spark.sql.sources.schema.part.0 property. In Hudi, you'd need to use a logical type for a timestamp field.

xushiyan commented 4 days ago

@rangareddy since you're testing with Athena, you can ignore the spark.sql.* table properties. The problem is that the Parquet files store the column as a plain INT64, but the Iceberg table schema defines it as timestamp(6) with time zone, which violates Iceberg's type checks. See if there is any config in Iceberg to bypass this validation. Also, have you tried creating the source table with a timestamp type for last_modified_time? That should work.
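For that last suggestion, a sketch of what recreating the source table could look like in Spark SQL, reusing the column names from the DDL above. This is an assumption about the intended fix, not a verified resolution, and it assumes the existing bigint values are epoch milliseconds:

```sql
-- Recreate the Hudi table with a real timestamp column instead of bigint,
-- so Hudi writes a Parquet timestamp logical type that XTable can map to
-- Iceberg's timestamp(6) with time zone. Table name is hypothetical.
CREATE TABLE default.my_table_fixed (
  my_col double,
  last_modified_time timestamp,
  partiton_id string
) USING hudi
PARTITIONED BY (partiton_id)
LOCATION 's3a://<bucket_name>/my_table_fixed';

-- Backfill by converting the existing epoch-millis values
INSERT INTO default.my_table_fixed
SELECT my_col,
       timestamp_millis(last_modified_time) AS last_modified_time,
       partiton_id
FROM default.my_table;
```

After re-running the XTable sync against the new table, the Iceberg metadata and the Parquet footers should agree on the timestamp type.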