fbocse closed this issue 5 years ago
@fbocse, INT96 timestamps are not supported in the Iceberg spec. Iceberg has strict requirements about how types are stored to guarantee interoperability and INT96 timestamps don't meet those standards.
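(For background, not from the thread: INT96 is a legacy Impala convention that packs nanoseconds-of-day plus a Julian day number into 12 bytes, which is part of why portable formats like Iceberg reject it. A minimal stdlib-only sketch of the decoding, assuming the usual little-endian layout:)

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_EPOCH_DAY = 2440588  # Julian day number of 1970-01-01

def decode_int96(raw: bytes) -> datetime:
    """Decode a 12-byte legacy INT96 Parquet timestamp:
    8 little-endian bytes of nanoseconds within the day, followed by
    4 little-endian bytes holding the Julian day number."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_EPOCH_DAY
    return (datetime(1970, 1, 1, tzinfo=timezone.utc)
            + timedelta(days=days_since_epoch,
                        microseconds=nanos_of_day // 1000))

# Julian day 2440588 with zero nanos is exactly the Unix epoch.
raw = struct.pack("<qi", 0, 2440588)
print(decode_int96(raw))  # 1970-01-01 00:00:00+00:00
```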
Instead of writing data with that timestamp format, you can use the Spark integration or Iceberg helper methods to produce the data files.
In Spark, you'd do this:
```scala
df.write.format("iceberg").save("hdfs://nn/path/to/table")
```
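(An aside on my part, not from the thread: if data must be written with Spark's plain Parquet output rather than through Iceberg, Spark 2.3+ has a session option to emit standard INT64 timestamps instead of legacy INT96 — something like the following in `spark-defaults.conf`:)

```properties
# Write standard INT64 microsecond timestamps instead of legacy INT96
spark.sql.parquet.outputTimestampType  TIMESTAMP_MICROS
```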
With the helpers, you can create data files directly like this:

```java
FileAppender<Record> appender = Parquet.write(out)
    .forTable(table)
    .createWriterFunc(ParquetAvroWriter::buildWriter)
    .build();
appender.add(record);
appender.close();
```
Here's an example from tests: https://github.com/Netflix/iceberg/blob/master/spark/src/test/java/com/netflix/iceberg/spark/data/TestParquetAvroWriter.java#L80-L85
@rdblue thank you very much for the detailed explanation. While searching the web for relevant literature on this topic, I came across this old PR https://github.com/apache/parquet-format/pull/49, where your explanations were also very insightful. Loved the ending, though :)
Long story short: This is a nightmare.
Writing a very basic collection to Parquet, such as...
and generating the Iceberg schema from the Spark schema with
`com.netflix.iceberg.spark.SparkSchemaUtil#convert(org.apache.spark.sql.types.StructType)`,
then trying to load the data from disk using the "iceberg" format, I get
Should I put together a more comprehensive integration test for this? I'm hoping this isn't an actual issue and it's just something I'm missing here 👍
Stack trace