fvaleye opened this issue 3 years ago
Did the loop on this ever get closed? I've run into this a few times when adding parquet files to Delta tables because the timestamps are written with different configurations.
Since Parquet 2.6 has a great int64 timestamp nanos type, could Delta standardize on top of that? Java also has nanosecond precision.
Iceberg is adding nanosecond type too: https://github.com/apache/iceberg/pull/8683
@alippai that's great! Unfortunately for Delta we are bound by what the delta protocol states :(
@ion-elgreco how can we extend the delta protocol? I thought this is the correct issue / repo for that.
It's the correct repo, but it needs to get accepted in the protocol first
Hello!
Coming from the Delta-RS community, I have several questions regarding the timestamp type in the DeltaTable schema serialization saved in the transaction log.
**Context**
The transaction protocol's schema serialization format specifies the timestamp type with microsecond precision. It means that Spark uses a timestamp with microsecond precision here, in the local or a given time zone. But when Spark writes timestamp values out to non-text data sources like Parquet using Delta, the values are just instants (like a timestamp in UTC) that carry no time zone information.
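For reference, the serialized schema in the transaction log only carries the type name, not a precision. A minimal illustration of such a `schemaString` (the column name `ts` is just an example I reuse below):

```python
import json

# Illustrative schemaString as stored in the transaction log: the primitive
# type is simply "timestamp", with no precision attached to it.
schema_string = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "ts", "type": "timestamp", "nullable": True, "metadata": {}},
    ],
})
print(schema_string)
```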
Taking that into account, if we look at the configuration `spark.sql.parquet.outputTimestampType` here, we see that the default output timestamp type is `ParquetOutputTimestampType.INT96.toString`. With this default, timestamps are written with nanosecond precision in the `.parquet` files. But it could also be changed to `ParquetOutputTimestampType.INT64` with `TIMESTAMP_MICROS`, or `ParquetOutputTimestampType.INT64` with `TIMESTAMP_MILLIS`.
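For example, this is roughly how I switch the output type when writing with Spark (assuming a session with the Delta Lake dependency configured; the path and column name are made up):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("timestamp-precision-demo")
    # Default is INT96; INT64 with TIMESTAMP_MICROS or TIMESTAMP_MILLIS changes
    # the physical type stored in the .parquet files, while the Delta schema
    # still only says "timestamp".
    .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    .getOrCreate()
)

df = spark.sql("SELECT current_timestamp() AS ts")
df.write.format("delta").mode("overwrite").save("/tmp/delta-ts-demo")
```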
**Use-case**
When I am applying a transaction log schema on a DeltaTable (using a timestamp with microsecond precision here), I have a mismatch between the precision of the timestamp given by the schema of the protocol and the real one: the timestamp inside the `.parquet` files has nanosecond precision because it uses the default `outputTimestampType` (but it could be microseconds or milliseconds depending on the configuration).
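To make the mismatch concrete, this is roughly how I observe it, using the hypothetical `/tmp/delta-ts-demo` table from above with the `deltalake` Python bindings and `pyarrow` (just my local reproduction, not part of the protocol itself):

```python
import glob

import pyarrow.parquet as pq
from deltalake import DeltaTable

table_path = "/tmp/delta-ts-demo"  # hypothetical table written by Spark above

# Schema as recorded in the transaction log: the type is just "timestamp",
# with no precision attached.
print(DeltaTable(table_path).schema())

# Schema as recorded in the Parquet footer of an actual data file: the unit
# depends on spark.sql.parquet.outputTimestampType (ns for INT96, us or ms
# for the INT64 variants).
data_file = glob.glob(f"{table_path}/*.parquet")[0]
print(pq.read_schema(data_file).field("ts").type)
```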
**Questions**
1. Why is the precision of the timestamp not written with the timestamp type inside the schema of the transaction log? It would allow getting the timestamp precision of the DeltaTable schema when reading the DeltaTable without the Spark dependency.
2. Does it mean that the microsecond precision of the timestamp is only for internal Spark/Delta processing? In other words, must the schema of the parquet files be read directly from the `.parquet` files and not from the DeltaTable transaction protocol?
3. If we change the default timestamp precision to nanoseconds here for applying the schema on `.parquet` files, it will only work for the default `spark.sql.parquet.outputTimestampType` configuration, but not for the `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` ones, right? (A small sketch of how I picture this follows below.)
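Regarding the last question, this is how I picture the problem, sketched with pyarrow types rather than actual delta-rs code:

```python
import pyarrow as pa

# Hard-coding nanoseconds when mapping the Delta "timestamp" type to an Arrow type...
assumed_type = pa.timestamp("ns")

# ...only matches the default INT96 output; the INT64 variants come back with
# other units when the data files are read.
actual_types = {
    "INT96": pa.timestamp("ns"),
    "TIMESTAMP_MICROS": pa.timestamp("us"),
    "TIMESTAMP_MILLIS": pa.timestamp("ms"),
}
for config_value, actual_type in actual_types.items():
    print(config_value, assumed_type == actual_type)
```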
Thank you for your help!