Closed nicornk closed 2 years ago
It seems our Spark instance is configured with
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
which I learned from here: https://stackoverflow.com/questions/56582539/how-to-save-spark-dataframe-to-parquet-without-using-int96-format-for-timestamp
Hi @nicornk,
The error message indicates a type mismatch between the type we read from the Parquet file and the type the database expects the UDF to return. And, as you said, the configuration of your Spark cluster probably produces a column with a different type than usual. We will discuss what we can do in this case and get back to you.
Hello @nicornk,
Thanks for the feedback! At the moment, we do not support microsecond timestamps (see the currently supported Parquet mappings). I think there was a reason we did not add it initially, but we are going to look into it.
Since it is already defined in the logical type as MICROS, I guess we could support it as well.
message spark_schema {
optional int64 CreateDate (TIMESTAMP(MICROS,true));
}
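For illustration, an int64 column with this logical type stores microseconds since the Unix epoch (isAdjustedToUTC=true). A minimal plain-Python sketch of decoding such a raw value (the sample value is made up, not taken from the attached file):

```python
from datetime import datetime, timezone

# Hypothetical raw int64 value from a TIMESTAMP(MICROS, true) Parquet column:
# microseconds since 1970-01-01 00:00:00 UTC.
raw_micros = 1_609_459_200_123_456

# Split into whole seconds and the sub-second microsecond remainder.
seconds, micros = divmod(raw_micros, 1_000_000)
ts = datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=micros)
print(ts.isoformat())  # 2021-01-01T00:00:00.123456+00:00
```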
Okay, the main reason it was not supported initially is that the Exasol database only supports timestamps up to millisecond precision. For reference: https://docs.exasol.com/sql_references/data_types/datatypedetails.htm#DateTimeDataTypes.
But maybe this is okay, since we can still read Parquet data into timestamps truncated to milliseconds. Would that be acceptable on your side?
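A minimal sketch of the kind of truncation this implies (plain Python, function name illustrative, not the extension's actual code): dropping the sub-millisecond digits of a microseconds-since-epoch value.

```python
def truncate_micros_to_millis(epoch_micros: int) -> int:
    """Truncate a microseconds-since-epoch value to millisecond precision,
    returning milliseconds since epoch (sub-millisecond digits dropped)."""
    return epoch_micros // 1_000

# 2021-01-01 00:00:00.123456 UTC in microseconds -> .123 in milliseconds
print(truncate_micros_to_millis(1_609_459_200_123_456))  # 1609459200123
```

Integer floor division keeps the truncation exact, which avoids the rounding surprises that a float-based conversion could introduce.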
@morazow Yes, that would absolutely be okay for us.
I was looking through the code and trying to make the adaptation myself to truncate the timestamp, but I think it would take me significantly more time to contribute this than it would take you.
Thank you
Hello @nicornk,
Changes added in #182, we are planning a new release by the end of today, or tomorrow morning. Thanks again for the feedback!
@morazow We are maintaining our own fork of the cloud-storage-extension. I am currently updating it and will report back my test findings.
@morazow All tests were successful on our stack. Thanks again.
Morning @nicornk, that is great news! Just for your information, we released the new 2.3.0 version with this feature.
Hello,
we are struggling to import Parquet files containing an Apache Spark Timestamp type into Exasol using the cloud-storage-extension. I have created a minimal reproducible example with one column and one row, with the following Parquet schema:
We are using the following DDL statement to create the table:
This is the stack trace from the UDF. Any idea what the root cause could be here? I have attached the Parquet file (remove the .txt extension; otherwise GitHub would not let me upload the file).
part-00000-d9c3bf30-28d6-4246-b4d4-f1a43761d8a3-c000.snappy.parquet.txt
Thanks a lot in advance for your analysis and help.
Nicolas