exasol / cloud-storage-extension

Exasol Cloud Storage Extension for accessing formatted data (Avro, Orc, and Parquet) on public cloud storage systems
MIT License

Can not import into TIMESTAMP type: E-UDF-CL-SL-JAVA-1109: emit column 'CreateDate' is of type TIMESTAMP but data given have type java.lang.Long #181

Closed · nicornk closed this issue 2 years ago

nicornk commented 2 years ago

Hello,

we are struggling to import Parquet files with an Apache Spark Timestamp type into Exasol using the cloud-storage-extension. I have created a minimal reproducible example with one column and one row, using the following Parquet schema:

message spark_schema {
  optional int64 CreateDate (TIMESTAMP(MICROS,true));
}

We are using the following DDL statement to create the table:

CREATE OR REPLACE TABLE TEST.TMP_EXA_TEST_SET (
"CreateDate" TIMESTAMP
);

This is the stack trace from the UDF. Any idea what the root cause could be here? I have attached the Parquet file (remove the .txt extension; otherwise GitHub would not let me upload the file).

com.exasol.ExaUDFException: F-UDF-CL-SL-JAVA-1080: Exception during run 
com.exasol.ExaDataTypeException: E-UDF-CL-SL-JAVA-1109: emit column 'CreateDate' is of type TIMESTAMP but data given have type java.lang.Long
com.exasol.ExaIteratorImpl.emit(ExaIteratorImpl.java:124)
com.exasol.cloudetl.emitter.FilesDataEmitter$$anon$1.accept(FilesDataEmitter.scala:63)
com.exasol.cloudetl.emitter.FilesDataEmitter$$anon$1.accept(FilesDataEmitter.scala:61)
com.exasol.parquetio.reader.RowParquetChunkReader.consumeRecords(RowParquetChunkReader.java:168)
com.exasol.parquetio.reader.RowParquetChunkReader.consumeRows(RowParquetChunkReader.java:146)
com.exasol.parquetio.reader.RowParquetChunkReader.read(RowParquetChunkReader.java:125)
com.exasol.cloudetl.emitter.FilesDataEmitter.$anonfun$emitParquetData$1(FilesDataEmitter.scala:61)
com.exasol.cloudetl.emitter.FilesDataEmitter.$anonfun$emitParquetData$1$adapted(FilesDataEmitter.scala:57)
scala.collection.immutable.Map$Map1.foreach(Map.scala:193)
com.exasol.cloudetl.emitter.FilesDataEmitter.emitParquetData(FilesDataEmitter.scala:57)
com.exasol.cloudetl.emitter.FilesDataEmitter.emit(FilesDataEmitter.scala:39)
com.exasol.cloudetl.scriptclasses.FilesDataImporter$.run(FilesDataImporter.scala:39)
com.exasol.cloudetl.scriptclasses.FilesDataImporter.run(FilesDataImporter.scala)
com.exasol.ExaWrapper.run(ExaWrapper.java:196)

part-00000-d9c3bf30-28d6-4246-b4d4-f1a43761d8a3-c000.snappy.parquet.txt

Thanks a lot in advance for your analysis and help.

Nicolas

nicornk commented 2 years ago

It seems our Spark instance is configured with

spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

which I learned about here: https://stackoverflow.com/questions/56582539/how-to-save-spark-dataframe-to-parquet-without-using-int96-format-for-timestamp
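
For reference, a minimal write that reproduces the schema above (a sketch, assuming a SparkSession in scope as spark and a writable local path; the column values are made up). With this setting, Spark annotates the column as int64 TIMESTAMP(MICROS,true) instead of its INT96 default:

import java.sql.Timestamp
import spark.implicits._

spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

// One column, one row, matching the schema in the issue description.
val df = Seq(Timestamp.valueOf("2021-06-01 12:34:56.789")).toDF("CreateDate")
df.write.mode("overwrite").parquet("/tmp/exa_test_set")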

tkilias commented 2 years ago

Hi @nicornk,

The error message indicates a type mismatch between what we read from the Parquet file and what the database expects the UDF to emit. And, as you said, the configuration of your Spark cluster probably produces a column with a different type than usual. We will discuss what we can do in this case and get back to you.

morazow commented 2 years ago

Hello @nicornk,

Thanks for the feedback! At the moment, we do not support microsecond-precision timestamps (see the currently supported Parquet mappings). I think there was a reason we did not add this initially, but we are going to look into it.

morazow commented 2 years ago

Since it is already declared in the logical type as MICROS, I guess we could support it as well.

message spark_schema {
  optional int64 CreateDate (TIMESTAMP(MICROS,true));
}
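
For illustration, a sketch of detecting the MICROS unit from that logical type annotation, using the parquet-mr API (an assumption for illustration, not necessarily how the extension's reader is structured):

import org.apache.parquet.schema.LogicalTypeAnnotation.{TimeUnit, TimestampLogicalTypeAnnotation}
import org.apache.parquet.schema.PrimitiveType

// Returns true when the field is a timestamp annotated with the MICROS unit.
def isMicrosTimestamp(field: PrimitiveType): Boolean =
  field.getLogicalTypeAnnotation match {
    case ts: TimestampLogicalTypeAnnotation => ts.getUnit == TimeUnit.MICROS
    case _                                  => false
  }
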
morazow commented 2 years ago

Okay, the main reason it was not supported initially is that the Exasol database only supports timestamps up to millisecond precision. For reference: https://docs.exasol.com/sql_references/data_types/datatypedetails.htm#DateTimeDataTypes.

But maybe this is okay, since we can still read the Parquet data into a timestamp truncated to milliseconds. Would that be okay from your side?
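
For illustration, a minimal sketch of that truncation on the raw epoch-micros long values (an assumption about where the conversion would happen, not the extension's actual code):

import java.sql.Timestamp
import java.time.Instant

// Convert epoch microseconds to a java.sql.Timestamp, dropping the
// sub-millisecond digits (123456 micros of a second become 123 millis).
def microsToMillisTimestamp(micros: Long): Timestamp = {
  val seconds       = Math.floorDiv(micros, 1000000L)
  val microOfSecond = Math.floorMod(micros, 1000000L)
  val nanos         = (microOfSecond / 1000L) * 1000000L
  Timestamp.from(Instant.ofEpochSecond(seconds, nanos))
}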

nicornk commented 2 years ago

@morazow Yes, that would absolutely be okay for us.

I was looking through the code and tried to make the adaptation myself to truncate the timestamp, but I think it would take me significantly more time to contribute this than it would take you.

Thank you

morazow commented 2 years ago

Hello @nicornk,

The changes were added in #182; we are planning a new release by the end of today or tomorrow morning. Thanks again for the feedback!

nicornk commented 2 years ago

@morazow We maintain our own fork of the cloud-storage-extension. I am currently updating it and will report back with my test findings.

nicornk commented 2 years ago

@morazow All tests were successful on our stack. Thanks again.

morazow commented 2 years ago

Morning @nicornk, that is great news! Just for your information, we released the new 2.3.0 version with this feature.