Hi @matthewLiem, thanks for reporting this. Does this issue always happen?
Yup - it's happening consistently. I've reduced the input data to a single file and am still seeing the error. I also get the error regardless of whether it's an upsert or a delete. I'll see what else I can run to help isolate the issue and try to produce a sample data file for repro.
Got it, I'm trying to figure it out. Also, it would help a lot if there were sample data that can reproduce this. :)
Thanks for the help @lamber-ken. The issue was due to a mismatch in data types between the Hudi table and the DF we're looking to UPSERT. Casting properly and ensuring the schema types matched on both sides resolved the issue.
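In case it helps anyone else, the fix was roughly of this shape. This is just a sketch: the paths are placeholders, and the column name and LongType are borrowed from the repro further down, not our exact schema.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

// Same style of write options as in the repro below.
val hudiOptions = Map(
  "hoodie.table.name" -> "hudi_identity",
  "hoodie.datasource.write.recordkey.field" -> "auth_id",
  "hoodie.datasource.write.precombine.field" -> "last_mod_time")

// Incoming data to UPSERT (placeholder source path).
val incomingDF = spark.read.parquet("s3://bucket/incoming/*")

// Cast the mismatched column so the DataFrame's type matches what the Hudi table was created with,
// then write in Append (upsert) mode as usual.
val alignedDF = incomingDF.withColumn("run_detail_id", col("run_detail_id").cast(LongType))
alignedDF.write.format("org.apache.hudi").options(hudiOptions).mode("Append").save("s3://bucket/hudi_table")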
@matthewLiem welcome :)
Recording some notes: the error goes away after changing

val updateDF = inputDF.withColumn("run_detail_id", lit(123456))

to

val updateDF = inputDF.withColumn("run_detail_id", lit(123456L))

(lit(123456) produces an IntegerType column, while lit(123456L) produces a LongType one that matches the table schema.)
Steps to reproduce:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
val inputDataPath = "file:///tmp/test/ttt/*"
val hudiTableName = "hudi_identity"
val hudiTablePath = "file:///tmp/test/nnn"
val hudiOptions = Map[String,String](
"hoodie.datasource.write.recordkey.field" -> "auth_id",
"hoodie.table.name" -> hudiTableName,
"hoodie.datasource.write.precombine.field" -> "last_mod_time")
// create
val inputDF = spark.read.format("parquet").load(inputDataPath)
inputDF.write.format("org.apache.hudi").options(hudiOptions).mode("Overwrite").save(hudiTablePath)
// update
val inputDF = spark.read.format("parquet").load(inputDataPath)
val updateDF = inputDF.withColumn("run_detail_id", lit(123456))
updateDF.write.format("org.apache.hudi").options(hudiOptions).mode("Append").save(hudiTablePath)
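For completeness, here is a sketch of verifying and fixing the mismatch in this repro. It assumes run_detail_id is a bigint/long in the source parquet, which is what the note above implies.

// The first write takes inputDF's schema, so comparing it with updateDF's schema
// exposes the mismatch: lit(123456) produces an IntegerType column.
inputDF.printSchema()
updateDF.printSchema()

// Using a long literal (per the note above) keeps the column type consistent,
// and the append then succeeds.
val fixedUpdateDF = inputDF.withColumn("run_detail_id", lit(123456L))
fixedUpdateDF.write.format("org.apache.hudi").options(hudiOptions).mode("Append").save(hudiTablePath)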
@lamber-ken https://github.com/apache/hudi/issues/1802 looks similar, any idea?
I am reading Snowflake tables and writing to S3 in Hudi format, and I'm facing this issue consistently. How do I identify which columns to cast? When reading the existing data from S3 and the incremental data from Snowflake into Spark, both dataframes have the same data types.
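One way to narrow it down (just a sketch, not official Hudi tooling) is to diff the two schemas field by field and print any columns whose types differ. The DataFrame names in the commented usage line are placeholders for whatever you already load from S3 and Snowflake.

import org.apache.spark.sql.DataFrame

// List columns whose data types differ between the existing dataset and the incremental one,
// plus columns that only exist on the incoming side.
def schemaDiff(existing: DataFrame, incoming: DataFrame): Unit = {
  val existingTypes = existing.schema.fields.map(f => f.name -> f.dataType).toMap
  incoming.schema.fields.foreach { f =>
    existingTypes.get(f.name) match {
      case Some(t) if t != f.dataType =>
        println(s"type mismatch on '${f.name}': existing has $t, incoming has ${f.dataType}")
      case None =>
        println(s"column '${f.name}' is new (not in the existing data)")
      case _ => // types match, nothing to report
    }
  }
}

// schemaDiff(existingDF, incrementalDF)  // existingDF read back from S3, incrementalDF from Snowflake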
@nsivabalan any suggestions on fixes for this? It is causing issues in our production.
Hello - using EMR (Hudi 0.5, Spark 2.4.4), during upsert I'm running into the below error:
There were similar issues posted before, but not specific to ParquetDecodingException. I'm able to read the Hudi table/dataset directly, so I don't think it's related to the parquet files. I am trying to trim down the data and provide a repro, but checking if anyone has pointers. This is a non-partitioned table, and I'll see if it's related to RECORDKEY_FIELD_OPT_KEY or PRECOMBINE_FIELD_OPT_KEY.
Here are some other details: