apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Spark Read Hudi Tables with WARN Message #10828

Closed · michael1991 closed this issue 2 months ago

michael1991 commented 6 months ago

Describe the problem you faced

I'm using Spark 3.3.2 with the Hudi 0.14.1 package to read a Hudi table (written with 0.12.3), and I get the following WARN message:

WARN  HoodieFileIndex:367 - Met incompatible issue when converting to hudi data type, rollback to list by prefix directly

What does it mean, and how can I avoid this warning?

Environment Description

ad1happy2go commented 6 months ago

@michael1991 Can you let us know more about when you are getting this issue? This is expected when the partition column's datatype can't be mapped to a Hudi data type based on the mapping below. Can you provide the writer configuration and the schema of the table?

[image: supported partition column to Hudi data type mapping]
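
To see which fields a table is partitioned by (so they can be checked against that mapping), something along these lines should work; a sketch assuming Hudi 0.14.x on the classpath and a hypothetical base path:

import org.apache.hudi.common.table.HoodieTableMetaClient

// Reads the table's own config from the .hoodie/ directory.
val metaClient = HoodieTableMetaClient.builder()
  .setConf(spark.sparkContext.hadoopConfiguration)
  .setBasePath("/data/hudi/table_a")
  .build()

// Comma-separated partition field names, e.g. "req_date,req_hour".
println(metaClient.getTableConfig.getPartitionFieldProp)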
michael1991 commented 6 months ago

Thanks for your quick response! @ad1happy2go

Spark writer configuration:

newDetails.write.format("hudi")
          .option(RECORDKEY_FIELD.key(), "id")
          .option(PRECOMBINE_FIELD.key(), "id")
          .option(SCHEMA_EVOLUTION_ENABLED.key(), "true")
          .option(DATABASE_NAME.key(), "db_prod")
          .option(INDEX_TYPE.key(), "BLOOM")
          .option(COMBINE_BEFORE_UPSERT.key(), "false")
          .option(INSERT_PARALLELISM_VALUE.key(), "30")
          .option(UPSERT_PARALLELISM_VALUE.key(), "200")
          .option(CLEANER_COMMITS_RETAINED.key(), "4")
          .option(ASYNC_CLEAN.key(), "false")
          .option(PARTITIONPATH_FIELD.key(), "req_date,req_hour")
          .option(TBL_NAME.key(), TBL_LOG_INCREMENT_DETAILS_NAME)
          .option(OPERATION.key(), INSERT_OPERATION_OPT_VAL)
          .option(WRITE_PAYLOAD_CLASS_NAME.key(), CUSTOM_PAYLOAD_CLASS)
          .option(COPY_ON_WRITE_RECORD_SIZE_ESTIMATE.key(), "64")
          .mode(SaveMode.Append).save(TBL_LOG_INCREMENT_DETAILS_PATH)

The schema is like: id string, req_date string, req_hour string, etc.
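
For reference, here is a self-contained sketch of the same write pattern with literal config keys substituted for the constants above (sample data, table name, and path are hypothetical):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-write-repro").getOrCreate()
import spark.implicits._

// Sample rows matching the schema above: all three columns are strings.
val newDetails = Seq(
  ("1", "2024-03-07", "09"),
  ("2", "2024-03-07", "10")
).toDF("id", "req_date", "req_hour")

newDetails.write.format("hudi")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "req_date,req_hour")
  .option("hoodie.datasource.write.operation", "insert")
  .option("hoodie.table.name", "tbl_log_increment_details")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/tbl_log_increment_details")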

ad1happy2go commented 6 months ago

Can you do newDetails.printSchema and check the datatypes of req_date and req_hour?
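
That is, roughly the following, with the expected output sketched from the schema described above:

newDetails.printSchema()
// root
//  |-- id: string (nullable = true)
//  |-- req_date: string (nullable = true)
//  |-- req_hour: string (nullable = true)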

michael1991 commented 6 months ago
[image: printSchema output showing req_date and req_hour as string]
ad1happy2go commented 6 months ago

@michael1991 Ideally you should not see this warning if both are strings. When are you getting this issue, on read or on write? Can you provide a reproducible script if possible?

michael1991 commented 6 months ago

I'm using Spark 3.3.2 with Hudi 0.12.3 to write table A; meanwhile, I'm using Spark 3.3.2 with Hudi 0.14.1 to read from table A and write to table B. All tables are COW type, just dataframe.write.mode("append").format("hudi").save(path) and spark.read.format("hudi").load(path). Or is it possible to enable debug logging to find something more helpful?
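
A condensed sketch of that pipeline (paths are hypothetical; spark and the input DataFrame dfA are assumed to exist):

// Job 1, with Hudi 0.12.3 on the classpath: writes table A.
dfA.write.mode("append").format("hudi").save("/data/hudi/table_a")

// Job 2, with Hudi 0.14.1 on the classpath: reads A, writes B.
// The HoodieFileIndex WARN shows up on this read.
val a = spark.read.format("hudi").load("/data/hudi/table_a")
a.write.mode("append").format("hudi").save("/data/hudi/table_b")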

ad1happy2go commented 6 months ago

@michael1991 Ohh, I was checking 0.14.1. Not sure you'll find anything useful in the debug logs, but you can turn them on like this:

spark.sparkContext.setLogLevel("DEBUG")
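
Since root-level DEBUG is very noisy, it may be enough to scope it to Hudi's own loggers; a sketch assuming log4j2 on the classpath (the default in Spark 3.3):

import org.apache.logging.log4j.Level
import org.apache.logging.log4j.core.config.Configurator

// Raise verbosity only for org.apache.hudi rather than the root logger.
Configurator.setLevel("org.apache.hudi", Level.DEBUG)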
michael1991 commented 6 months ago

I found HoodieFileIndex.scala:367 on the release-0.14.1 branch, as below:

[image: HoodieFileIndex.scala around line 367, showing the fallback to listing by prefix]

It seems to be using the index to filter files from metadata (partition pruning), am I right? If so, the table is using a bloom filter with the metadata table disabled; could that be the root cause?
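
One way to test that hypothesis would be to toggle metadata-based listing on the reader and see whether the warning changes; hoodie.metadata.enable is a real Hudi option, the path is hypothetical:

val df = spark.read.format("hudi")
  .option("hoodie.metadata.enable", "true")   // or "false" to compare
  .load("/data/hudi/table_a")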

ad1happy2go commented 6 months ago

@michael1991 Not sure if this one is related to the warning.

michael1991 commented 2 months ago

> @michael1991 Not sure if this one is related to the warning.

Hi @ad1happy2go, I resolved this warning by setting "hoodie.datasource.write.partitionpath.urlencode" -> "true". It seems that when Hive-style partitioning is enabled, the urlencode setting must also be set to avoid this warning message.
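
In writer-option form, the resolution amounts to keeping these two settings in sync; a sketch against the configuration posted earlier:

newDetails.write.format("hudi")
  // With hive-style partitioning enabled, URL-encoding the partition
  // path avoided the HoodieFileIndex warning in this setup.
  .option("hoodie.datasource.write.hive_style_partitioning", "true")
  .option("hoodie.datasource.write.partitionpath.urlencode", "true")
  // ... remaining options as in the configuration above ...
  .mode(SaveMode.Append)
  .save(TBL_LOG_INCREMENT_DETAILS_PATH)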