Armelabdelkbir opened 1 year ago
@Armelabdelkbir I recommend you upgrade your Hudi version to 0.12.3, 0.13.1 or 0.14.0. This may happen due to a missing column in later records compared to previous ones. Do you have any such scenario?
Missing column, do you mean schema evolution? We sometimes have schema evolution, but not for this use case. What is the impact of an upgrade on production? I have hundreds of tables and billions of rows; do I just need to upgrade the Hudi version and keep the same metadata folders?
@Armelabdelkbir You just need to upgrade the Hudi version; it should upgrade automatically. Was your metadata table enabled with 0.11.0? I believe it was off by default in 0.11.0. I recommend you upgrade to 0.12.3.
In this use case, is your schema always consistent?
@ad1happy2go My metadata is disabled in version 0.11.0: "hoodie.metadata.enable" -> "false". Currently I can't upgrade until the client migration is complete. Schema evolution happens sometimes, but it is disabled on my side due to my Hive version (1.2.1). In the meantime, if I have empty parquet file problems, do I just need to delete them?
You can try deleting these parquet files, although we need to understand how they got created in the first place.
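A rough sketch, not from this thread, of how one might scan a table path for suspicious parquet base files before deleting anything; it assumes the broken files are either zero-length or missing the parquet footer magic bytes, and the table path and the commented-out delete are illustrative only:

```scala
// Hedged sketch: flag base files that are zero-length or lack the "PAR1" footer magic.
// The path is a hypothetical placeholder; confirm a healthy copy exists before deleting anything.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object FindBrokenParquet {
  def main(args: Array[String]): Unit = {
    val tablePath = new Path("hdfs:///tmp/hudi/my_table")   // hypothetical table path
    val fs = FileSystem.get(tablePath.toUri, new Configuration())

    val it = fs.listFiles(tablePath, /* recursive = */ true)
    while (it.hasNext) {
      val status = it.next()
      val p = status.getPath
      if (p.getName.endsWith(".parquet")) {
        val broken =
          if (status.getLen < 8) true              // too small to even hold the magic bytes
          else {
            val in = fs.open(p)
            try {
              val magic = new Array[Byte](4)
              in.readFully(status.getLen - 4, magic, 0, 4)   // parquet files end with "PAR1"
              new String(magic, "US-ASCII") != "PAR1"
            } finally in.close()
          }
        if (broken) println(s"suspicious file: $p (${status.getLen} bytes)")
        // fs.delete(p, false)  // only after confirming the file group has a healthy copy
      }
    }
  }
}
```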
@Armelabdelkbir Maybe compaction produced the broken parquet file when it failed the first time and produced the normal parquet file when it was retried successfully. There will be two parquet files with the same file group ID and instant time, like this:
xxx partition
--- 00000000_1-2-3_2023110412345.parquet (broken parquet)
--- 00000000_4-5-6_2023110412345.parquet (normal parquet)
You can look at this issue to resolve the problem: https://github.com/apache/hudi/issues/9615
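In case it helps, a rough sketch (my own illustration, assuming Hudi's base file naming pattern `<fileGroupId>_<writeToken>_<instantTime>.parquet`) that lists one partition directory and flags file groups with more than one base file for the same instant; the partition path is a placeholder:

```scala
// Hedged sketch: group base file names by (fileGroupId, instantTime) to spot the
// "two files for the same file group and instant" pattern described above.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object FindDuplicateBaseFiles {
  // base file names look like <fileGroupId>_<writeToken>_<instantTime>.parquet
  private def parse(name: String): Option[(String, String)] = {
    val parts = name.stripSuffix(".parquet").split("_")
    if (parts.length == 3) Some((parts(0), parts(2))) else None
  }

  def main(args: Array[String]): Unit = {
    val partitionPath = new Path("hdfs:///tmp/hudi/my_table/2023/11/08")  // hypothetical partition
    val fs = FileSystem.get(partitionPath.toUri, new Configuration())

    val baseFiles = fs.listStatus(partitionPath)
      .map(_.getPath.getName)
      .filter(_.endsWith(".parquet"))

    baseFiles
      .flatMap(n => parse(n).map(k => k -> n))
      .groupBy(_._1)
      .filter(_._2.length > 1)                 // more than one base file per (fileGroup, instant)
      .foreach { case ((fileGroup, instant), files) =>
        println(s"file group $fileGroup has ${files.length} base files for instant $instant:")
        files.foreach { case (_, name) => println(s"  $name") }
      }
  }
}
```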
@watermelon12138 Thanks for the link to that issue, I'll check it. My files are in the same filegroup, for example:
3e2e9939-71f0-41dc-a5ff-c276ae3cdfc6-0_0-819-355182_20231108134016057.parquet (broken parquet)
ccf19756-bce5-402b-b85e-64232e2f34b2-0_242-819-355161_20231108134016057.parquet (normal parquet)
I encountered this in Hudi 0.13.0 as well.
@victorxiang30 @Armelabdelkbir @watermelon12138 Can you provide the schema to help me reproduce this?
If it has a complex data type, can you try setting the Spark config spark.hadoop.parquet.avro.write-old-list-structure to false?
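For reference, a minimal sketch of how that config could be set; the application name is a placeholder, and the spark-submit line in the comment is the standard equivalent:

```scala
import org.apache.spark.sql.SparkSession

// Command-line equivalent (standard Spark mechanism):
//   spark-submit --conf spark.hadoop.parquet.avro.write-old-list-structure=false ...
val spark = SparkSession.builder()
  .appName("hudi-cdc-writer") // placeholder name
  .config("spark.hadoop.parquet.avro.write-old-list-structure", "false")
  .getOrCreate()
```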
Describe the problem you faced
Hello community,
I'm using Hudi for change data capture with Spark Structured Streaming + Kafka + Debezium. My jobs work well, but occasionally a few jobs fail with errors related to parquet size or format.
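For context, a simplified sketch of this kind of pipeline (Debezium payloads from Kafka upserted into a Hudi table via foreachBatch); the broker, topic, keys, paths and Hudi options below are placeholders, not the actual job configuration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("cdc-to-hudi").getOrCreate()

// Read Debezium change events from Kafka; the payload is kept as a raw JSON string here.
val source = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")          // placeholder broker
  .option("subscribe", "dbserver1.inventory.customers")      // placeholder topic
  .load()
  .selectExpr("CAST(key AS STRING) AS record_key", "CAST(value AS STRING) AS payload")

// Upsert each micro-batch into a Hudi table (non-partitioned here for simplicity).
source.writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    batch.write
      .format("hudi")
      .option("hoodie.table.name", "customers_cdc")
      .option("hoodie.datasource.write.recordkey.field", "record_key")
      .option("hoodie.datasource.write.precombine.field", "payload")
      .option("hoodie.datasource.write.operation", "upsert")
      .option("hoodie.datasource.write.keygenerator.class",
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
      .mode("append")
      .save("hdfs:///tmp/hudi/customers_cdc")                 // placeholder table path
  }
  .option("checkpointLocation", "hdfs:///tmp/checkpoints/customers_cdc")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()
  .awaitTermination()
```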
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Write / read parquet with correct size / format
Environment Description
Hudi version : 0.11.0
Spark version : 3.1.3
Hive version : 1.2.1000
Storage (HDFS) : 2.7.3
Running on Docker? (yes/no) : no
Additional context
This problem occasionally occurs on certain tables. This is my config:
MVCC conf:
Stacktrace
* For the small parquet size error:
One day I also had this error related to parquet format: