Closed — MrAladdin closed this issue 3 months ago
You can roll back the compaction with the Hudi CLI, and the cleaner will eventually remove these logs. Before 1.0, log "cleaning" actually appends new rollback log blocks to the corrupt files rather than deleting them immediately, so the files are not cleaned instantly; they are eventually removed according to the configured cleaning strategy.
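A minimal Hudi CLI session for this kind of rollback might look like the sketch below. The table path and instant time are placeholders, not values from this issue; use the output of `compactions show all` / `commits show` to find the actual completed compaction instant, and note that Hudi typically only allows rolling back the latest instant:

```
hudi-> connect --path hdfs://nameservice1/warehouse/my_table
hudi-> compactions show all
hudi-> commits show
hudi-> commit rollback --commit 20240501123045000
```

After the rollback, the subsequent cleaning runs should reclaim the affected file slices under the table's retention policy.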
> rollback the compaction

I'm not sure which compaction to roll back, or how to locate it, since the table has already been compacted multiple times. If it is not addressed, will these logs be cleared automatically later? Is there any documentation on this behavior? I'd like to quickly understand the underlying mechanism.
Why would there be corrupt files? I don't understand what "corrupt files" means here. Does it refer to logs that have already been compacted?
Describe the problem you faced

1. Spark Structured Streaming: upsert into a MOR table (record_index).
2. After compaction, there are a large number of log files with size 0, and they are never cleaned up.
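To quantify the problem, one way to enumerate the leftover zero-byte log files is to walk the table path and match Hudi's default `.log.` naming pattern. This is a minimal local sketch (the function name is mine, and it assumes the base path is locally accessible; against HDFS you would typically run `hdfs dfs -ls -R <base_path>` and filter on the size column instead):

```python
import os

def find_zero_size_logs(base_path):
    """Return sorted paths of zero-byte files whose names contain '.log.'
    (Hudi's default log-file naming convention)."""
    hits = []
    for root, _dirs, files in os.walk(base_path):
        for name in files:
            if ".log." in name:
                path = os.path.join(root, name)
                if os.path.getsize(path) == 0:
                    hits.append(path)
    return sorted(hits)
```

Counting the hits per partition over time shows whether the cleaner is making any progress on these files.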
Please help me check again whether the configuration below is correct and whether any of the options conflict; these parameters have already overwhelmed me.
Environment Description
Hudi version : 0.14.1
Spark version : 3.4.1
Hive version : 3.1.2
Hadoop version : 3.1.3
Storage (HDFS/S3/GCS..) : hdfs
Running on Docker? (yes/no) : no
Additional context

.writeStream
  .format("hudi")
  .option("hoodie.table.base.file.format", "PARQUET")
  .option("hoodie.allow.empty.commit", "true")
  .option("hoodie.datasource.write.drop.partition.columns", "false")
  .option("hoodie.table.services.enabled", "true")
  .option("hoodie.datasource.write.streaming.checkpoint.identifier", "lakehouse-dwd-social-kbi-beauty-v1-writer-1")
  .option(PRECOMBINE_FIELD.key(), "date_kbiUdate")
  .option(RECORDKEY_FIELD.key(), "records_key")
  .option(PARTITIONPATH_FIELD.key(), "partition_index_date")
  .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.TABLE_TYPE.key(), DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
  .option("hoodie.combine.before.upsert", "true")
  .option("hoodie.datasource.write.payload.class", "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload")
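Since the question is whether old file slices (including the compacted-away logs) ever become eligible for cleaning, the standard cleaner knobs are worth checking alongside the options above. A hedged sketch in the same writer style (the values shown are illustrative examples, not recommendations for this table):

```
// Illustrative cleaner settings — values are examples, not recommendations.
.option("hoodie.clean.automatic", "true")               // run cleaning inline as a table service
.option("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS") // Hudi's default retention policy
.option("hoodie.cleaner.commits.retained", "10")        // file slices older than the last 10 commits become cleanable
```

With KEEP_LATEST_COMMITS, log files belonging to file slices older than the retained commit window should be deleted by subsequent clean actions; if they survive many cleans, that points at the rollback issue described above rather than at retention settings.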