eg-kazakov opened 10 months ago
You are right, the Spark incremental query does not filter out these replace commits, while Flink supports it; in Flink we have the SQL option read.streaming.skip_clustering.
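For reference, the Spark incremental read in question, which currently picks up these replace commits, looks roughly like the sketch below; the base path and instant times are placeholders, spark is assumed to be an active SparkSession, and the option keys are the standard Hudi datasource read options.
// Sketch of a Spark incremental read; base path and instant times are placeholders.
val incrementalDF = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20231208000000000")
  .option("hoodie.datasource.read.end.instanttime", "20231208235959999")   // optional upper bound
  .load("s3://bucket/path/to/hudi_table")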
@danny0405 Is there any workaround for this? Right now it is blocking us from loading the Hudi metadata. Should we do some cleanup beforehand to filter out the replace commits?
I think these replacecommits appear when we perform a hard partition delete by writing empty datasets in place of the existing partitions.
deletePartitionDF.write.format("hudi")
  .option(HoodieWriteConfig.TBL_NAME.key, HUDI_TABLE_NAME)
  .option(OPERATION.key, DataSourceWriteOptions.DELETE_PARTITION_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.PARTITIONS_TO_DELETE.key, partitionUrls)
  .option(TABLE_TYPE.key, DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
  .option(RECORDKEY_FIELD.key, HUDI_RECORD_KEY)
  .option("hoodie.cleaner.policy", HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS.name())
  .option("hoodie.cleaner.fileversions.retained", "1")
  .option(HIVE_STYLE_PARTITIONING.key, "true")
  .mode(SaveMode.Append)
  .save(destUrl)
We are planning to migrate all our incremental queries to be based on the completion time instead of the instant time (start time), and we would support skipping clustering and compaction for Spark at the same time; hopefully we can complete that in release 1.0.0.
cc @beyond1920 for visibility.
@eg-kazakov Hi, is the replacement commit generated by clustering or insert overwrite? We could skip it if it's generated by clustering, but we could not skip it if it's generated by insert overwrite.
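One rough way to tell is to deserialize the replacecommit metadata from the timeline and look at its operation type. This is only a sketch against the 0.x API as I understand it, with a placeholder base path and an assumed active SparkSession spark:
// Sketch: print the operation type of each completed replacecommit
// (e.g. CLUSTER vs INSERT_OVERWRITE vs DELETE_PARTITION). The timeline and
// metadata calls below follow my understanding of the 0.x API.
import org.apache.hudi.common.model.HoodieCommitMetadata
import org.apache.hudi.common.table.HoodieTableMetaClient
import scala.collection.JavaConverters._

val metaClient = HoodieTableMetaClient.builder()
  .setConf(spark.sparkContext.hadoopConfiguration)
  .setBasePath("s3://bucket/path/to/hudi_table")   // placeholder
  .build()

val replaceTimeline = metaClient.getActiveTimeline.getCompletedReplaceTimeline
replaceTimeline.getInstants.iterator().asScala.foreach { instant =>
  val meta = HoodieCommitMetadata.fromBytes(
    replaceTimeline.getInstantDetails(instant).get(), classOf[HoodieCommitMetadata])
  println(s"${instant.getTimestamp} -> ${meta.getOperationType}")
}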
I believe it is insert overwrite. In my case I've swapped the order of the partition deletion (by empty dataset overwrite) and the regular save operation, so my timeline looks like:
2023-12-08 13:48:12 256486 20231208064747732.replacecommit
2023-12-08 13:47:59 123 20231208064747732.replacecommit.inflight
2023-12-08 13:47:51 2522 20231208064747732.replacecommit.requested
2023-12-08 13:56:59 2497730 20231208064811611.clean
2023-12-08 13:52:18 2531952 20231208064811611.clean.inflight
2023-12-08 13:52:18 2531952 20231208064811611.clean.requested
That does the trick for the loading phase, since I no longer have a situation where a replace commit is the last operation before loading.
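To double-check that the reordering helps, here is a small sketch (same 0.x meta client assumptions and placeholder path as elsewhere in this thread) that prints the action of the last completed instant:
// Sketch: verify that the last completed instant is not a replacecommit
// before loading. 0.x API assumed; base path is a placeholder.
import org.apache.hudi.common.table.HoodieTableMetaClient

val client = HoodieTableMetaClient.builder()
  .setConf(spark.sparkContext.hadoopConfiguration)
  .setBasePath("s3://bucket/path/to/hudi_table")   // placeholder
  .build()

val lastInstant = client.getActiveTimeline.filterCompletedInstants().lastInstant()
if (lastInstant.isPresent) {
  println(s"last completed instant: ${lastInstant.get().getTimestamp}, action: ${lastInstant.get().getAction}")
}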
I believe it's because in some Hudi versions between 0.12 and 0.14 Hudi started to write the schema in the replacecommit, and this schema for some reason includes the service fields, while a normal commit does not include them. So when Hudi tries to get the schema, it loads it from the latest commit; if there is no schema there (prior to 0.14 a replacecommit had an empty string in the schema field), it falls back to loading the schema from some data file, which is correct.
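A quick way to confirm this on a concrete table is to look at the newest completed replacecommit file directly. This is only a rough sketch: the base path is a placeholder, spark is an assumed active SparkSession, and the string check is a heuristic rather than real parsing of the commit metadata.
// Sketch: read the newest completed replacecommit from .hoodie and check
// whether the schema it carries mentions the Hudi meta fields.
import org.apache.hadoop.fs.Path

val basePath = "s3://bucket/path/to/hudi_table"   // placeholder
val hoodieDir = new Path(basePath, ".hoodie")
val fs = hoodieDir.getFileSystem(spark.sparkContext.hadoopConfiguration)

val latestReplaceCommit = fs.listStatus(hoodieDir)
  .map(_.getPath)
  .filter(_.getName.endsWith(".replacecommit"))   // completed instants only
  .sortBy(_.getName)
  .lastOption

latestReplaceCommit.foreach { p =>
  val in = fs.open(p)
  val json = try scala.io.Source.fromInputStream(in, "UTF-8").mkString finally in.close()
  println(s"${p.getName} schema mentions meta fields: ${json.contains("_hoodie_commit_time")}")
}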
@codope it looks to be the same as https://github.com/apache/hudi/issues/10533. The problem seems to be that Hudi is inconsistently writing the schema in commits: in some cases it is written with the metadata fields, in some it is not. E.g. a replacecommit seems to record the schema with metadata fields (after a delete_partitions operation at least), while a normal write does not.
I think it was introduced here https://github.com/apache/hudi/pull/5610/files, so there are 2 differences: HoodieSchemaUtils.getLatestTableInternalSchema(hoodieConfig, tableMetaClient) and the skipping of internal fields. I see it's been changed for the normal delete here https://github.com/apache/hudi/pull/9743/files, but not for delete_partitions.
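Until a fix lands, one speculative and untested idea is to make sure the DataFrame used for the delete_partitions write carries no meta columns at all, so the schema recorded in the replacecommit cannot include them. The sketch below assumes HoodieRecord.HOODIE_META_COLUMNS lists the five meta column names and reuses deletePartitionDF from the snippet earlier in the thread:
// Speculative mitigation sketch, not a verified fix: drop the _hoodie_* meta
// columns from the input frame before the delete_partitions write.
import org.apache.hudi.common.model.HoodieRecord
import scala.collection.JavaConverters._

val metaColumns = HoodieRecord.HOODIE_META_COLUMNS.asScala.toSeq
val cleanedDF = deletePartitionDF.drop(metaColumns: _*)
// ...then use cleanedDF in place of deletePartitionDF for the write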
Nice findings ~ Would you mind firing a fix?
@VitoMakarevich Thanks for debugging and putting up a fix. I will review the patch.
Describe the problem you faced
When Spark tries to load a Hudi dataframe from the datasource with the following Hudi timeline setup:
I am getting: org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: _hoodie_commit_seqno, _hoodie_commit_time, _hoodie_file_name, _hoodie_partition_path, _hoodie_record_key
and it fails to load the existing data.
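For illustration, the failing load is nothing more exotic than a plain snapshot read; the base path below is a placeholder and spark is an active SparkSession.
// Minimal sketch of the read that fails with the duplicate-column error.
val df = spark.read.format("hudi").load("s3://bucket/path/to/hudi_table")   // placeholder path
df.printSchema()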
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I expect the data to be loaded from the Hudi table without failures.
Environment Description
Additional context
During debugging I've noticed that the columns got duplicated: