Open hgudladona opened 2 days ago
@hgudladona This should already be fixed by https://github.com/apache/hudi/pull/9879. Can you try with this patch?
This patch requires us to migrate to the 1.x beta release, which we are not ready to do yet. Any chance this can be backported to 0.14.x? Also, can you kindly explain how this remediates our situation? Are file groups outside of the active timeline treated as uncommitted with this patch?
@ad1happy2go This patch will still not solve this problem. If you follow the code path, getLatestFileSlicesBeforeOrOn filters the file slices using getLatestFileSliceFilteringUncommittedFiles, which in turn filters using filterUncommittedFiles. Looking into this function:
```java
private Stream<FileSlice> filterUncommittedFiles(FileSlice fileSlice, boolean includeEmptyFileSlice) {
  Option<HoodieBaseFile> committedBaseFile =
      fileSlice.getBaseFile().isPresent() && completionTimeQueryView.isCompleted(fileSlice.getBaseInstantTime())
          ? fileSlice.getBaseFile() : Option.empty();
  List<HoodieLogFile> committedLogFiles = fileSlice.getLogFiles()
      .filter(logFile -> completionTimeQueryView.isCompleted(logFile.getDeltaCommitTime()))
      .collect(Collectors.toList());
  if ((fileSlice.getBaseFile().isPresent() && !committedBaseFile.isPresent())
      || committedLogFiles.size() != fileSlice.getLogFiles().count()) {
    LOG.debug("File Slice (" + fileSlice + ") has uncommitted files.");
    // A file is filtered out of the file-slice if the corresponding
    // instant has not completed yet.
    FileSlice transformed = new FileSlice(fileSlice.getPartitionPath(), fileSlice.getBaseInstantTime(), fileSlice.getFileId());
    committedBaseFile.ifPresent(transformed::setBaseFile);
    committedLogFiles.forEach(transformed::addLogFile);
    if (transformed.isEmpty() && !includeEmptyFileSlice) {
      return Stream.of();
    }
    return Stream.of(transformed);
  }
  return Stream.of(fileSlice);
}
```
...
```java
public boolean isCompleted(String instantTime) {
  return this.startToCompletionInstantTimeMap.containsKey(instantTime)
      || HoodieTimeline.compareTimestamps(instantTime, LESSER_THAN, this.firstNonSavepointCommit);
}
```
If a file slice's base instant time is less than firstNonSavepointCommit, it is treated as completed even though it is not in the active timeline, which is essentially the current behavior. Kindly go through the scenario I mentioned one more time and confirm whether this is the right patch.
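To make this concrete, here is a minimal, self-contained sketch (plain Java, not the actual Hudi classes; the timestamps and field values are made up) of how that check behaves for an instant older than firstNonSavepointCommit:

```java
// Minimal illustration of the semantics described above. Instant times are
// lexicographically ordered timestamp strings, so any instant that predates
// firstNonSavepointCommit is reported as "completed", even when it never
// actually finished and has simply fallen out of the active timeline.
import java.util.Map;

public class CompletionCheckSketch {

  // Hypothetical stand-ins for the fields used by isCompleted(...)
  static Map<String, String> startToCompletionInstantTimeMap =
      Map.of("20240105120000000", "20240105120000999");
  static String firstNonSavepointCommit = "20240105120000000";

  static boolean isCompleted(String instantTime) {
    return startToCompletionInstantTimeMap.containsKey(instantTime)
        // lexicographic "less than", mirroring compareTimestamps(..., LESSER_THAN, ...)
        || instantTime.compareTo(firstNonSavepointCommit) < 0;
  }

  public static void main(String[] args) {
    // Tracked in the completion map -> completed, as expected.
    System.out.println(isCompleted("20240105120000000")); // true
    // Older than the first non-savepoint commit: also reported as completed,
    // even if the instant was uncommitted and has since been archived.
    System.out.println(isCompleted("20240101000000000")); // true
  }
}
```

So an uncommitted instant that has aged out of the active timeline still passes the isCompleted check, which is why the scenario described below does not change with this patch.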
@nsivabalan could you please help with this?
Describe the problem you faced
Intermittent java.util.NoSuchElementException when writing to partitions that arrive out of order and are not covered by the active timeline.
To Reproduce
We have a Hudi job reading from Kafka and writing to S3, with partitions dynamically derived from certain columns in the records in the format tenant=xxxxx/date=YYYYMMDD. In certain situations, when the partition the new data is written into is not in the active timeline (late-arriving data), there seems to be a mismatch between the file group decided in the "Getting small files from partitions" stage and the one used in the "Doing partition and writing data" stage.
Let's say a file group with id 'eef3ab7f-dc8a-40ec-856f-99010184d9f1-1' is decided as a small file in the "Getting small files from partitions" stage and passed on to the "Doing partition and writing data" stage to INSERT new data and create a new base file for it. That stage then fails with the exception below and brings down the streamer job.
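To illustrate the kind of mismatch we suspect, here is a simplified, hypothetical sketch (plain Java, not the actual Hudi code path): one stage picks a small-file file group from its view of the partition, but the other stage rebuilds its view with different filtering and no longer contains that file group, so the lookup fails.

```java
// Hypothetical illustration of two stages disagreeing about the same partition.
import java.util.List;
import java.util.NoSuchElementException;

public class SmallFileMismatchSketch {

  public static void main(String[] args) {
    // View used by "Getting small files from partitions": includes the file group.
    List<String> planningView = List.of("eef3ab7f-dc8a-40ec-856f-99010184d9f1-1");

    // View used by "Doing partition and writing data": the same file group has
    // been filtered out because its instant is no longer in the active timeline.
    List<String> writeView = List.of();

    String smallFileGroup = planningView.get(0);

    // The write stage expects to find the file group it was handed; when the
    // two views disagree, this lookup throws NoSuchElementException.
    String resolved = writeView.stream()
        .filter(fg -> fg.equals(smallFileGroup))
        .findFirst()
        .orElseThrow(NoSuchElementException::new);
    System.out.println(resolved);
  }
}
```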
However, the streamer job succeeds in two situations:
Expected behavior
We expect that there is no mismatch between the views of the "Getting small files from partitions" and "Doing partition and writing data" stages in cases where we are writing to a partition that is not actively tracked in the active timeline.
Environment Description
Hudi version : 0.14.1
Spark version : 3.4.x
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : yes
Additional context
auto.offset.reset: latest
bootstrap.servers: kafka-brokers
group.id: hudi-ingest-some-group
hoodie.archive.async: true
hoodie.archive.automatic: true
hoodie.auto.adjust.lock.configs: true
hoodie.base.path: s3a://some-base-path
hoodie.clean.async: true
hoodie.cleaner.hours.retained: 36
hoodie.cleaner.parallelism: 600
hoodie.cleaner.policy: KEEP_LATEST_BY_HOURS
hoodie.cleaner.policy.failed.writes: LAZY
hoodie.clustering.async.enabled: false
hoodie.combine.before.insert: false
hoodie.copyonwrite.insert.auto.split: false
hoodie.datasource.fetch.table.enable: true
hoodie.datasource.hive_sync.database: hudi_events_v1
hoodie.datasource.hive_sync.mode: hms
hoodie.datasource.hive_sync.partition_extractor_class: org.apache.hudi.hive.MultiPartKeysValueExtractor
hoodie.datasource.hive_sync.partition_fields: tenant,date
hoodie.datasource.hive_sync.table: some-table
hoodie.datasource.hive_sync.table_properties: projection.date.type=date|projection.date.format=yyyyMMdd|projection.date.range=19700101,99990101|projection.tenant.type=integer|projection.tenant.range=-1,8675309|projection.enabled=true
hoodie.datasource.meta_sync.condition.sync: true
hoodie.datasource.sync_tool.single_instance: true
hoodie.datasource.write.hive_style_partitioning: true
hoodie.datasource.write.keygenerator.class: com.some-class-prefix.KeyGenerator
hoodie.datasource.write.operation: insert
hoodie.datasource.write.partitionpath.field: tenant:SIMPLE,date:SIMPLE
hoodie.datasource.write.precombine.field: event_time_usec
hoodie.datasource.write.reconcile.schema: false
hoodie.datasource.write.recordkey.field: resource_id
hoodie.deltastreamer.kafka.source.maxEvents: 75000000
hoodie.deltastreamer.schemaprovider.registry.url: http://schema-registry.some-suffix:8085
hoodie.deltastreamer.source.kafka.enable.commit.offset: true
hoodie.deltastreamer.source.kafka.topic: some-topic
hoodie.deltastreamer.source.schema.subject: some-topic-value
hoodie.fail.on.timeline.archiving: false
hoodie.filesystem.view.incr.timeline.sync.enable: true
hoodie.filesystem.view.remote.timeout.secs: 2
hoodie.insert.shuffle.parallelism: 1600
hoodie.memory.merge.max.size: 2147483648
hoodie.metadata.enable: false
hoodie.metrics.on: true
hoodie.metrics.reporter.metricsname.prefix:
hoodie.metrics.reporter.prefix.tablename: false
hoodie.metrics.reporter.type: DATADOG
hoodie.parquet.compression.codec: zstd
hoodie.streamer.source.kafka.minPartitions: 450
hoodie.table.name: <>
hoodie.table.partition.fields: tenant,date
hoodie.table.type: MERGE_ON_READ
hoodie.write.concurrency.mode: OPTIMISTIC_CONCURRENCY_CONTROL
hoodie.write.lock.dynamodb.billing_mode: PROVISIONED
hoodie.write.lock.dynamodb.endpoint_url: https://dynamodb.us-east-2.amazonaws.com/
hoodie.write.lock.dynamodb.partition_key: some-key
hoodie.write.lock.dynamodb.region: us-east-2
hoodie.write.lock.dynamodb.table: HudiLocker
hoodie.write.lock.provider: org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
hoodie.write.markers.type: DIRECT
Additional logs
Stacktrace