apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[HUDI-8394] Restrict multiple bulk inserts into COW with simple bucket and disabled Spark native Row #12245

Open geserdugarov opened 1 week ago

geserdugarov commented 1 week ago

Change Logs

In the case of a COW table with a simple bucket index and the Spark native Row writer disabled, we could perform bulk insert multiple times. Only the first insert produces parquet files; subsequent inserts produce log files, despite the table type being COW. To prevent this, a restriction on calling AppendHandleFactory for COW tables is added.
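The guard described above can be sketched roughly as follows. This is a simplified illustration, not Hudi's actual code: `TableType`, `HandleFactory`, and `pickFactory` are hypothetical stand-ins for the real write-handle selection logic.

```java
// Hedged sketch of the fix: refuse to hand out an append (log-file) handle
// for a COW table. All types here are simplified stand-ins, not Hudi's API.
enum TableType { COPY_ON_WRITE, MERGE_ON_READ }

interface HandleFactory {}
class CreateHandleFactory implements HandleFactory {}
class AppendHandleFactory implements HandleFactory {}

class BulkInsertHandles {
    // AppendHandleFactory produces log files, which only make sense for MOR tables.
    static HandleFactory pickFactory(TableType type, boolean fileGroupExists) {
        if (fileGroupExists && type == TableType.MERGE_ON_READ) {
            return new AppendHandleFactory();
        }
        if (fileGroupExists && type == TableType.COPY_ON_WRITE) {
            // The restriction added by this PR: a second bulk insert into an
            // existing file group must not silently write log files on COW.
            throw new UnsupportedOperationException(
                "Appending to an existing file group is not supported for COW tables");
        }
        return new CreateHandleFactory();
    }
}
```

With this check in place, the second bulk insert fails fast instead of leaving a COW table with unreadable log files.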

Full discussion is available in https://github.com/apache/hudi/issues/12133.

Impact

No

Risk level (write none, low medium or high below)

Low

Documentation Update

No need

Contributor's checklist

geserdugarov commented 6 days ago

CI is broken on current master. Some test cases are flaky, but the failure in testSecondaryIndexWithClusteringAndCleaning looks reproducible. I checked it here: https://github.com/apache/hudi/pull/12264

geserdugarov commented 2 hours ago

Got a CI failure in tests not affected by this change:

[ERROR] Tests run: 8, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 71.934 s <<< FAILURE! - in org.apache.hudi.functional.TestStructuredStreaming
[ERROR] testStructuredStreamingWithClustering{boolean}[1]  Time elapsed: 11.999 s  <<< ERROR!
java.util.NoSuchElementException: No value present in Option
    at org.apache.hudi.common.util.Option.get(Option.java:93)
    at org.apache.hudi.common.table.HoodieTableMetaClient.lambda$new$0(HoodieTableMetaClient.java:180)
    at org.apache.hudi.common.util.Option.orElseGet(Option.java:153)
    at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:180)
    at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:791)
    at org.apache.hudi.common.table.HoodieTableMetaClient.access$100(HoodieTableMetaClient.java:106)
    at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:899)
    at org.apache.hudi.HoodieDataSourceHelpers.allCompletedCommitsCompactions(HoodieDataSourceHelpers.java:126)
    at org.apache.hudi.functional.TestStructuredStreaming.waitTillAtleastNCommits(TestStructuredStreaming.scala:225)
    at org.apache.hudi.functional.TestStructuredStreaming.$anonfun$structuredStreamingForTestClusteringRunner$1(TestStructuredStreaming.scala:409)
- Test Secondary Index With Updates Compaction Clustering Deletes *** FAILED ***
  org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
  ...
  at org.apache.spark.sql.hudi.command.index.TestSecondaryIndex.validateSecondaryIndex(TestSecondaryIndex.scala:370)

I will try to rebase and restart CI.

hudi-bot commented 1 hour ago

CI report:

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
geserdugarov commented 12 minutes ago

After a second CI run, I still got a failure unrelated to this PR:

- Test Secondary Index With Updates Compaction Clustering Deletes *** FAILED ***
  org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
 ...
  at org.apache.spark.sql.hudi.command.index.TestSecondaryIndex.validateSecondaryIndex(TestSecondaryIndex.scala:373)

Couldn't reproduce this issue locally.