apache / kafka

Mirror of Apache Kafka
Apache License 2.0
28.78k stars 13.95k forks source link

KAFKA-17515: Fix flaky RestoreIntegrationTest.shouldInvokeUserDefinedGlobalStateRestoreListener #17187

Closed chenyulin0719 closed 1 month ago

chenyulin0719 commented 1 month ago

It's regarding KAFKA-17515. Found two issues in the flaky tests: (Put the log analysis under Jira comments.)

  1. The error "java.nio.file.DirectoryNotEmptyException" occurs if the flush() of kafkaStreams.close() and purgeLocalStreamsState() are triggered in the same time. (The current timeout is 5 sec, which is too short since the CI is unstable and slow).
  2. Racing issue: Task to-be restored in ks-1 are rebalanced to ks-2 before entering active restoring state. So no onRestoreSuspend() was triggered.

To solve the issues:

  1. Remove the timeout in kafkaStreams.close()
  2. Ensure all tasks in ks-1 are active restoring before start second KafkaStreams(ks-2)

Committer Checklist (excluded from commit message)