Open jeqo opened 2 months ago
Sorry, I'm failing to understand what kind of race condition you are talking about? Could you please clarify?
@AnatolyPopov ofc, sorry it wasn't explained properly. I have added more details to the description. This PR at least tries to fix one of the (now) known causes of flaky test failures:
I wonder why this can happen at all. If I understand correctly, this basically means that the listener runs multiple times for a specific (key, value) pair. Or is it a tests-only thing, where the test itself cleans up the file?
@AnatolyPopov I have refactored the test to use time-based eviction and get more consistent results (before, it tested whether either value 1 or 2 was deleted; now it tests whether 1, 2, or both are deleted).
I have separated out the exception handling for the missing file: it's nice to have, but it doesn't fix the flakiness completely. The refactoring of the test is what tries to fix the flakiness. These are two separate commits now. PTAL
Finally, some additional evidence that this test is flaky:
Also, the same test but for the memory based cache is failing on main: https://github.com/Aiven-Open/tiered-storage-for-apache-kafka/actions/runs/10283516010/job/28457507975
Managed to reproduce locally with @RepeatedTest(1000):
[2024-08-08 20:20:35,517] INFO CacheConfig values:
retention.ms = -1
size = 18
(io.aiven.kafka.tieredstorage.config.CacheConfig:370)
Condition with Lambda expression in io.aiven.kafka.tieredstorage.fetch.index.MemorySegmentIndexesCacheTest$CacheTests was not fulfilled within 30 seconds.
org.awaitility.core.ConditionTimeoutException: Condition with Lambda expression in io.aiven.kafka.tieredstorage.fetch.index.MemorySegmentIndexesCacheTest$CacheTests was not fulfilled within 30 seconds.
at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
at org.awaitility.core.CallableCondition.await(CallableCondition.java:78)
at org.awaitility.core.CallableCondition.await(CallableCondition.java:26)
at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:1006)
at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:975)
at io.aiven.kafka.tieredstorage.fetch.index.MemorySegmentIndexesCacheTest$CacheTests.sizeBasedEviction(MemorySegmentIndexesCacheTest.java:262)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
at java.base/java.util.stream.IntPipeline$1$1.accept(IntPipeline.java:180)
at java.base/java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:104)
at java.base/java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:711)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:276)
at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
Where condition is:
await()
.atMost(Duration.ofSeconds(30))
.pollDelay(Duration.ofSeconds(2))
.pollInterval(Duration.ofMillis(10))
.until(() -> !mockingDetails(removalListener).getInvocations().isEmpty());
Cache removal listener-related tests (DiskChunkCacheMetricsTest and MemorySegmentIndexesCacheTest) are flaky. Recent evidence:
To reproduce this locally, @RepeatedTest(10000) has been used. The failure is caused by the timeout condition when waiting for a cache entry to be removed:
Waiting for the RemovalListener to be called just after inserting a couple of entries does not seem to be deterministic, and a retention.ms time boundary is needed to get the removal called within the time frame of the test (default retention.ms = 10min).
As a separate finding, while running this locally, I spotted a file-not-found exception after some thousands of runs:
There seem to be multiple calls to this listener happening concurrently, causing this behavior (the first caller wins, and the next one does not find the file), so additional handling has been added. At runtime this exception is swallowed by the listener execution, so this is mostly to have better logging when it happens.
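As a rough illustration of the kind of handling described above (not the actual code in this PR), a removal callback can treat an already-deleted file as a non-fatal, logged event. The class name `TolerantFileRemoval` and the method `deleteCachedFile` are hypothetical, and the sketch uses only the JDK, so the wiring into the cache's listener is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical helper for illustration only; names are not taken from the PR.
final class TolerantFileRemoval {
    private TolerantFileRemoval() {
    }

    /**
     * Deletes the file backing a cache entry.
     * Returns true if this call removed the file, false if it was already gone
     * (for example because a concurrent listener invocation deleted it first).
     */
    static boolean deleteCachedFile(final Path path) {
        try {
            Files.delete(path);
            return true;
        } catch (final NoSuchFileException e) {
            // A previous caller won the race: log it instead of failing.
            System.err.println("Cached file already removed: " + path);
            return false;
        } catch (final IOException e) {
            // Any other I/O problem is still surfaced to the caller.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String[] args) throws IOException {
        final Path file = Files.createTempFile("cached-chunk", ".tmp");
        System.out.println("first call removed file: " + deleteCachedFile(file));
        System.out.println("second call removed file: " + deleteCachedFile(file));
    }
}
```

`Files.deleteIfExists` would give a similar result without the catch block; the explicit `NoSuchFileException` branch is used here only because it leaves a natural place for the log message.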
This seems to be expected looking at the Caffeine docs:
The RemovalListener states:
Also