Open ScottDugas opened 1 month ago
ok, I created a reasonable reproduction at: https://github.com/FoundationDB/fdb-record-layer/pull/2823/files
About half the time, it fails with timeouts only for the reads of SystemKeyspace.METADATA_VERSION_KEY, and with "Batch GRV request rate limit exceeded" for the other operations.
The other times, it fails with timeouts for all the operations.
I think we have stumbled across this issue in simulation. We have a very basic fork of the RL in Rust that we can simulate as an external workload. This morning we found a specific seed (5267156628) that fails the same way: transaction.get_metadata_version hangs.
FoundationDB 7.3 (v7.3.43)
source version 412531b5c97fa84343da94888cc949a4d29e8c29
protocol fdb00b073000000
Our testfile looks like this:
[[test]]
testTitle = 'QuotaWorkload'
[[test.workload]]
testName = 'External'
libraryName = 'ldb'
workloadName = 'QuotaWorkload'
libraryPath = './target/release'
iteration_count = 50
[[test.workload]]
testName = 'RandomClogging'
testDuration = 30.0
swizzle = 1
[[test.workload]]
testName = 'Attrition'
machinesToKill = 10
machinesToLeave = 3
reboot = true
testDuration = 30.0
[[test.workload]]
testName = 'Rollback'
testDuration = 30
[[test.workload]]
testName = 'ChangeConfig'
maxDelayBeforeChange = 30.0
coordinators = 'auto'
Let me know if we can help :smile:
The tests noted in https://github.com/FoundationDB/fdb-record-layer/issues/2813 will occasionally run forever due to this code: https://github.com/FoundationDB/fdb-record-layer/blob/200ac05041a1af712f621a27b4c5c37f9eab001c/fdb-record-layer-core/src/main/java/com/apple/foundationdb/record/provider/foundationdb/storestate/FDBRecordStoreStateCacheEntry.java#L97-L100
That code combines two futures. The first one, recordStore.loadRecordStoreStateAsync, does a regular read. The second one does a snapshot get of SystemKeyspace.METADATA_VERSION_KEY. The first future fails with "Batch GRV request rate limit exceeded" (code 1051). The second future never completes.
I have tried to reproduce this in a more isolated environment, but it is proving tricky to get it to reliably start failing with "Batch GRV request rate limit exceeded".
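To illustrate why one failed future plus one stuck future can make a test run forever, here is a minimal, FoundationDB-free sketch. It assumes the two futures are joined with something like CompletableFuture.thenCombine; that is an assumption about the linked lines, not a copy of them. The class name CombineHangDemo, the helpers outcome and combine, and the placeholder string values are hypothetical and exist only for this illustration. It requires Java 9+ for failedFuture and orTimeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; not part of the Record Layer.
class CombineHangDemo {

    // Waits up to waitMillis and reports how the future ended.
    static String outcome(CompletableFuture<?> future, long waitMillis) throws InterruptedException {
        try {
            future.get(waitMillis, TimeUnit.MILLISECONDS);
            return "completed";
        } catch (ExecutionException e) {
            return "failed: " + e.getCause().getMessage();
        } catch (TimeoutException e) {
            return "still pending";
        }
    }

    // Stand-in for combining the store-state read with the metadata-version read.
    // thenCombine only fires once BOTH inputs have completed, normally or exceptionally.
    static CompletableFuture<String> combine(CompletableFuture<String> storeState,
                                             CompletableFuture<String> metaDataVersion) {
        return storeState.thenCombine(metaDataVersion, (state, version) -> state + "@" + version);
    }

    public static void main(String[] args) throws Exception {
        // First future: already failed, standing in for the regular read that hits error 1051.
        CompletableFuture<String> storeState = CompletableFuture.failedFuture(
                new RuntimeException("Batch GRV request rate limit exceeded"));

        // Second future: never completed, standing in for the stuck snapshot read.
        CompletableFuture<String> stuckRead = new CompletableFuture<>();
        System.out.println(outcome(combine(storeState, stuckRead), 200));

        // Possible guard (an assumption, not the project's fix): bound the second future
        // so the combined future can surface the first future's error.
        CompletableFuture<String> boundedRead =
                new CompletableFuture<String>().orTimeout(100, TimeUnit.MILLISECONDS);
        System.out.println(outcome(combine(storeState, boundedRead), 2000));
    }
}
```

Under these assumptions, the first combination is still pending after the wait even though one input has already failed, which matches the "runs forever" symptom. Once the second future is bounded, the combined future reports the first future's error instead. This only shows how the hang can propagate on the client side; it does not explain why the read of the metadata version key never completes.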