apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store
https://apple.github.io/foundationdb/
Apache License 2.0

Batch GRV Rate Limit Exceeded is not always thrown #11500

Open ScottDugas opened 1 month ago

ScottDugas commented 1 month ago

The tests noted in https://github.com/FoundationDB/fdb-record-layer/issues/2813 will occasionally run forever due to this code: https://github.com/FoundationDB/fdb-record-layer/blob/200ac05041a1af712f621a27b4c5c37f9eab001c/fdb-record-layer-core/src/main/java/com/apple/foundationdb/record/provider/foundationdb/storestate/FDBRecordStoreStateCacheEntry.java#L97-L100

That code combines two futures. The first, recordStore.loadRecordStoreStateAsync, does a regular read. The second does a snapshot get of SystemKeyspace.METADATA_VERSION_KEY.

The first future fails with Batch GRV request rate limit exceeded (code 1051). The second future never completes.
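To illustrate why the combined future can hang rather than fail, here is a minimal sketch using plain java.util.concurrent, not the Record Layer code itself. It assumes the combination behaves like CompletableFuture.thenCombine, which only completes once both inputs have completed, even if one of them has already failed. The class name, helper methods, and exception type are hypothetical stand-ins, and orTimeout requires Java 9 or later.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;

public class CombineHangDemo {
    // Hypothetical stand-in for the regular read that fails with error 1051.
    static CompletableFuture<String> failingRead() {
        CompletableFuture<String> f = new CompletableFuture<>();
        f.completeExceptionally(
            new IllegalStateException("Batch GRV request rate limit exceeded (1051)"));
        return f;
    }

    // Hypothetical stand-in for the snapshot read that never completes.
    static CompletableFuture<String> hangingRead() {
        return new CompletableFuture<>();
    }

    // thenCombine waits for both inputs, so the failure of the first
    // future is not surfaced while the second is still pending.
    static boolean combinedIsDone() {
        return failingRead()
            .thenCombine(hangingRead(), (a, b) -> a + b)
            .isDone();
    }

    // Bounding the hanging future with a timeout lets the combination
    // complete exceptionally instead of waiting forever.
    static Throwable combinedWithTimeout(long millis) {
        CompletableFuture<String> combined = failingRead().thenCombine(
            hangingRead().orTimeout(millis, TimeUnit.MILLISECONDS),
            (a, b) -> a + b);
        try {
            combined.join();
            return null; // not expected: both inputs fail
        } catch (CompletionException e) {
            return e.getCause();
        }
    }

    public static void main(String[] args) {
        System.out.println("combined done without timeout: " + combinedIsDone());
        System.out.println("failure with timeout: " + combinedWithTimeout(100));
    }
}
```

A timeout like this only masks the symptom in the caller. It does not explain why the second read never completes.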

I have tried to reproduce this in a more isolated environment, but it is proving tricky to get it to reliably start failing with Batch GRV request rate limit exceeded.

ScottDugas commented 1 month ago

OK, I created a reasonable reproduction at: https://github.com/FoundationDB/fdb-record-layer/pull/2823/files About half the time, only the reads of SystemKeyspace.METADATA_VERSION_KEY fail with timeouts, while the other operations fail with Batch GRV request rate limit exceeded. The rest of the time, all the operations fail with timeouts.

PierreZ commented 4 weeks ago

I think we have stumbled across this issue in simulation. We have a very basic fork of RL in Rust that we can simulate as an external workload. This morning we found a specific seed (5267156628) that fails the same way: transaction.get_metadata_version hangs.

FoundationDB 7.3 (v7.3.43)
source version 412531b5c97fa84343da94888cc949a4d29e8c29
protocol fdb00b073000000

Our testfile looks like this:

[[test]]
testTitle = 'QuotaWorkload'

[[test.workload]]
testName = 'External'
libraryName = 'ldb'
workloadName = 'QuotaWorkload'
libraryPath = './target/release'
iteration_count = 50

[[test.workload]]
testName = 'RandomClogging'
testDuration = 30.0
swizzle = 1

[[test.workload]]
testName = 'Attrition'
machinesToKill = 10
machinesToLeave = 3
reboot = true
testDuration = 30.0

[[test.workload]]
testName = 'Rollback'
testDuration = 30

[[test.workload]]
testName = 'ChangeConfig'
maxDelayBeforeChange = 30.0
coordinators = 'auto'

Let me know if we can help :smile: