apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store
https://apple.github.io/foundationdb/
Apache License 2.0

MutationLogReaderCorrectness nightly failure - QuietDatabase failure #7276

Closed sfc-gh-bvr closed 2 years ago

sfc-gh-bvr commented 2 years ago

bin/fdbserver -r simulation -f tests/fast/MutationLogReaderCorrectness.toml -L sandbox/logs/ --trace_format json -b on -s 896089042

sfc-gh-bvr commented 2 years ago

From initial investigation (Trevor and I, 5/12), it looks like we are continuing to perform data movements for too long during the final consistency check. The MaxStorageQueueSize, DataInFlight, and DataDistributionQueueSize gates are all failing repeatedly, and quietDatabase times out.
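
For context on how that failure manifests: a quiet-database check of this kind repeatedly polls a set of gates and only succeeds once they all pass, failing if they do not converge within the allowance. Below is a minimal standalone sketch of that pattern, not FDB's actual quietDatabase actor; the struct, metric values, and thresholds are made up for illustration.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

struct ClusterMetrics {
    long maxStorageQueueSize;
    long dataInFlight;
    long dataDistributionQueueSize;
};

// Stand-in for polling the cluster: in the failing runs, ongoing shard
// splits/moves keep DataInFlight and the DD queue non-zero for a long time.
ClusterMetrics pollMetrics(int iteration) {
    return ClusterMetrics{ 1000, iteration < 50 ? 5000000L : 0L, iteration < 50 ? 30L : 0L };
}

// Wait until every gate passes or a fixed deadline expires.
bool waitForQuiet(std::chrono::seconds limit) {
    auto deadline = std::chrono::steady_clock::now() + limit;
    for (int i = 0;; ++i) {
        ClusterMetrics m = pollMetrics(i);
        bool quiet = m.maxStorageQueueSize < 1000000 && m.dataInFlight == 0 &&
                     m.dataDistributionQueueSize == 0;
        if (quiet)
            return true;                                              // all gates passed
        if (std::chrono::steady_clock::now() > deadline)
            return false;                                             // fixed time limit hit
        std::this_thread::sleep_for(std::chrono::milliseconds(100));  // re-check the gates
    }
}

int main() {
    std::printf("quiet database reached: %d\n", waitForQuiet(std::chrono::seconds(10)));
}
```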

The unique thing about this particular workload (we think) is that it does a lot of sequential writes, which could cause a skewed behaviour not tested by other simulation workloads. When this test was run on its own in a Joshua ensemble, it failed 5/10k runs.

sfc-gh-bvr commented 2 years ago

From the RelocateShardDecision traces, the counts of movement by Pri:

Count  Pri
1      120
614    340
1      600
221    950

Nearly all movements happen at Pri 950 (PRIORITY_SPLIT_SHARD) and Pri 340 (PRIORITY_MERGE_SHARD).

sfc-gh-bvr commented 2 years ago

Given there are so many shard splits, I annotated ShardSplitter to look for patterns, and modified RelocateShardStartSplitx100 to log all the time. That yields the following, where the elided part is /blog/Ot87c7pea^I9fa481

88.168234 RelocateShardStartSplit Begin= End=ffff MaxBytes=160000 MetricsBytes=199495 Bandwidth=Low BytesPerKSec=4000000 NumShards=4
89.285237 RelocateShardStartSplit Begin=ff02...^869c1aa099 End=ffff MaxBytes=160000 MetricsBytes=332490 Bandwidth=Low BytesPerKSec=0 NumShards=7
89.601454 RelocateShardStartSplit Begin= End=ff02...^869c1aa0T MaxBytes=160000 MetricsBytes=199497 Bandwidth=Low BytesPerKSec=4000000 NumShards=4
89.657075 RelocateShardStartSplit Begin=ff02...^869c1aa0T End=ff02...^869c1aa098 MaxBytes=160000 MetricsBytes=166247 Bandwidth=Low BytesPerKSec=24000000 NumShards=4
90.330167 RelocateShardStartSplit Begin= End=ff02...^869c1aa012 MaxBytes=160000 MetricsBytes=166248 Bandwidth=Low BytesPerKSec=0 NumShards=3
90.350226 RelocateShardStartSplit Begin=ff02...^869c1aa0b3 End=ff02...^869c1aa0da MaxBytes=160000 MetricsBytes=199498 Bandwidth=Low BytesPerKSec=4000000 NumShards=4
92.045974 RelocateShardStartSplit Begin=ff02...^869c1aa0e5 End=ffff MaxBytes=160000 MetricsBytes=166248 Bandwidth=Low BytesPerKSec=0 NumShards=4
92.066450 RelocateShardStartSplit Begin=ff02...^869c1aa09e End=ff02...^869c1aa0b0 MaxBytes=160000 MetricsBytes=166247 Bandwidth=Low BytesPerKSec=4000000 NumShards=4
92.224304 RelocateShardStartSplit Begin=ff02...^869c1aa0f602 End=ffff MaxBytes=160000 MetricsBytes=166249 Bandwidth=Low BytesPerKSec=12000000 NumShards=4
92.699471 RelocateShardStartSplit Begin=ff02...^869c1aa012 End=ff02...^869c1aa0< MaxBytes=160000 MetricsBytes=166250 Bandwidth=Low BytesPerKSec=4000000 NumShards=4

Clearly, this is a sequential workload and splits are happening whenever the shard size exceeds 160k.

If this is a function of the test, then the question that follows is why it happens only for some seeds. One idea that comes to mind is that the max shard size may be getting modified.
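
To make the trace fields concrete, here is a minimal sketch of the split trigger being discussed. This is not the actual ShardSplitter code; the struct, function name, and sample values are stand-ins drawn from the trace fields above.

```cpp
#include <cstdio>

// Per-shard numbers as they appear in the trace fields above.
struct ShardMetrics {
    long metricsBytes;  // MetricsBytes: current size estimate of the shard
    long bytesPerKSec;  // BytesPerKSec: recent write bandwidth into the shard
};

// Simplest possible split trigger: split whenever the shard outgrows MaxBytes.
bool shouldSplit(const ShardMetrics& m, long maxShardBytes) {
    return m.metricsBytes > maxShardBytes;
}

int main() {
    ShardMetrics m{ 199495, 4000000 };  // values from the first trace line above
    std::printf("split at MaxBytes=160000: %d, split at MaxBytes=935180: %d\n",
                shouldSplit(m, 160000), shouldSplit(m, 935180));
}
```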

Indeed, when the seed is changed to 896089048 (buggify still on), the MaxBytes field is different, and there are far fewer splits. QuietDatabase does not fail.

102.649030 RelocateShardStartSplit Begin= End=ffff MaxBytes=935180 MetricsBytes=997484 Bandwidth=Low BytesPerKSec=64000000 NumShards=3
108.242321 RelocateShardStartSplit Begin=ff02/blog/K97,f4M(0dw7f0bx70a*~ad9e End=ffff MaxBytes=1072520 MetricsBytes=1097245 Bandwidth=Low BytesPerKSec=84000000 NumShards=2
109.779826 RelocateShardStartSplit Begin=ff02/blog/K97,f4M(0dw7f0bx70a*~adN End=ff02/blog/K97,f4M(0dw7f0bx70a*~ad9e MaxBytes=1133000 MetricsBytes=1230246 Bandwidth=Low BytesPerKSec=48000000 NumShards=2
109.875635 RelocateShardStartSplit Begin=ff02/blog/K97,f4M(0dw7f0bx70a*~adbc End=ffff MaxBytes=1158020 MetricsBytes=1230248 Bandwidth=Low BytesPerKSec=76000000 NumShards=3
110.129974 RelocateShardStartSplit Begin= End=ff02/blog/K97,f4M(0dw7f0bx70a*~adN MaxBytes=1184120 MetricsBytes=1230243 Bandwidth=Low BytesPerKSec=96000000 NumShards=2
114.820702 RelocateShardStartSplit Begin=ff02/blog/K97,f4M(0dw7f0bx70a*~adn End=ff02/blog/K97,f4M(0dw7f0bx70a*~ad9e MaxBytes=1305260 MetricsBytes=1363247 Bandwidth=Low BytesPerKSec=56000000 NumShards=2
114.883366 RelocateShardStartSplit Begin=ff02/blog/K97,f4M(0dw7f0bx70a*~ad1c End=ff02/blog/K97,f4M(0dw7f0bx70a*~adN MaxBytes=1325060 MetricsBytes=1363246 Bandwidth=Normal BytesPerKSec=116000000 NumShards=2
115.735597 RelocateShardStartSplit Begin=ff02/blog/K97,f4M(0dw7f0bx70a*~ad9e End=ff02/blog/K97,f4M(0dw7f0bx70a*~adbc MaxBytes=1374020 MetricsBytes=1429747 Bandwidth=Low BytesPerKSec=52000000 NumShards=2
117.500047 RelocateShardStartSplit Begin=ff02/blog/K97,f4M(0dw7f0bx70a*~adN End=ff02/blog/K97,f4M(0dw7f0bx70a*~adn MaxBytes=1406060 MetricsBytes=1529499 Bandwidth=Low BytesPerKSec=68000000 NumShards=3
117.545695 RelocateShardStartSplit Begin=ff02/blog/K97,f4M(0dw7f0bx70a*~ad86 End=ff02/blog/K97,f4M(0dw7f0bx70a*~ad9e MaxBytes=1422620 MetricsBytes=1462999 Bandwidth=Low BytesPerKSec=56000000 NumShards=2

sfc-gh-bvr commented 2 years ago

MutationLogReaderCorrectnessWorkload shows that numerous records are written with a backupLog prefix, and keys are incremented in fixed-size steps. So this is a rather sequential workload.
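
For illustration, a minimal sketch of that key pattern, assuming a fixed prefix plus a counter advanced in fixed-size steps; makeKey, the step value, and the loop are hypothetical, not the workload's actual code.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Build a key as prefix + big-endian(index * step), so consecutive records
// always land at the tail of the same key range.
std::string makeKey(const std::string& prefix, uint64_t recordIndex, uint64_t step) {
    uint64_t v = recordIndex * step;
    std::string key = prefix;
    for (int shift = 56; shift >= 0; shift -= 8)  // big-endian so keys sort by index
        key.push_back(static_cast<char>((v >> shift) & 0xff));
    return key;
}

int main() {
    const std::string backupLogPrefix = "\xff\x02/blog/";  // prefix visible in the traces above
    for (uint64_t i = 0; i < 3; ++i) {
        std::string k = makeKey(backupLogPrefix, i, 100);
        std::printf("record %llu -> key of length %zu\n",
                    static_cast<unsigned long long>(i), k.size());
    }
}
```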

The next question is why this test has started failing only recently. This is based on the assumption that, since this test appears in the fast directory, it is run rather often. In the issue description, Trevor writes that when this test was run 10k times, it failed 5 times. So this is somewhat rare, but if it were a test design issue, it surely would have been seen before. So let's assume something changed recently. Initial guesses -

volume of data generated by the workload

max shard size - perhaps it is now lower than it used to be

rate of data movement - say it became slower - perhaps a recent throttling change?

sfc-gh-bvr commented 2 years ago

MutationLogReaderCorrectnessWorkload hasn't changed much since it was originally written. The same goes for the shard size calculations.

Commits ca68fefca2b580c30380ea4069ea37e6681a8a8e, cf453548333027dc49d240efcfc0e263d2224728, and 268caa5ac86b2fb77cd58924898e189de509cf67, which change the shard size knobs, look like they may apply.

Reverting cf453548333027dc49d240efcfc0e263d2224728 does ‘fix’ the problem.

93.393024 RelocateShardStartSplit Begin= End=ffff MaxBytes=931220 MetricsBytes=1030731 Bandwidth=Low BytesPerKSec=64000000 NumShards=3
102.990749 RelocateShardStartSplit Begin= End=ff02...^869c1aa0f MaxBytes=1097180 MetricsBytes=1097245 Bandwidth=Low BytesPerKSec=88000000 NumShards=2
104.893617 RelocateShardStartSplit Begin=ff02...^869c1aa0b0 End=ffff MaxBytes=1128140 MetricsBytes=1163743 Bandwidth=Low BytesPerKSec=64000000 NumShards=2
105.136543 RelocateShardStartSplit Begin=ff02...^869c1aa0f End=ff02...^869c1aa0b0 MaxBytes=1154960 MetricsBytes=1230243 Bandwidth=Low BytesPerKSec=72000000 NumShards=3
108.270919 RelocateShardStartSplit Begin=ff02...^869c1aa0# End=ff02...^869c1aa0f MaxBytes=1200500 MetricsBytes=1296747 Bandwidth=Low BytesPerKSec=80000000 NumShards=3
108.603755 RelocateShardStartSplit Begin=ff02...^869c1aa0d3 End=ffff MaxBytes=1222820 MetricsBytes=1329997 Bandwidth=Low BytesPerKSec=64000000 NumShards=3
112.044829 RelocateShardStartSplit Begin= End=ff02...^869c1aa0# MaxBytes=1300940 MetricsBytes=1363248 Bandwidth=Low BytesPerKSec=60000000 NumShards=2
118.188251 RelocateShardStartSplit Begin=ff02...^869c1aa0b0 End=ff02...^869c1aa0d3 MaxBytes=1385180 MetricsBytes=1529496 Bandwidth=Low BytesPerKSec=56000000 NumShards=3
123.360419 RelocateShardStartSplit Begin=ff02...^869c1aa0f End=ff02...^869c1aa082 MaxBytes=1458080 MetricsBytes=1529499 Bandwidth=Low BytesPerKSec=84000000 NumShards=3
123.523045 RelocateShardStartSplit Begin=ff02...^869c1aa0# End=ff02...^869c1aa0< MaxBytes=1471040 MetricsBytes=1496250 Bandwidth=Normal BytesPerKSec=100000000 NumShards=2

The MaxBytes is much larger. So either a detRand path change or the size knob settings introduced by this commit seem to have caused the problem.

sfc-gh-bvr commented 2 years ago

buggifySmallShards makes the shard size ~160k (MIN_SHARD_BYTES * SHARD_BYTES_RATIO), which results in a lot of split and move activity. https://github.com/apple/foundationdb/pull/6996/files (merged 2022-05-10) introduced a 1000s time limit for QD checks, at which point this assert trips as an early and pointed indication of QD failing to converge, rather than a test timeout appearing later. If this 1000s is increased to 1500s instead, then the wait for QD does succeed. It seems the combination of the sequential workload in the test and the small shard size makes this 1000s allowance insufficient.
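
A sketch of the knob effect described above, with stand-in values; the function name and constants are assumptions, chosen only to reproduce the difference between the ~160k MaxBytes in the failing traces and the much larger values in the passing ones.

```cpp
#include <cstdio>

// Split threshold under normal vs. buggified knobs. The stand-in values below
// are picked only to hit the MaxBytes=160000 figure from the failing traces;
// the real knob initialization lives in the server knobs.
long computeMaxShardBytes(bool buggifySmallShards) {
    const long MIN_SHARD_BYTES = buggifySmallShards ? 40000 : 200000000;  // assumed stand-ins
    const long SHARD_BYTES_RATIO = 4;                                     // assumed stand-in
    return MIN_SHARD_BYTES * SHARD_BYTES_RATIO;  // ~160k under buggify, far larger otherwise
}

int main() {
    std::printf("buggified max shard bytes: %ld, normal: %ld\n",
                computeMaxShardBytes(true), computeMaxShardBytes(false));
}
```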

sfc-gh-yiwu commented 2 years ago

Looks like this still happens on main

Hash: 0fea3fb73

devRetryCorrectnessTest bin/fdbserver -r simulation -f tests/fast/MutationLogReaderCorrectness.toml -s 364594625 -b on --crash --trace_format json

sfc-gh-bvr commented 2 years ago

At that commit and with that seed, I'm not able to repro the problem either with gcc or clang. Would you mind verifying it please? Any other build flavor or factors that I may be missing?

sfc-gh-xwang commented 2 years ago

> At that commit and with that seed, I'm not able to repro the problem either with gcc or clang. Would you mind verifying it, please? Any other build flavor or factors that I may be missing?

Not quite related to this issue, but I have also struggled with reproducing TracedTooManyLines failures sometimes. The nightly run reports this kind of failure, but when I retry the same case, the test passes without any SevError events... Is it possible that TracedTooManyLines produces false positives? @sfc-gh-mpilman

sfc-gh-mpilman commented 2 years ago

I don’t know how that could happen. Tracing is deterministic (or it should be, if it isn’t that would be a serious bug).

I wonder whether the chosen log format makes a difference?

sfc-gh-yiwu commented 2 years ago

> At that commit and with that seed, I'm not able to repro the problem either with gcc or clang. Would you mind verifying it please? Any other build flavor or factors that I may be missing?

@sfc-gh-bvr Sorry, I didn't see your reply earlier. It is a clang build, and the seed fails consistently on my side. Weird.

sfc-gh-xwang commented 2 years ago

Here is another false-positive case.

I changed the buggify timeout to 2500s, and then the test passed.

sfc-gh-yiwu commented 2 years ago

I changed the timeout following #7290 and did a 100k run of the MutationLogReaderCorrectness test. There's 1 seed out of 100k that needs a timeout > 3571 secs. cc @sfc-gh-bvr

sfc-gh-bvr commented 2 years ago

I increased the timeout further via https://github.com/apple/foundationdb/pull/7481. A better way to do this would be to monitor progress rather than have a fixed timeout, since run times can be long when buggify is on and certain knobs are set lower or higher than normal; that is future work on the quietDatabase checker.
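
A rough sketch of that progress-based alternative, assuming the checker can reduce its gates to a single backlog number it polls; every name and value below is hypothetical, not FDB code.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Stand-in for reading the remaining data-distribution backlog; a real checker
// would derive this from the same gates quietDatabase already polls.
long pollBacklogBytes() {
    static long backlog = 10000000;
    backlog = backlog > 1000000 ? backlog - 1000000 : 0;  // simulated steady progress
    return backlog;
}

// Give up only when the backlog stops shrinking for stallLimit, instead of
// after a fixed wall-clock deadline.
bool waitForQuietWithProgress(std::chrono::seconds stallLimit) {
    using clock = std::chrono::steady_clock;
    long best = pollBacklogBytes();
    auto lastProgress = clock::now();
    while (best > 0) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        long now = pollBacklogBytes();
        if (now < best) {  // still making progress: reset the stall timer
            best = now;
            lastProgress = clock::now();
        } else if (clock::now() - lastProgress > stallLimit) {
            return false;  // no progress for stallLimit: genuinely stuck
        }
    }
    return true;  // backlog drained; the database is quiet
}

int main() {
    std::printf("quiet: %d\n", waitForQuietWithProgress(std::chrono::seconds(5)));
}
```

The stall timer only resets when the backlog actually shrinks, so a long but steadily progressing cleanup would pass while a genuinely stuck one would still fail promptly.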