Open karalabe opened 11 months ago
@karalabe Thanks for the report. Do you have a complete CPU profile which we can explore? The image is cut off at the bottom. If you have any Pebble logs from around this time, those could help, especially the results from DB.Metrics().
If we're spending a ton of time in readBlock, then we might be seeing a ton of Pebble block cache misses, which is a metric present in DB.Metrics(). Could you also describe the kind of workload being executed against Pebble?
Thanks.
I do have a CPU profile, just GitHub didn't let me upload it. Trying again with a renamed file...
Our workload is an Ethereum node's block processing. That's the weird part. If I restart my node, everything goes back to normal, but it will just pick up where it left off with the processing... continuing normally with minimal load.
I'm trying to figure out if there's some other part of our code that might go haywire - we do have networking - or if it could be attributed somehow to Pebble internals. Based on the CPU profile, the callsite of the heavy thing is completely missing... bummer.
Here's a stack trace from a recent hang (i.e. it's hanging now on my machine).
Maybe you can see something that should not be there wrt pebble?
And a different CPU stack, but mostly same results:
FWIW, we do have metrics monitoring the db compaction rate/time/writes but all that is idling at 0.
Also, a very interesting phenomenon. If this issue triggers, it triggers on startup. If it does not happen on startup then it does not happen at all.
I've tried restarting the process and forcing a full compaction to see if it might be related. Nope. Processing became a bit slower but otherwise worked perfectly after the restart, even with full compaction going on in the background.
A few questions:
- How big is the Options.Cache?
- Have you set custom BlockSizes, and if so, what are they set to?
- How large of KVs are you storing in Pebble? And in particular, how large of KVs is this particular codepath reading?
- stacks.txt contains a stack with > 100 stack frames in github.com/ethereum/go-ethereum/trie/triedb/pathdb.(*diffLayer).node (which ultimately seems to call into (*pebble.DB).Get). Is that expected?
- Is it possible (*pebble.DB).Get is not looping, but the application is invoking (*pebble.DB).Get in a tight loop (eg, significantly more than during baseline CPU levels)?
From the most recent CPU profile:
. . 112: // We've advanced the iterator passed the desired key. Move on to the
. . 113: // next memtable / level.
20ms 8.99s 114: g.err = g.iter.Close()
. . 115: g.iter = nil
. . 116: if g.err != nil {
. . 117: return nil, base.LazyValue{}
. . 118: }
. . 156: if g.level == 0 {
. . 157: // Create iterators from L0 from newest to oldest.
. . 158: if n := len(g.l0); n > 0 {
. 10ms 159: files := g.l0[n-1].Iter()
. . 160: g.l0 = g.l0[:n-1]
. . 161: iterOpts := IterOptions{logger: g.logger}
. . 162: g.levelIter.init(context.Background(), iterOpts, g.cmp, nil /* split */, g.newIters,
. . 163: files, manifest.L0Sublevel(n), internalIterOpts{})
. . 164: g.levelIter.initRangeDel(&g.rangeDelIter)
. . 165: g.iter = &g.levelIter
20ms 466.06s 166: g.iterKey, g.iterValue = g.iter.SeekGE(g.key, base.SeekGEFlagsNone)
. . 167: continue
. . 168: }
. . 169: g.level++
. . 170: }
(pprof) list readBlock
Total: 560.73s
ROUTINE ======================== github.com/cockroachdb/pebble/sstable.(*Reader).readBlock in sstable/reader.go
30ms 467.58s (flat, cum) 83.39% of Total
. . 2829:func (r *Reader) readBlock(
. . 2830: ctx context.Context,
. . 2831: bh BlockHandle,
. . 2832: transform blockTransform,
. . 2833: readHandle objstorage.ReadHandle,
. . 2834: stats *base.InternalIteratorStats,
. . 2835:) (handle cache.Handle, _ error) {
. 370ms 2836: if h := r.opts.Cache.Get(r.cacheID, r.fileNum, bh.Offset); h.Get() != nil {
. . 2837: if readHandle != nil {
. . 2838: readHandle.RecordCacheHit(int64(bh.Offset), int64(bh.Length+blockTrailerLen))
. . 2839: }
. . 2840: if stats != nil {
. . 2841: stats.BlockBytes += bh.Length
. . 2842: stats.BlockBytesInCache += bh.Length
. . 2843: }
. . 2844: return h, nil
. . 2845: }
. . 2846:
. 520ms 2847: v := r.opts.Cache.Alloc(int(bh.Length + blockTrailerLen))
. . 2848: b := v.Buf()
. . 2849: readStartTime := time.Now()
. . 2850: var err error
. . 2851: if readHandle != nil {
. 85.25s 2852: _, err = readHandle.ReadAt(b, int64(bh.Offset))
. . 2853: } else {
. 40ms 2854: _, err = r.readable.ReadAt(b, int64(bh.Offset))
. . 2855: }
. . 2856: readDuration := time.Since(readStartTime)
. . 2857: // TODO(sumeer): should the threshold be configurable.
. . 2858: const slowReadTracingThreshold = 5 * time.Millisecond
. . 2859: // The invariants.Enabled path is for deterministic testing.
. . 2860: if invariants.Enabled {
. . 2861: readDuration = slowReadTracingThreshold
. . 2862: }
. . 2863: // Call IsTracingEnabled to avoid the allocations of boxing integers into an
. . 2864: // interface{}, unless necessary.
. . 2865: if readDuration >= slowReadTracingThreshold && r.opts.LoggerAndTracer.IsTracingEnabled(ctx) {
. . 2866: r.opts.LoggerAndTracer.Eventf(ctx, "reading %d bytes took %s",
. . 2867: bh.Length+blockTrailerLen, readDuration.String())
. . 2868: }
. . 2869: if stats != nil {
. . 2870: stats.BlockReadDuration += readDuration
. . 2871: }
10ms 10ms 2872: if err != nil {
. . 2873: r.opts.Cache.Free(v)
. . 2874: return cache.Handle{}, err
. . 2875: }
. . 2876:
20ms 380.90s 2877: if err := checkChecksum(r.checksumType, b, bh, r.fileNum); err != nil {
. . 2878: r.opts.Cache.Free(v)
. . 2879: return cache.Handle{}, err
. . 2880: }
. . 2881:
. . 2882: typ := blockType(b[bh.Length])
. . 2883: b = b[:bh.Length]
. . 2884: v.Truncate(len(b))
. . 2885:
. 170ms 2886: decoded, err := decompressBlock(r.opts.Cache, typ, b)
. . 2887: if decoded != nil {
. 40ms 2888: r.opts.Cache.Free(v)
. . 2889: v = decoded
. . 2890: b = v.Buf()
. . 2891: } else if err != nil {
. . 2892: r.opts.Cache.Free(v)
. . 2893: return cache.Handle{}, err
. . 2894: }
. . 2895:
. . 2896: if transform != nil {
. . 2897: // Transforming blocks is rare, so the extra copy of the transformed data
. . 2898: // is not problematic.
. . 2899: var err error
. . 2900: b, err = transform(b)
. . 2901: if err != nil {
. . 2902: r.opts.Cache.Free(v)
. . 2903: return cache.Handle{}, err
. . 2904: }
. . 2905: newV := r.opts.Cache.Alloc(len(b))
. . 2906: copy(newV.Buf(), b)
. . 2907: r.opts.Cache.Free(v)
. . 2908: v = newV
. . 2909: }
. . 2910:
. . 2911: if stats != nil {
. . 2912: stats.BlockBytes += bh.Length
. . 2913: }
. . 2914:
. 280ms 2915: h := r.opts.Cache.Set(r.cacheID, r.fileNum, bh.Offset, v)
. . 2916: return h, nil
. . 2917:}
. . 2918:
. . 2919:func (r *Reader) transformRangeDelV1(b []byte) ([]byte, error) {
. . 2920: // Convert v1 (RocksDB format) range-del blocks to v2 blocks on the fly. The
It looks like there are repeated cache misses, particularly for blocks within L0. The cost of reads scales linearly with the number of sublevels in L0.
How big is the Options.Cache?
In the runs above, it is 1GB.
Have you set custom BlockSizes, and if so, what are they set to?
No. We do store everything in 2MB files though. It is insane on the file count, but performance-wise it has always outperformed the default schemas.
How large of KVs are you storing in Pebble? And in particular, how large of KVs is this particular codepath reading?
I'd say that the keys are up to 64B in size and the values are mostly < 512B. The total size of this DB that is behaving weirdly is around 300GB.
A single read that you can see in the stacks is for a single key, so around a 100-500B value.
I noticed stacks.txt contains a stack with > 100 stack frames in github.com/ethereum/go-ethereum/trie/triedb/pathdb.(*diffLayer).node (which ultimately seems to call into (*pebble.DB).Get). Is that expected?
Yes, that looks a bit odd but is expected. In our usage pattern, we have a version of the data on disk, which is not expected to change much, and we have 128 tiny modifications on top where we can jump around from one version to the other. The 128-deep call stack is just a reader trying to find a value in all the latest 128 volatile modifications, failing, and then finally reaching the db to read the persistent thing.
Is it possible (*pebble.DB).Get is not looping, but the application is invoking (*pebble.DB).Get in a loop?
Absolutely, but the reason I find it odd - it still could happen! - is that our data processing and associated access pattern is 100% deterministic. In the case of this strange read loop, if I stop the process and restart it, it will continue exactly where it left off with the same access patterns... but it will just run smoothly without an issue.
Do you know what the read amplification was/is when the CPU saturation occurs?
Compared to the expected reads for the same workload?
Before the problematic amplification, our processing read 5GB of data from Pebble in 1h. Upon restarting and hitting this issue, the processing in the next hour read 23TB from Pebble. So approximately 4600x read amplification :/
One interesting data point is that this issue only ever happened on startup. If on startup this got triggered, it just keeps going. I saw it once subside in 3 hours; yesterday I stopped it after 4ish. Today I stopped it after 2. But if on startup everything's OK, I have never seen it happen.
Also worth noting that I did try nuking out all the network traffic just to make sure it doesn't originate there and it did not help. Complete internet isolation still kept the same disk read load up.
But again, I cannot say for sure that this is a fault in Pebble. I'm just running out of ideas on what to check, and the first time we noticed it was after updating to the 23.1 release. Previously we were on some very old release and never saw this before.
FWIW I can try downgrading to an older Pebble out of curiosity, but given how unlikely this issue is to trigger in general, I won't really know if it doesn't trigger because of the old Pebble or because it just didn't trigger yet.
Hmm, I may have a set of interesting metrics. We're tracking how many data items we hit in our in-ram caches and how many miss (go down to the disk). I should be able to compare how many db requests we've made when things operate normally and when crap hits the fan. Gotta run for a few hours but I'll get back with those charts.
In theory, if our metrics report that we're hitting pebble with +- the same load in both cases, that would be a strong case pointing towards a fault in Pebble. If our metrics report that we ourselves are having gazillions of cache misses, then it's clearly us who are hitting Pebble like crazy... Exciting... metrics rule. Brb.
Sounds good. If you're able to, can you also collect measurements of Metrics.ReadAmp over time? If data is written to Pebble more quickly than Pebble is able to compact it into lower levels, L0 can accumulate many files and, in particular, many sublevels. Every additional sublevel causes an additional block load during a read. I'm wondering if there's a spike in the number of sublevels, resulting in each Get expending much more time reading blocks.
One more question: How many CPUs does the process have available (i.e., what does runtime.GOMAXPROCS(0) return)? We scale the block cache shard count according to the number of procs, and this scaling factor was changed recently (ce43e6535942905b82733884aafdca570c30e4ee).
48
This is the CPU chart of our process. It was restarted in the middle, the point when it entered this weird CPU load.
Looking at our data processing pipeline, we track how many data items are missed from a cache and need to be loaded from disk. I.e. each miss is a database read (pretty much 1-to-1 a pebble.Get).
There's a single instantaneous spike, which is essentially the process starting up and the entire cache being cold, so everything is a miss. After the initial spike, however, the database reads - as reported by this specific subsystem - are negligible. It is so overloaded that it actually doesn't even get to do the same number of reads as before.
In the original pprof image, the entire load originates from this processing pathway:
And that pathway basically tracks all db reads as a cache miss.
My hunch is still that this amplification somehow originates from Pebble, with the above as seeming proof. Now, I could imagine too much junk accumulating somewhere and triggering slowdowns, but I doubt that due to:
My vote is still something funny in Pebble. That said, I cannot guarantee that some other code path doesn't hammer Pebble. But with networking down... I can't imagine what else there is to hit the disk.
I'm adding a bit of support to expose the Pebble metrics as a string dump via https://pkg.go.dev/github.com/cockroachdb/pebble#Metrics.String and will get back if I have anything of interest.
Restarted the thing with Pebble metrics now enabled. On startup (+ a few secs I needed to pull the stats), I've seen
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
WAL 2 1.1 M - 1.1 M - - - - 1.1 M - - - 1.0
0 196 716 M 0.50 0 B 0 B 0 0 B 0 0 B 0 0 B 1 0.0
1 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0
2 38 53 M 1.00 286 M 0 B 0 0 B 0 278 M 168 311 M 1 1.0
3 334 449 M 1.00 238 M 0 B 0 19 M 10 501 M 353 583 M 1 2.1
4 499 845 M 0.26 164 M 0 B 0 0 B 0 253 M 153 297 M 1 1.5
5 360 3.4 M 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 1 0.0
6 116048 214 G - 0 B 0 B 0 0 B 0 0 B 0 0 B 1 0.0
total 117475 216 G - 1.1 M 0 B 0 19 M 10 1.0 G 674 1.2 G 6 949.8
flush 0 0 B 0 0 (ingest = tables-ingested, move = ingested-as-flushable)
compact 239 3.0 G 0 B 0 (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
ctype 229 0 0 10 0 0 0 (default, delete, elision, move, read, rewrite, multi-level)
memtbl 2 512 M
zmemtbl 0 0 B
ztbl 0 0 B
bcache 273 K 229 M 51.8% (score == hit-rate)
tcache 41 K 28 M 29.3% (score == hit-rate)
snaps 0 - 0 (score == earliest seq num)
titers 0
filter - - 0.0% (score == utility)
About 30 mins later I'm at almost 1TB of reads, linearly going up with no sign of stopping.
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
WAL 2 116 M - 116 M - - - - 116 M - - - 1.0
0 196 716 M 0.50 0 B 0 B 0 0 B 0 0 B 0 0 B 1 0.0
1 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0
2 38 53 M 1.00 286 M 0 B 0 0 B 0 278 M 168 311 M 1 1.0
3 334 449 M 1.00 238 M 0 B 0 19 M 10 501 M 353 583 M 1 2.1
4 499 845 M 0.26 164 M 0 B 0 0 B 0 253 M 153 297 M 1 1.5
5 360 3.4 M 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 1 0.0
6 116048 214 G - 0 B 0 B 0 0 B 0 0 B 0 0 B 1 0.0
total 117475 216 G - 116 M 0 B 0 19 M 10 1.1 G 674 1.2 G 6 9.9
flush 0 0 B 0 0 (ingest = tables-ingested, move = ingested-as-flushable)
compact 239 3.0 G 0 B 0 (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
ctype 229 0 0 10 0 0 0 (default, delete, elision, move, read, rewrite, multi-level)
memtbl 2 512 M
zmemtbl 0 0 B
ztbl 0 0 B
bcache 755 K 1.5 G 84.4% (score == hit-rate)
tcache 30 K 20 M 83.9% (score == hit-rate)
snaps 0 - 0 (score == earliest seq num)
titers 0
filter - - 0.0% (score == utility)
The CPU load is not as pronounced as before, but yesterday I did do a whole-database compaction, so that might explain why both the CPU and disk aren't hit like crazy now.
Still, the database reads keep growing at an alarming rate. I'll try restarting to see if it normalizes and what that reports. Note: doing a profiling run, most of the time is still spent by Pebble doing CRC hashing, so I assume it's loading data like crazy.
After a restart everything seems to work ok, disk reads in the first 30 minutes was around 4GB total.
Pebble metrics after the 30 minute mark are:
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
WAL 2 110 M - 110 M - - - - 110 M - - - 1.0
0 218 757 M 0.50 0 B 0 B 0 0 B 0 0 B 0 0 B 1 0.0
1 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0
2 42 62 M 1.00 92 M 0 B 0 203 M 1 110 M 68 114 M 1 1.2
3 323 442 M 0.99 279 M 0 B 0 3.3 M 3 397 M 128 414 M 1 1.4
4 533 1.1 G 0.33 272 M 0 B 0 0 B 0 308 M 63 318 M 1 1.1
5 360 3.4 M 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 1 0.0
6 116048 214 G - 0 B 0 B 0 0 B 0 0 B 0 0 B 1 0.0
total 117524 217 G - 110 M 0 B 0 206 M 4 925 M 259 846 M 6 8.4
flush 0 0 B 0 0 (ingest = tables-ingested, move = ingested-as-flushable)
compact 83 3.3 G 0 B 0 (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
ctype 79 0 0 4 0 0 0 (default, delete, elision, move, read, rewrite, multi-level)
memtbl 2 512 M
zmemtbl 0 0 B
ztbl 0 0 B
bcache 592 K 1.4 G 85.6% (score == hit-rate)
tcache 33 K 22 M 78.6% (score == hit-rate)
snaps 0 - 0 (score == earliest seq num)
titers 0
filter - - 0.0% (score == utility)
Pebble is also completely missing from the CPU profile, its load is so insignificant: cpu.profile.ok.txt
Just to check whether it's networking or some background thing in Pebble:
Seems to definitely originate from that specific data pipeline. I'll try to add support for mocking out even more stuff from our code to isolate it even better to that thing. It just takes a non-trivial effort to jump through all the hoops - on a production server... 😓
I think I've found something of interest. We have an API call to basically do a direct db read.
So I took a "faulty" run and basically stopped all processing in it. Everything goes idle. Then I did 2 read "benchmarks" (more on the key layer):
1000x Pebble.Get("0x41" ++ <32 len random hex>) => 467ms
1000x Pebble.Get("0x4f" ++ <32 len random hex>) => 62500ms
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
WAL 1 340 M - 1.1 G - - - - 1.1 G - - - 1.0
0 271 938 M 0.54 795 M 0 B 0 0 B 0 535 M 175 0 B 1 0.7
1 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0
2 37 55 M 1.00 598 M 0 B 0 0 B 0 586 M 237 628 M 1 1.0
3 343 448 M 0.99 546 M 0 B 0 12 M 6 950 M 558 1.2 G 1 1.7
4 734 1.4 G 0.61 293 M 0 B 0 0 B 0 371 M 120 625 M 1 1.3
5 360 3.4 M 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 1 0.0
6 116048 214 G - 0 B 0 B 0 0 B 0 0 B 0 0 B 1 0.0
total 117793 217 G - 1.1 G 0 B 0 12 M 6 3.5 G 1.1 K 2.4 G 6 3.2
flush 2 0 B 0 0 (ingest = tables-ingested, move = ingested-as-flushable)
compact 294 4.1 G 0 B 0 (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
ctype 288 0 0 6 0 0 0 (default, delete, elision, move, read, rewrite, multi-level)
memtbl 1 512 M
zmemtbl 0 0 B
ztbl 0 0 B
bcache 922 K 1.5 G 91.9% (score == hit-rate)
tcache 32 K 22 M 95.3% (score == hit-rate)
snaps 0 - 0 (score == earliest seq num)
titers 0
filter - - 0.0% (score == utility)
Doing a restart, the process seems to execute properly fast now. Running the same benchmarks:
1000x Pebble.Get("0x41" ++ <32 len random hex>) => 445ms
1000x Pebble.Get("0x4f" ++ <32 len random hex>) => 629ms
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
WAL 2 104 M - 103 M - - - - 104 M - - - 1.0
0 286 1.0 G 0.57 0 B 0 B 0 0 B 0 0 B 0 0 B 1 0.0
1 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0
2 35 51 M 0.97 177 M 0 B 0 171 M 1 162 M 98 180 M 1 0.9
3 363 438 M 0.97 328 M 0 B 0 6.7 M 4 458 M 225 498 M 1 1.4
4 799 1.5 G 0.64 305 M 0 B 0 0 B 0 407 M 145 621 M 1 1.3
5 360 3.4 M 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 1 0.0
6 116048 214 G - 0 B 0 B 0 0 B 0 0 B 0 0 B 1 0.0
total 117891 218 G - 104 M 0 B 0 178 M 5 1.1 G 468 1.3 G 6 10.9
flush 0 0 B 0 0 (ingest = tables-ingested, move = ingested-as-flushable)
compact 167 4.4 G 0 B 0 (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
ctype 162 0 0 5 0 0 0 (default, delete, elision, move, read, rewrite, multi-level)
memtbl 2 512 M
zmemtbl 0 0 B
ztbl 0 0 B
bcache 483 K 1.1 G 81.6% (score == hit-rate)
tcache 38 K 26 M 69.4% (score == hit-rate)
snaps 0 - 0 (score == earliest seq num)
titers 0
filter - - 0.0% (score == utility)
Pebble vs us to blame:
Now, the good news. One key range worked just as fast in both slow and normal mode; vs another key range exhibited a 100x slowdown in slow mode. The difference is the key range density.
And this can also be seen from our internal metrics (just we couldn't really trust it), that data processing accessing the first key range is 100x faster than data processing related to the second key range.
Count/size-wise, the first key range is possibly significantly smaller than the second key range. One order of magnitude for us, maybe 2, but I'd need to gather some stats; I don't have those lying around.
The Metrics.String() output taken multiple times throughout while the CPU usage is high would be instructive. Can you provide me with the contents of your OPTIONS-XXXXXX file?
Both of the CPU profiles you provided that show significant DB.Get CPU usage show that it's all in reading L0 files.
. . 156: if g.level == 0 {
. . 157: // Create iterators from L0 from newest to oldest.
. . 158: if n := len(g.l0); n > 0 {
. 10ms 159: files := g.l0[n-1].Iter()
. . 160: g.l0 = g.l0[:n-1]
. . 161: iterOpts := IterOptions{logger: g.logger}
. . 162: g.levelIter.init(context.Background(), iterOpts, g.cmp, nil /* split */, g.newIters,
. . 163: files, manifest.L0Sublevel(n), internalIterOpts{})
. . 164: g.levelIter.initRangeDel(&g.rangeDelIter)
. . 165: g.iter = &g.levelIter
20ms 466.06s 166: g.iterKey, g.iterValue = g.iter.SeekGE(g.key, base.SeekGEFlagsNone)
. . 167: continue
. . 168: }
. . 169: g.level++
. . 170: }
. . 156: if g.level == 0 {
. . 157: // Create iterators from L0 from newest to oldest.
. . 158: if n := len(g.l0); n > 0 {
. . 159: files := g.l0[n-1].Iter()
. . 160: g.l0 = g.l0[:n-1]
. . 161: iterOpts := IterOptions{logger: g.logger}
. . 162: g.levelIter.init(context.Background(), iterOpts, g.cmp, nil /* split */, g.newIters,
. . 163: files, manifest.L0Sublevel(n), internalIterOpts{})
. . 164: g.levelIter.initRangeDel(&g.rangeDelIter)
. . 165: g.iter = &g.levelIter
10ms 119.77s 166: g.iterKey, g.iterValue = g.iter.SeekGE(g.key, base.SeekGEFlagsNone)
. . 167: continue
. . 168: }
. . 169: g.level++
. . 170: }
The Metrics.String() output you've provided hasn't shown high sublevels, but they've all been taken after hundreds of compactions have already completed.
The Metrics.String() output taken multiple times throughout while the CPU usage is high would be instructive.
You have two here: https://github.com/cockroachdb/pebble/issues/2945#issuecomment-1740389293
Is your write workload bursty?
Yes, we accumulate about 64MB of writes and then throw it at Pebble. It does take a while, so often Pebble would be idling on the writes (or very minimal).
Can you provide me with the contents of your OPTIONS-XXXXXX file?
Sure: OPTIONS-1062222.txt
The Metrics.String() output you've provided hasn't shown high sublevels, but they've all been taken after hundreds of compactions have already completed.
I'm just restarting the same processing and continuing where the previous run left off. Sometimes it slows down, sometimes not. I'm still trying to find some correlation there.
I am noticing a very, very weird pattern here, whose relevance I cannot fathom. Consider it a random braindump.
This process is run via Docker. Every time I update the container to a new image (I can just fix a typo, no functional change needed), it will trigger the slowdown. If I just nuke the container and recreate it, stop and restart, whatever else; it doesn't happen.
Hah, no. Just restarted it and it happened consecutively again. So no, unrelated to the update. Perhaps some timing issue somewhere... ugh
FWIW, I've tried swapping between Pebble 23.1 and the master branch from a few weeks ago (one of our colleagues sent a few PRs in). Until now, this issue only happens on 23.1 but not on master. That said, it's not like it's so easy to trigger, so maybe master just got lucky in my tests so far. Will keep trying.
I've switched my prod instance over to Pebble master. I couldn't repro the issue with it, so let's see if running it the same way I ran 23.1 previously will trigger it. Will get back if anything suspicious appears.
I've asked someone else who's been regularly hitting this issue (50% of restarts) to check with Pebble master, and they also confirmed that the issue is gone.
This pretty much confirms that Pebble was/is indeed at fault; however, I don't have the capacity to repro exactly which commit in the last half a year fixed it, or why. I am unsure what's the best course of action on your part, given that it is a gnarly issue.
I'm glad it appears resolved. CockroachDB does not make use of the DB.Get API at all, and we've not encountered any similar behavior, so I don't think we can dedicate the resources to root cause it.
A bit of an aside, but do you provide a Split function on your Options.Comparer? Historically DB.Get hasn't taken advantage of bloom filters, because the bloom filters are defined specifically over the Split-defined "prefix" of a key in order to support MVCC applications like CockroachDB. In #2846 a contributor added support for using bloom filters during DB.Get lookups if Options.Comparer.Split != nil. If you're not already providing a Split function, you may want to experiment with setting Split to func(key []byte) int { return len(key) }.
I'm aware of that PR yes, we've sent it :)
For the comparer, we just use the Pebble default. Digging through the code, that doesn't seem to set a Split method. Perhaps that should also set something by default if it's relevant?
@jbowens bloom filter will also be used in Get even if Split is not specified; the default prefix is just the whole user key.
Ah, nice! I missed that. It's possible that you stopped observing high CPU usage in this scenario simply because @rjl493456442's change eliminated unnecessary block loads through bloom filter exclusions. The root cause of the high CPU usage may remain and simply be mitigated.
This happened to me twice in the last 2 weeks and we've received a report from someone else too that it happened to them. I have absolutely no idea how to reproduce it.
Very rarely our process gets into this very weird read loop where it's self DoS-ing somehow. I'm trying to find where the reads originate from, but my NVME SSD is maxed out at multi GB/s reads and the CPU is maxed out in some Pebble internal process. I'm hoping you can maybe help debug this.
If I restart my process, everything goes back to perfectly normal and it doesn't happen again. I've tried reproducing it, but failed miserably.
However, this last time it happened I did manage to make a CPU profile. I've attached the pprof web view. Basically the entire process is stuck in Pebble's sstable.readBlock, about 80% ending up in crc update calculation and 20% in a syscall. The IOwait during this time is negligible (20% out of a max 4800%).
The pebble we use is v0.0.0-20230906160148-46873a6a7a06 (i.e. 23.1, beginning-of-September release).
Jira issue: PEBBLE-74