
db: db.Get exhibiting high CPU usage, repeated cache misses #2945

Open karalabe opened 11 months ago

karalabe commented 11 months ago

This happened to me twice in the last 2 weeks and we've received a report from someone else too that it happened to them. I have absolutely no idea how to reproduce it.

Very rarely, our process gets into a very weird read loop where it's somehow self-DoSing. I'm trying to find where the reads originate from, but my NVMe SSD is maxed out at multi-GB/s reads and the CPU is maxed out in some Pebble-internal code path. I'm hoping you can maybe help debug this.

If I restart my process everything goes back to perfectly normal and it doesn't happen again. I've tried reproducing it, but failed miserably.

However, this last time it happened I did manage to take a CPU profile. I've attached the pprof web view. Basically the entire process is stuck in Pebble's sstable.ReadBlock, with about 80% ending up in CRC update calculation and 20% in a syscall. The IO wait during this time is negligible (20% out of a max of 4800%).

[pprof web view of the CPU profile: pprof001]

The pebble we use is v0.0.0-20230906160148-46873a6a7a06 (i.e. 23.1, beginning of September release).
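For anyone wanting to grab a similar profile from their own process, here is a minimal sketch using the standard runtime/pprof package; how the attached profile was actually captured may well differ, so treat this as illustrative only.

package main

import (
	"log"
	"os"
	"runtime/pprof"
	"time"
)

// captureCPUProfile samples the CPU for dur and writes the profile to path.
// Generic sketch only; not necessarily how the attached profile was produced.
func captureCPUProfile(path string, dur time.Duration) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		return err
	}
	time.Sleep(dur) // keep sampling while the suspect read loop is running
	pprof.StopCPUProfile()
	return nil
}

func main() {
	if err := captureCPUProfile("cpu.profile", 30*time.Second); err != nil {
		log.Fatal(err)
	}
}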

Jira issue: PEBBLE-74

bananabrick commented 11 months ago

@karalabe Thanks for the report. Do you have a complete CPU profile we can explore? The image is cut off at the bottom. If you have any Pebble logs from around that time, those could help too, especially the results from DB.Metrics().

If we're spending a ton of time in ReadBlock, then we might be seeing a ton of pebble block cache misses, which is a metric present in DB.Metrics(). Could you also describe the kind of workload being executed against Pebble?

Thanks.
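For reference, pulling those numbers from a live DB boils down to something like the sketch below. The BlockCache field names are an assumption based on the Pebble version discussed here and may differ across releases.

package sketch

import (
	"log"

	"github.com/cockroachdb/pebble"
)

// logMetrics dumps the block cache counters plus the full metrics table.
// The BlockCache.Hits/Misses field names are assumed; verify against the
// Pebble version in use.
func logMetrics(db *pebble.DB) {
	m := db.Metrics()
	log.Printf("block cache: count=%d size=%dB hits=%d misses=%d",
		m.BlockCache.Count, m.BlockCache.Size, m.BlockCache.Hits, m.BlockCache.Misses)
	log.Printf("pebble metrics:\n%s", m.String())
}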

karalabe commented 11 months ago

I do have a CPU profile, just GitHub didn't let me upload it. Trying again with the file renamed...

cpuprofile.txt

karalabe commented 11 months ago

Our workload is an Ethereum node's block processing. That's the weird part. If I restart my node, everything goes back to normal, but it will just pick up where it left off with the processing... continuing normally with minimal load.

I'm trying to figure out if there's some other part of our code that might go haywire - we do have networking - or if it could be attributed somehow to Pebble internals. Based on the CPU profile, the callsite of the heavy thing is completely missing... bummer.

karalabe commented 11 months ago

Here's a stack trace from a recent hang (i.e. it's hanging now on my machine).

stacks.txt

Maybe you can see something that should not be there wrt pebble?

karalabe commented 11 months ago

And a different CPU stack, but with mostly the same results:

cpu.profile.txt

karalabe commented 11 months ago

FWIW, we do have metrics monitoring the db compaction rate/time/writes but all that is idling at 0.

karalabe commented 11 months ago

Also, a very interesting phenomenon. If this issue triggers, it triggers on startup. If it does not happen on startup then it does not happen at all.

karalabe commented 11 months ago

I've tried restarting the process and forcing a full compaction to see if it might be related. Nope. Processing became a bit slower but otherwise worked perfectly after the restart, even with full compaction going on in the background.
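For reference, that forced full compaction is roughly the call sketched below; the exact Compact signature (notably the parallelize argument) differs between Pebble releases, and the key bounds here are illustrative.

package sketch

import (
	"bytes"

	"github.com/cockroachdb/pebble"
)

// fullCompact asks Pebble to manually compact (roughly) the whole keyspace.
// The bounds are illustrative and the signature may differ by release.
func fullCompact(db *pebble.DB) error {
	start := []byte{0x00}
	end := bytes.Repeat([]byte{0xff}, 32) // past any key we store
	return db.Compact(start, end, true /* parallelize */)
}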

jbowens commented 11 months ago

A few questions:

  1. How big is the Options.Cache?
  2. Have you set custom BlockSizes, and if so, what are they set to?
  3. How large of KVs are you storing in Pebble? And in particular, how large of KVs is this particular codepath reading?
  4. I noticed stacks.txt contains a stack with > 100 stack frames in github.com/ethereum/go-ethereum/trie/triedb/pathdb.(*diffLayer).node (which ultimately seems to call into (*pebble.DB).Get). Is that expected?
  5. Is it possible (*pebble.DB).Get is not looping, but the application is invoking (*pebble.DB).Get in a tight loop (eg, significantly more than during baseline CPU levels)?
  6. Do you know what the read amplification was/is when the CPU saturation occurs?

From the most recent CPU profile:

         .          .    112:           // We've advanced the iterator passed the desired key. Move on to the
         .          .    113:           // next memtable / level.
      20ms      8.99s    114:           g.err = g.iter.Close()
         .          .    115:           g.iter = nil
         .          .    116:           if g.err != nil {
         .          .    117:               return nil, base.LazyValue{}
         .          .    118:           }
         .          .    156:       if g.level == 0 {
         .          .    157:           // Create iterators from L0 from newest to oldest.
         .          .    158:           if n := len(g.l0); n > 0 {
         .       10ms    159:               files := g.l0[n-1].Iter()
         .          .    160:               g.l0 = g.l0[:n-1]
         .          .    161:               iterOpts := IterOptions{logger: g.logger}
         .          .    162:               g.levelIter.init(context.Background(), iterOpts, g.cmp, nil /* split */, g.newIters,
         .          .    163:                   files, manifest.L0Sublevel(n), internalIterOpts{})
         .          .    164:               g.levelIter.initRangeDel(&g.rangeDelIter)
         .          .    165:               g.iter = &g.levelIter
      20ms    466.06s    166:               g.iterKey, g.iterValue = g.iter.SeekGE(g.key, base.SeekGEFlagsNone)
         .          .    167:               continue
         .          .    168:           }
         .          .    169:           g.level++
         .          .    170:       }
(pprof) list readBlock
Total: 560.73s
ROUTINE ======================== github.com/cockroachdb/pebble/sstable.(*Reader).readBlock in sstable/reader.go
      30ms    467.58s (flat, cum) 83.39% of Total
         .          .   2829:func (r *Reader) readBlock(
         .          .   2830:   ctx context.Context,
         .          .   2831:   bh BlockHandle,
         .          .   2832:   transform blockTransform,
         .          .   2833:   readHandle objstorage.ReadHandle,
         .          .   2834:   stats *base.InternalIteratorStats,
         .          .   2835:) (handle cache.Handle, _ error) {
         .      370ms   2836:   if h := r.opts.Cache.Get(r.cacheID, r.fileNum, bh.Offset); h.Get() != nil {
         .          .   2837:       if readHandle != nil {
         .          .   2838:           readHandle.RecordCacheHit(int64(bh.Offset), int64(bh.Length+blockTrailerLen))
         .          .   2839:       }
         .          .   2840:       if stats != nil {
         .          .   2841:           stats.BlockBytes += bh.Length
         .          .   2842:           stats.BlockBytesInCache += bh.Length
         .          .   2843:       }
         .          .   2844:       return h, nil
         .          .   2845:   }
         .          .   2846:
         .      520ms   2847:   v := r.opts.Cache.Alloc(int(bh.Length + blockTrailerLen))
         .          .   2848:   b := v.Buf()
         .          .   2849:   readStartTime := time.Now()
         .          .   2850:   var err error
         .          .   2851:   if readHandle != nil {
         .     85.25s   2852:       _, err = readHandle.ReadAt(b, int64(bh.Offset))
         .          .   2853:   } else {
         .       40ms   2854:       _, err = r.readable.ReadAt(b, int64(bh.Offset))
         .          .   2855:   }
         .          .   2856:   readDuration := time.Since(readStartTime)
         .          .   2857:   // TODO(sumeer): should the threshold be configurable.
         .          .   2858:   const slowReadTracingThreshold = 5 * time.Millisecond
         .          .   2859:   // The invariants.Enabled path is for deterministic testing.
         .          .   2860:   if invariants.Enabled {
         .          .   2861:       readDuration = slowReadTracingThreshold
         .          .   2862:   }
         .          .   2863:   // Call IsTracingEnabled to avoid the allocations of boxing integers into an
         .          .   2864:   // interface{}, unless necessary.
         .          .   2865:   if readDuration >= slowReadTracingThreshold && r.opts.LoggerAndTracer.IsTracingEnabled(ctx) {
         .          .   2866:       r.opts.LoggerAndTracer.Eventf(ctx, "reading %d bytes took %s",
         .          .   2867:           bh.Length+blockTrailerLen, readDuration.String())
         .          .   2868:   }
         .          .   2869:   if stats != nil {
         .          .   2870:       stats.BlockReadDuration += readDuration
         .          .   2871:   }
      10ms       10ms   2872:   if err != nil {
         .          .   2873:       r.opts.Cache.Free(v)
         .          .   2874:       return cache.Handle{}, err
         .          .   2875:   }
         .          .   2876:
      20ms    380.90s   2877:   if err := checkChecksum(r.checksumType, b, bh, r.fileNum); err != nil {
         .          .   2878:       r.opts.Cache.Free(v)
         .          .   2879:       return cache.Handle{}, err
         .          .   2880:   }
         .          .   2881:
         .          .   2882:   typ := blockType(b[bh.Length])
         .          .   2883:   b = b[:bh.Length]
         .          .   2884:   v.Truncate(len(b))
         .          .   2885:
         .      170ms   2886:   decoded, err := decompressBlock(r.opts.Cache, typ, b)
         .          .   2887:   if decoded != nil {
         .       40ms   2888:       r.opts.Cache.Free(v)
         .          .   2889:       v = decoded
         .          .   2890:       b = v.Buf()
         .          .   2891:   } else if err != nil {
         .          .   2892:       r.opts.Cache.Free(v)
         .          .   2893:       return cache.Handle{}, err
         .          .   2894:   }
         .          .   2895:
         .          .   2896:   if transform != nil {
         .          .   2897:       // Transforming blocks is rare, so the extra copy of the transformed data
         .          .   2898:       // is not problematic.
         .          .   2899:       var err error
         .          .   2900:       b, err = transform(b)
         .          .   2901:       if err != nil {
         .          .   2902:           r.opts.Cache.Free(v)
         .          .   2903:           return cache.Handle{}, err
         .          .   2904:       }
         .          .   2905:       newV := r.opts.Cache.Alloc(len(b))
         .          .   2906:       copy(newV.Buf(), b)
         .          .   2907:       r.opts.Cache.Free(v)
         .          .   2908:       v = newV
         .          .   2909:   }
         .          .   2910:
         .          .   2911:   if stats != nil {
         .          .   2912:       stats.BlockBytes += bh.Length
         .          .   2913:   }
         .          .   2914:
         .      280ms   2915:   h := r.opts.Cache.Set(r.cacheID, r.fileNum, bh.Offset, v)
         .          .   2916:   return h, nil
         .          .   2917:}
         .          .   2918:
         .          .   2919:func (r *Reader) transformRangeDelV1(b []byte) ([]byte, error) {
         .          .   2920:   // Convert v1 (RocksDB format) range-del blocks to v2 blocks on the fly. The

It looks like there are repeated cache misses, particularly for blocks within L0. The cost of reads scales linearly with the number of sublevels in L0.

karalabe commented 11 months ago

How big is the Options.Cache?

In the runs above, it is 1GB.

Have you set custom BlockSizes, and if so, what are they set to?

No. We do store everything in 2MB files though. It is insane on the file count, but performance-wise it has always outperformed the default settings.

How large of KVs are you storing in Pebble? And in particular, how large of KVs is this particular codepath reading?

I'd say that the keys are up to 64B in size and the values are mostly < 512B. The total size of the DB exhibiting the weird reads is around 300GB.

A single read that you can see in the stacks is for a single key, so around a 100-500B value.

I noticed stacks.txt contains a stack with > 100 stack frames in github.com/ethereum/go-ethereum/trie/triedb/pathdb.(*diffLayer).node (which ultimately seems to call into (*pebble.DB).Get). Is that expected?

Yes, it looks a bit odd but is expected. In our usage pattern we have a version of the data on disk, which is not expected to change much, plus 128 tiny modifications on top, across which we can jump from one version to another. The 128-deep call stack is just a reader trying to find a value in the latest 128 volatile modification layers, failing, and finally reaching down to the db to read the persistent version.
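To illustrate the shape of that lookup, here is a rough sketch (not the actual go-ethereum pathdb code; the names and types are made up): each volatile layer checks its own small map and recurses into its parent, and only the bottom layer issues a Pebble Get, which is where the 100+ frame stacks come from.

package sketch

import "github.com/cockroachdb/pebble"

// layer is a node lookup that may be backed by memory or by disk.
type layer interface {
	node(key []byte) ([]byte, error)
}

// diffLayer holds one small batch of in-memory modifications on top of a parent.
type diffLayer struct {
	nodes  map[string][]byte
	parent layer
}

func (dl *diffLayer) node(key []byte) ([]byte, error) {
	if v, ok := dl.nodes[string(key)]; ok {
		return v, nil // found in this volatile layer
	}
	return dl.parent.node(key) // recurse; ~128 of these build the deep stacks
}

// diskLayer is the bottom of the stack and the only layer that touches Pebble.
type diskLayer struct {
	db *pebble.DB
}

func (dl *diskLayer) node(key []byte) ([]byte, error) {
	v, closer, err := dl.db.Get(key)
	if err != nil {
		return nil, err
	}
	out := append([]byte(nil), v...) // copy before the buffer is released
	return out, closer.Close()
}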

Is it possible (*pebble.DB).Get is not looping, but the application is invoking (*pebble.DB).Get in a loop?

Absolutely possible - it still could be the case! - but the reason I find it odd is that our data processing and its associated access pattern are 100% deterministic. In the case of this strange read loop, if I stop the process and restart it, it will continue exactly where it left off with the same access patterns... yet it just runs smoothly without an issue.

Do you know what the read amplification was/is when the CPU saturation occurs?

Compared to the expected reads for the same workload?

Before the problematic amplification, our processing read 5GB of data from Pebble in 1h. Upon restarting and hitting this issue, the processing in the next hour read 23TB from Pebble. So approximately 4600x read amplification :/


One interesting data point is that this issue has only ever happened on startup. If it gets triggered at startup, it just keeps going: I once saw it subside after 3 hours, yesterday I stopped it after 4-ish, today I stopped it after 2. But if everything is OK at startup, I have never seen it appear later.

Also worth noting that I did try nuking out all the network traffic just to make sure it doesn't originate there and it did not help. Complete internet isolation still kept the same disk read load up.

But again, I cannot say for sure that this is a fault in Pebble. I'm just running out of ideas on what to check, and the first time we noticed it was after updating to the 23.1 release. Previously we were on some very old release and never saw this.

karalabe commented 11 months ago

FWIW I can try downgrading to an older Pebble out of curiosity, but given how unlikely this issue is to trigger in general, I won't really know whether it doesn't trigger because of the old Pebble or because it just hasn't triggered yet.

karalabe commented 11 months ago

Hmm, I may have a set of interesting metrics. We're tracking how many data items we hit in our in-ram caches and how many miss (go down to the disk). I should be able to compare how many db requests we've made when things operate normally and when crap hits the fan. Gotta run for a few hours but I'll get back with those charts.

In theory, if our metrics report that we're hitting Pebble with roughly the same load in both cases, that would be a strong case pointing towards a fault in Pebble. If our metrics report that we ourselves are having gazillions of cache misses, then it's clearly us who are hitting Pebble like crazy... Exciting... metrics rule. Brb.
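For what it's worth, the comparison boils down to a pair of counters around the disk fallback; a rough sketch (our real caches and metric names are different):

package sketch

import (
	"sync"
	"sync/atomic"

	"github.com/cockroachdb/pebble"
)

// countedCache counts in-RAM hits vs. misses; every miss is forwarded to
// Pebble, so the miss counter should track pebble Get volume roughly 1-to-1.
type countedCache struct {
	mu     sync.Mutex
	items  map[string][]byte
	db     *pebble.DB
	hits   atomic.Uint64
	misses atomic.Uint64
}

func (c *countedCache) get(key []byte) ([]byte, error) {
	c.mu.Lock()
	v, ok := c.items[string(key)]
	c.mu.Unlock()
	if ok {
		c.hits.Add(1)
		return v, nil
	}
	c.misses.Add(1)
	val, closer, err := c.db.Get(key)
	if err != nil {
		return nil, err
	}
	out := append([]byte(nil), val...) // copy before releasing the buffer
	return out, closer.Close()
}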

jbowens commented 11 months ago

Sounds good. If you're able to, can you also collect measurements of Metrics.ReadAmp over time? If data is written to Pebble more quickly than Pebble is able to compact it into lower levels, L0 can accumulate many files and, in particular, many sublevels. Every additional sublevel causes an additional block load during a read. I'm wondering if there's a spike in the number of sublevels, resulting in each Get spending much more time reading blocks.
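A sketch of the kind of sampling that would help, assuming the Metrics.ReadAmp helper and the Levels[0].Sublevels field from the Pebble version discussed here:

package sketch

import (
	"log"
	"time"

	"github.com/cockroachdb/pebble"
)

// sampleReadAmp logs read amplification and L0 shape every 10s so a sublevel
// spike around the slow window becomes visible. Field/method names assumed.
func sampleReadAmp(db *pebble.DB, stop <-chan struct{}) {
	t := time.NewTicker(10 * time.Second)
	defer t.Stop()
	for {
		select {
		case <-stop:
			return
		case <-t.C:
			m := db.Metrics()
			log.Printf("read-amp=%d l0-files=%d l0-sublevels=%d",
				m.ReadAmp(), m.Levels[0].NumFiles, m.Levels[0].Sublevels)
		}
	}
}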

One more question: how many CPUs does the process have available? We scale the block cache shard count according to the number of procs, and this scaling factor was changed recently (ce43e6535942905b82733884aafdca570c30e4ee).

karalabe commented 11 months ago

How many CPUs does the process have available?

48

karalabe commented 11 months ago

This is the CPU chart of our process. It was restarted in the middle, which is the point when it entered this weird CPU-heavy state.

[Screenshot 2023-09-28 at 21 42 20: process CPU usage chart]

Looking at our data processing pipeline, we track how many data items miss a cache and need to be loaded from disk, i.e. each miss is a database read (pretty much 1-to-1 with a pebble.Get).

[Screenshot 2023-09-28 at 21 42 01: cache-miss / database-read chart]

There's a single instantaneous spike, which is essentially the process starting up with a completely cold cache, so everything is a miss. After the initial spike, however, the database reads - as reported by this specific subsystem - are negligible. The process is so overloaded that it doesn't even get to issue the same number of reads as before.

In the original pprof image, the entire load originates from this processing pathway:

[Screenshot 2023-09-28 at 21 49 10: pprof view showing the load originating from that processing pathway]

And that pathway basically tracks all db reads as a cache miss.

My hunch is still that this amplification somehow originates from Pebble, with the above as seeming proof. Now, I could imagine too much junk accumulating somewhere and triggering slowdowns, but I doubt that, given that a forced full compaction didn't help and a plain restart makes the problem disappear.

My vote is still something funny in Pebble. That said, I cannot guarantee that some other code path doesn't hammer Pebble. But with networking down... I can't imagine what else there is to hit the disk.

karalabe commented 11 months ago

I'm adding a bit of support to expose the Pebble metrics as a string dump via https://pkg.go.dev/github.com/cockroachdb/pebble#Metrics.String and will get back if I have anything of interest.

karalabe commented 11 months ago

Restarted the thing with Pebble metrics now enabled. On startup (plus the few seconds I needed to pull the stats), I saw:

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         2   1.1 M       -   1.1 M       -       -       -       -   1.1 M       -       -       -     1.0
      0       196   716 M    0.50     0 B     0 B       0     0 B       0     0 B       0     0 B       1     0.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2        38    53 M    1.00   286 M     0 B       0     0 B       0   278 M     168   311 M       1     1.0
      3       334   449 M    1.00   238 M     0 B       0    19 M      10   501 M     353   583 M       1     2.1
      4       499   845 M    0.26   164 M     0 B       0     0 B       0   253 M     153   297 M       1     1.5
      5       360   3.4 M    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       1     0.0
      6    116048   214 G       -     0 B     0 B       0     0 B       0     0 B       0     0 B       1     0.0
  total    117475   216 G       -   1.1 M     0 B       0    19 M      10   1.0 G     674   1.2 G       6   949.8
  flush         0                             0 B       0       0  (ingest = tables-ingested, move = ingested-as-flushable)
compact       239   3.0 G     0 B       0                          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
  ctype       229       0       0      10       0       0       0  (default, delete, elision, move, read, rewrite, multi-level)
 memtbl         2   512 M
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     273 K   229 M   51.8%  (score == hit-rate)
 tcache      41 K    28 M   29.3%  (score == hit-rate)
  snaps         0       -       0  (score == earliest seq num)
 titers         0
 filter         -       -    0.0%  (score == utility)

About 30 mins later I'm at almost 1TB of reads, linearly going up with no sign of stopping.

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         2   116 M       -   116 M       -       -       -       -   116 M       -       -       -     1.0
      0       196   716 M    0.50     0 B     0 B       0     0 B       0     0 B       0     0 B       1     0.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2        38    53 M    1.00   286 M     0 B       0     0 B       0   278 M     168   311 M       1     1.0
      3       334   449 M    1.00   238 M     0 B       0    19 M      10   501 M     353   583 M       1     2.1
      4       499   845 M    0.26   164 M     0 B       0     0 B       0   253 M     153   297 M       1     1.5
      5       360   3.4 M    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       1     0.0
      6    116048   214 G       -     0 B     0 B       0     0 B       0     0 B       0     0 B       1     0.0
  total    117475   216 G       -   116 M     0 B       0    19 M      10   1.1 G     674   1.2 G       6     9.9
  flush         0                             0 B       0       0  (ingest = tables-ingested, move = ingested-as-flushable)
compact       239   3.0 G     0 B       0                          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
  ctype       229       0       0      10       0       0       0  (default, delete, elision, move, read, rewrite, multi-level)
 memtbl         2   512 M
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     755 K   1.5 G   84.4%  (score == hit-rate)
 tcache      30 K    20 M   83.9%  (score == hit-rate)
  snaps         0       -       0  (score == earliest seq num)
 titers         0
 filter         -       -    0.0%  (score == utility)

The CPU load is not as pronounced as before, but yesterday I did do a whole-database compaction, so that might explain why neither the CPU nor the disk is being hit as hard now.

Still, the database reads keep growing at an alarming rate. I'll try restarting to see if it normalizes and what that reports. Note that in a profiling run, most of the time is still spent by Pebble doing CRC hashing, so I assume it's loading data like crazy.

cpuprofile.txt

karalabe commented 11 months ago

After a restart everything seems to work OK; disk reads in the first 30 minutes were around 4GB total.

Pebble metrics after the 30 minute mark are:

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         2   110 M       -   110 M       -       -       -       -   110 M       -       -       -     1.0
      0       218   757 M    0.50     0 B     0 B       0     0 B       0     0 B       0     0 B       1     0.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2        42    62 M    1.00    92 M     0 B       0   203 M       1   110 M      68   114 M       1     1.2
      3       323   442 M    0.99   279 M     0 B       0   3.3 M       3   397 M     128   414 M       1     1.4
      4       533   1.1 G    0.33   272 M     0 B       0     0 B       0   308 M      63   318 M       1     1.1
      5       360   3.4 M    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       1     0.0
      6    116048   214 G       -     0 B     0 B       0     0 B       0     0 B       0     0 B       1     0.0
  total    117524   217 G       -   110 M     0 B       0   206 M       4   925 M     259   846 M       6     8.4
  flush         0                             0 B       0       0  (ingest = tables-ingested, move = ingested-as-flushable)
compact        83   3.3 G     0 B       0                          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
  ctype        79       0       0       4       0       0       0  (default, delete, elision, move, read, rewrite, multi-level)
 memtbl         2   512 M
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     592 K   1.4 G   85.6%  (score == hit-rate)
 tcache      33 K    22 M   78.6%  (score == hit-rate)
  snaps         0       -       0  (score == earliest seq num)
 titers         0
 filter         -       -    0.0%  (score == utility)

Pebble is also completely missing from the CPU profile, its load is so insignificant: cpu.profile.ok.txt

karalabe commented 11 months ago

Just to check whether it's networking or some background thing in Pebble: it seems to definitely originate from that specific data pipeline. I'll try to add support for mocking out even more of our code to isolate it better to that path. It just takes a non-trivial effort to jump through all the hoops - on a production server... 😓

karalabe commented 11 months ago

I think I've found something of interest. We have an API call to basically do a direct db read.

So I took a "faulty" run and stopped all processing in it; everything goes idle. Then I ran 2 read "benchmarks" (more on the key layout below):

1000x Pebble.Get("0x41" ++ <32 len random hex>) => 467ms
1000x Pebble.Get("0x4f" ++ <32 len random hex>) => 62500ms
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1   340 M       -   1.1 G       -       -       -       -   1.1 G       -       -       -     1.0
      0       271   938 M    0.54   795 M     0 B       0     0 B       0   535 M     175     0 B       1     0.7
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2        37    55 M    1.00   598 M     0 B       0     0 B       0   586 M     237   628 M       1     1.0
      3       343   448 M    0.99   546 M     0 B       0    12 M       6   950 M     558   1.2 G       1     1.7
      4       734   1.4 G    0.61   293 M     0 B       0     0 B       0   371 M     120   625 M       1     1.3
      5       360   3.4 M    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       1     0.0
      6    116048   214 G       -     0 B     0 B       0     0 B       0     0 B       0     0 B       1     0.0
  total    117793   217 G       -   1.1 G     0 B       0    12 M       6   3.5 G   1.1 K   2.4 G       6     3.2
  flush         2                             0 B       0       0  (ingest = tables-ingested, move = ingested-as-flushable)
compact       294   4.1 G     0 B       0                          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
  ctype       288       0       0       6       0       0       0  (default, delete, elision, move, read, rewrite, multi-level)
 memtbl         1   512 M
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     922 K   1.5 G   91.9%  (score == hit-rate)
 tcache      32 K    22 M   95.3%  (score == hit-rate)
  snaps         0       -       0  (score == earliest seq num)
 titers         0
 filter         -       -    0.0%  (score == utility)

After a restart, the process seems to execute properly fast again. Running the same benchmarks:

1000x Pebble.Get("0x41" ++ <32 len random hex>) => 445ms
1000x Pebble.Get("0x4f" ++ <32 len random hex>) => 629ms
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         2   104 M       -   103 M       -       -       -       -   104 M       -       -       -     1.0
      0       286   1.0 G    0.57     0 B     0 B       0     0 B       0     0 B       0     0 B       1     0.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2        35    51 M    0.97   177 M     0 B       0   171 M       1   162 M      98   180 M       1     0.9
      3       363   438 M    0.97   328 M     0 B       0   6.7 M       4   458 M     225   498 M       1     1.4
      4       799   1.5 G    0.64   305 M     0 B       0     0 B       0   407 M     145   621 M       1     1.3
      5       360   3.4 M    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       1     0.0
      6    116048   214 G       -     0 B     0 B       0     0 B       0     0 B       0     0 B       1     0.0
  total    117891   218 G       -   104 M     0 B       0   178 M       5   1.1 G     468   1.3 G       6    10.9
  flush         0                             0 B       0       0  (ingest = tables-ingested, move = ingested-as-flushable)
compact       167   4.4 G     0 B       0                          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
  ctype       162       0       0       5       0       0       0  (default, delete, elision, move, read, rewrite, multi-level)
 memtbl         2   512 M
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     483 K   1.1 G   81.6%  (score == hit-rate)
 tcache      38 K    26 M   69.4%  (score == hit-rate)
  snaps         0       -       0  (score == earliest seq num)
 titers         0
 filter         -       -    0.0%  (score == utility)

As for whether Pebble or we are to blame: now the good news. One key range worked just as fast in both the slow and the normal mode, while another key range exhibited a 100x slowdown in slow mode. The difference between them is key-range density.

This can also be seen in our internal metrics (we just couldn't fully trust them): data processing that touches the first key range is 100x faster than data processing related to the second key range.

Count/size-wise, the first key range is probably significantly smaller than the second - one order of magnitude for us, maybe two - but I'd need to gather some stats, and I don't have those lying around.
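For completeness, the benchmark above is roughly the following sketch; the key width and error handling are simplified, and the random keys are expected to mostly miss - the point is how long Pebble takes to reject them.

package sketch

import (
	"crypto/rand"
	"time"

	"github.com/cockroachdb/pebble"
)

// benchPrefix issues 1000 Gets for random 32-byte suffixes under a one-byte
// prefix (0x41 vs 0x4f above) and returns the wall time. Most lookups return
// pebble.ErrNotFound, which is fine for this measurement.
func benchPrefix(db *pebble.DB, prefix byte) time.Duration {
	start := time.Now()
	for i := 0; i < 1000; i++ {
		key := make([]byte, 33)
		key[0] = prefix
		if _, err := rand.Read(key[1:]); err != nil {
			panic(err)
		}
		if _, closer, err := db.Get(key); err == nil {
			closer.Close()
		}
	}
	return time.Since(start)
}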

jbowens commented 11 months ago

Both of the CPU profiles you provided that show significant DB.Get CPU usage indicate that it's all in reading L0 files.

         .          .    156:       if g.level == 0 {
         .          .    157:           // Create iterators from L0 from newest to oldest.
         .          .    158:           if n := len(g.l0); n > 0 {
         .       10ms    159:               files := g.l0[n-1].Iter()
         .          .    160:               g.l0 = g.l0[:n-1]
         .          .    161:               iterOpts := IterOptions{logger: g.logger}
         .          .    162:               g.levelIter.init(context.Background(), iterOpts, g.cmp, nil /* split */, g.newIters,
         .          .    163:                   files, manifest.L0Sublevel(n), internalIterOpts{})
         .          .    164:               g.levelIter.initRangeDel(&g.rangeDelIter)
         .          .    165:               g.iter = &g.levelIter
      20ms    466.06s    166:               g.iterKey, g.iterValue = g.iter.SeekGE(g.key, base.SeekGEFlagsNone)
         .          .    167:               continue
         .          .    168:           }
         .          .    169:           g.level++
         .          .    170:       }
        .          .    156:        if g.level == 0 {
         .          .    157:           // Create iterators from L0 from newest to oldest.
         .          .    158:           if n := len(g.l0); n > 0 {
         .          .    159:               files := g.l0[n-1].Iter()
         .          .    160:               g.l0 = g.l0[:n-1]
         .          .    161:               iterOpts := IterOptions{logger: g.logger}
         .          .    162:               g.levelIter.init(context.Background(), iterOpts, g.cmp, nil /* split */, g.newIters,
         .          .    163:                   files, manifest.L0Sublevel(n), internalIterOpts{})
         .          .    164:               g.levelIter.initRangeDel(&g.rangeDelIter)
         .          .    165:               g.iter = &g.levelIter
      10ms    119.77s    166:               g.iterKey, g.iterValue = g.iter.SeekGE(g.key, base.SeekGEFlagsNone)
         .          .    167:               continue
         .          .    168:           }
         .          .    169:           g.level++
         .          .    170:       }

The Metrics.String() output you've provided hasn't shown high sublevels, but they've all been taken after hundreds of compactions have already completed.

karalabe commented 11 months ago

The Metrics.String() output taken multiple times throughout while the CPU usage is high would be instructive.

You have two here: https://github.com/cockroachdb/pebble/issues/2945#issuecomment-1740389293

Is your write workload bursty?

Yes, we accumulate about 64MB of writes and then throw them at Pebble in one go. Accumulating takes a while, so Pebble would often be idling on writes (or seeing very few).
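That write pattern is roughly the sketch below; the 64MB threshold and the Sync choice are assumptions about our setup, not the exact configuration.

package sketch

import "github.com/cockroachdb/pebble"

const flushThreshold = 64 << 20 // ~64MB of encoded batch data (illustrative)

// writeBurst accumulates writes into a batch and commits roughly every 64MB,
// mirroring the bursty pattern described above.
func writeBurst(db *pebble.DB, kvs [][2][]byte) error {
	b := db.NewBatch()
	for _, kv := range kvs {
		if err := b.Set(kv[0], kv[1], nil); err != nil {
			return err
		}
		if b.Len() >= flushThreshold {
			if err := b.Commit(pebble.Sync); err != nil {
				return err
			}
			b = db.NewBatch()
		}
	}
	return b.Commit(pebble.Sync)
}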

Can you provide me with the contents of your OPTIONS-XXXXXX file?

Sure: OPTIONS-1062222.txt

karalabe commented 11 months ago

The Metrics.String() output you've provided hasn't shown high sublevels, but they've all been taken after hundreds of compactions have already completed.

I'm just restarting the same processing and continuing where the previous run left off. Sometimes it slows down, sometimes not. I'm still trying to find some correlation there.

karalabe commented 11 months ago

I am noticing a very, very weird pattern here, which I cannot begin to explain. Consider it a random braindump.

This process is run via Docker. Every time I update the container to a new image (fixing a typo is enough, no functional change needed), it triggers the slowdown. If I just nuke the container and recreate it, stop and restart it, whatever else - it doesn't happen.

karalabe commented 11 months ago

Hah, no. Just restarted it and it happened again on consecutive runs. So no, unrelated to the update. Perhaps some timing issue somewhere... ugh.

karalabe commented 11 months ago

FWIW, I've tried swapping between Pebble 23.1 and the master branch from a few weeks ago (one of our colleagues sent a few PRs in). So far, this issue has only happened on 23.1, not on master. That said, it's not exactly easy to trigger, so maybe master has just gotten lucky in my tests so far. Will keep trying.

karalabe commented 11 months ago

I've switched my prod instance over to Pebble master. I couldn't repro the issue with it, so let's see if running it the same way I ran 23.1 previously will trigger it. Will get back if anything suspicious appears.

karalabe commented 11 months ago

I've asked someone else who's been regularly hitting this issue (50% of restarts) to check with Pebble master, and they also confirmed that the issue is gone.

This pretty much confirms that Pebble was/is indeed at fault; however, I don't have the capacity to pin down exactly which commit in the last half a year fixed it, or why. I'm unsure what the best course of action on your part is here, given that it is a gnarly issue.

jbowens commented 11 months ago

I'm glad it appears resolved. CockroachDB does not make use of the DB.Get API at all, and we've not encountered any similar behavior, so I don't think we can dedicate the resources to root-cause it.

A bit of an aside, but do you provide a Split function on your Options.Comparer? Historically DB.Get hasn't taken advantage of bloom filters because the bloom filters are defined specifically over the Split-defined "prefix" of a key in order to support MVCC applications like CockroachDB. In #2846 a contributor added support for using bloom filters during DB.Get lookups if Options.Comparer.Split != nil. If you're not already providing a Split function, you may want to experiment with setting Split to func(key []byte) int { return len(key) }.
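A hedged sketch of wiring that up follows; cloning the default comparer keeps its Name, which has to match what the existing store was created with, and the cache size mirrors the 1GB mentioned earlier. Verify the details against the Pebble version in use.

package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	// Clone the default comparer and give it a whole-key Split so bloom
	// filters can be consulted on point lookups.
	comparer := *pebble.DefaultComparer
	comparer.Split = func(key []byte) int { return len(key) }

	cache := pebble.NewCache(1 << 30) // the 1GB block cache mentioned earlier
	defer cache.Unref()

	db, err := pebble.Open("/path/to/db", &pebble.Options{
		Comparer: &comparer,
		Cache:    cache,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}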

karalabe commented 11 months ago

I'm aware of that PR, yes - we sent it :)

For the comparer, we just use the Pebble default. Digging through the code, that doesn't seem to set a Split method. Perhaps that should also set something by default if it's relevant?

rjl493456442 commented 11 months ago

@jbowens the bloom filter will also be used in Get even if Split is not specified; the default prefix is just the whole user key.

jbowens commented 11 months ago

@jbowens the bloom filter will also be used in Get even if Split is not specified; the default prefix is just the whole user key.

Ah, nice! I missed that. It's possible that you stopped observing high CPU usage in this scenario simply because @rjl493456442's change eliminated unnecessary block loads through bloom filter exclusions. The root cause of the high CPU usage may remain and simply be mitigated.