cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go

compaction: slow manifest construction prevents WAL rotation #1915

Open · uprendis opened this issue 1 year ago

uprendis commented 1 year ago

Encountered a deadlock on pebble.DB.Apply while running commit f30672e7240b:

goroutine 24171281 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc0001c6ff0, 0x2921d)
    /usr/local/go/src/runtime/sema.go:513 +0x13d
sync.(*Cond).Wait(0xc0001c6fe0)
    /usr/local/go/src/sync/cond.go:56 +0x8c
github.com/cockroachdb/pebble.(*DB).makeRoomForWrite(0xc0001c6d80, 0xc026a2b8c0)
    /home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/db.go:1722 +0x2f1
github.com/cockroachdb/pebble.(*DB).commitWrite(0xc0001c6d80, 0xc026a2b8c0, 0xc0001eb820, 0xc026a2b8c0)
    /home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/db.go:818 +0x14b
github.com/cockroachdb/pebble.(*commitPipeline).prepare(0xc0001eb300, 0xc026a2b8c0, 0x12)
    /home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/commit.go:379 +0x144
github.com/cockroachdb/pebble.(*commitPipeline).Commit(0xc0001eb300, 0xc026a2b8c0, 0x95)
    /home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/commit.go:253 +0x5e
github.com/cockroachdb/pebble.(*DB).Apply(0xc0001c6d80, 0xc026a2b8c0, 0xdd498a)
    /home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/db.go:746 +0x1d1
github.com/Fantom-foundation/lachesis-base/kvdb/pebble.(*batch).Write(0xc08ba8de58)

The deadlock was encountered only once, on a program that had been actively updating the DBs for a week. The program had 4 open Pebble DBs plus one LevelDB. pebble.DB.Apply was stuck for ~34h, after which the program was forcefully killed. The full stack trace is in this file: https://github.com/uprendis/go-opera/releases/download/stacktrace/st-pebble-deadlock.txt

Jira issue: PEBBLE-154

uprendis commented 1 year ago

I restarted the program, and after roughly 12 GB of written data the issue repeated. I checked CPU and disk utilization, and it turned out the process was actively using CPU and writing to disk. So maybe it isn't exactly a deadlock, but rather an excessively long compaction.

jbowens commented 1 year ago

Hi @uprendis,

It looks like the mutex was held by a completing compaction that was rotating the manifest. It would probably be useful to look at the metrics printout from db.Metrics().String() to see the number of files in the LSM. Also, was the CPU saturated at the time?

goroutine 46792096 [runnable]:
github.com/cockroachdb/pebble/internal/manifest.(*LevelIterator).Next(0xc1230335e8)
    /home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/internal/manifest/level_metadata.go:376 +0x170
github.com/cockroachdb/pebble.(*versionSet).createManifest(0xc057adae00, {0xc067ff7800, 0xc02efef5c0}, 0xc08c9767a0, 0x344c10, 0x36f378)
    /home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/version_set.go:596 +0x372
github.com/cockroachdb/pebble.(*versionSet).logAndApply.func1(0xc057adae00, 0xc104bf4900, 0x2, 0xc123033ab0, 0xc123033a90, 0x36f377, 0xc11175a900, 0x0, 0x57d46)
    /home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/version_set.go:425 +0x1e5
github.com/cockroachdb/pebble.(*versionSet).logAndApply(0xc057adae00, 0x2377974cd1bab, 0xc0a3087b80, 0xc0b80598254aa70b, 0x0, 0xc123033d18)
    /home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/version_set.go:470 +0x2d3
github.com/cockroachdb/pebble.(*DB).compact1(0xc0001c6d80, 0xc0302b8dc0, 0x0)
    /home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/compaction.go:1893 +0x410
github.com/cockroachdb/pebble.(*DB).compact.func1({0x18a8900, 0xc11b8df3b0})
    /home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/compaction.go:1849 +0xad
runtime/pprof.Do({0x18a8890, 0xc0000400b0}, {{0xc0002bc8a0, 0xc1177e7680, 0xc09dbc88a0}}, 0xc072164f88)
    /usr/local/go/src/runtime/pprof/runtime.go:40 +0xa3
github.com/cockroachdb/pebble.(*DB).compact(0x0, 0x0, 0xc07785b2c0)
    /home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/compaction.go:1846 +0x6b
created by github.com/cockroachdb/pebble.(*DB).maybeScheduleCompactionPicker
    /home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/compaction.go:1640 +0x5c5
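
For reference, a minimal sketch (not part of this issue) of dumping the db.Metrics().String() output mentioned above; the path, interval, and options are illustrative:

    package main

    import (
        "log"
        "time"

        "github.com/cockroachdb/pebble"
    )

    func main() {
        // Open a throwaway DB; the path and options are illustrative.
        db, err := pebble.Open("demo-db", &pebble.Options{})
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Print the LSM metrics (per-level file counts, compaction stats, ...)
        // once a minute.
        for range time.Tick(time.Minute) {
            log.Printf("pebble metrics:\n%s", db.Metrics().String())
        }
    }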

uprendis commented 1 year ago

Metrics of the DB that stalled previously, retrieved shortly after a restart while it's still processing data:

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1   179 B       -   146 B       -       -       -       -   179 B       -       -       -     1.2
      0   1229992   1.1 T    4.69     0 B     0 B       0     0 B       0     0 B       0     0 B      18     0.0
      1      8650    33 G    4.70   125 M     0 B       0     0 B       0   124 M      33   125 M       1     1.0
      2     16602    63 G    4.70     0 B     0 B       0   137 M      35     0 B       0     0 B       1     0.0
      3     31752   121 G    4.70     0 B     0 B       0   145 M      39     0 B       0     0 B       1     0.0
      4     63669   232 G    4.70     0 B     0 B       0   126 M      32     0 B       0     0 B       1     0.0
      5    224487   445 G    1.90     0 B     0 B       0    76 M      19     0 B       0     0 B       1     0.0
      6   1071546   2.1 T       -     0 B     0 B       0     0 B       0     0 B       0     0 B       1     0.0
  total   2646698   4.0 T       -   179 B     0 B       0   484 M     125   124 M      33   125 M      24 729563.6
  flush         0
compact       127    10 T     0 B       2          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
  ctype         2       0       0     125       0       0  (default, delete, elision, move, read, rewrite)
 memtbl         1   256 K
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     406 K   138 M    0.4%  (score == hit-rate)
 tcache     202 K   131 M    0.3%  (score == hit-rate)
  snaps         0       -       0  (score == earliest seq num)
 titers         0
 filter         -       -    0.0%  (score == utility)

It utilizes ~2.5 cores out of 32.

jbowens commented 1 year ago

It looks like you have some very tiny files (e.g., avg ~2 MB in L6). Are you using the default options, or have you configured them? I'm especially interested in the values of MemTableSize and FlushSplitBytes, and the per-level TargetFileSize in LevelOptions.

Such tiny files are a problem because compaction picking needs to iterate through all the files in a level, and rotating the manifest needs to iterate through all the files in the LSM.
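
In case it's useful, here's a minimal sketch of where those knobs live on pebble.Options; the values below (64 MB memtable, 4 MB flush-split bytes, 8 MB base target file size with 2x growth per level) are purely illustrative, not recommendations from this thread:

    package main

    import (
        "log"

        "github.com/cockroachdb/pebble"
    )

    func main() {
        opts := &pebble.Options{
            MemTableSize:    64 << 20, // illustrative
            FlushSplitBytes: 4 << 20,  // illustrative
        }
        // Per-level options; doubling TargetFileSize per level mirrors
        // Pebble's default pattern, but the 8 MB base here is arbitrary.
        opts.Levels = make([]pebble.LevelOptions, 7)
        for i := range opts.Levels {
            if i == 0 {
                opts.Levels[i].TargetFileSize = 8 << 20
            } else {
                opts.Levels[i].TargetFileSize = opts.Levels[i-1].TargetFileSize * 2
            }
        }

        db, err := pebble.Open("demo-db", opts)
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()
    }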

uprendis commented 1 year ago

Here are options for this DB

    db, err := pebble.Open(path, &pebble.Options{
        Cache:                    pebble.NewCache(345490132), // default 8 MB
        MemTableSize:             172745066,                  // default 4 MB
        MaxOpenFiles:             181254,                     // default 1000
        WALBytesPerSync:          0,                          // default 0 (matches RocksDB = no background syncing)
        MaxConcurrentCompactions: 3,                          // default 1, important for big imports performance
    })

MaxOpenFiles is adjusted based on the process FD limit. Should it be capped?

jbowens commented 1 year ago

The MaxOpenFiles value should be fine. If you're able to provide it, I'd be interested in the metrics string after the database has been running for a while.

Can you tell me anything about your workload? Is it, for example, mostly randomly distributed writes? Or sequentially ordered writes? Are you writing to the database through Batch application or through Ingest?
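
To make the question concrete, here are the two write paths in a minimal sketch (keys, values, and the sstable path are illustrative):

    package main

    import (
        "log"

        "github.com/cockroachdb/pebble"
    )

    func main() {
        db, err := pebble.Open("demo-db", &pebble.Options{})
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Write path 1: batch application. Writes go through the WAL and
        // memtable and reach the LSM via flushes.
        b := db.NewBatch()
        if err := b.Set([]byte("key"), []byte("value"), nil); err != nil {
            log.Fatal(err)
        }
        if err := b.Commit(pebble.Sync); err != nil {
            log.Fatal(err)
        }

        // Write path 2: ingestion. Pre-built sstables are linked directly
        // into the LSM; the path below doesn't exist and is illustrative only.
        if err := db.Ingest([]string{"/tmp/prebuilt.sst"}); err != nil {
            log.Printf("ingest failed (expected with a nonexistent file): %v", err)
        }
    }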

uprendis commented 1 year ago

Here are the metrics after it ran for 32 minutes and wrote ~3.7 GB:

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1     0 B       -     0 B       -       -       -       -     0 B       -       -       -     0.0
      0   1227370   1.1 T    4.69     0 B     0 B       0     0 B       0     0 B       0     0 B      18     0.0
      1      8641    33 G    4.69   2.4 G     0 B       0     0 B       0   2.4 G     645   2.4 G       1     1.0
      2     16608    63 G    4.69     0 B     0 B       0   2.5 G     648     0 B       0     0 B       1     0.0
      3     31808   121 G    4.69     0 B     0 B       0   2.4 G     640     0 B       0     0 B       1     0.0
      4     63847   232 G    4.69     0 B     0 B       0   2.2 G     588     0 B       0     0 B       1     0.0
      5    224996   447 G    1.91     0 B     0 B       0   1.6 G     431     0 B       0     0 B       1     0.0
      6   1071546   2.1 T       -     0 B     0 B       0     0 B       0     0 B       0     0 B       1     0.0
  total   2644816   4.0 T       -     0 B     0 B       0   8.8 G   2.3 K   2.4 G     645   2.4 G      24     0.0
  flush         0
compact      2336    10 T     0 B       3          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
  ctype        29       0       0    2307       0       0  (default, delete, elision, move, read, rewrite)
 memtbl         1   256 K
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     339 K   329 M   60.5%  (score == hit-rate)
 tcache     362 K   235 M   71.5%  (score == hit-rate)
  snaps         0       -       0  (score == earliest seq num)
 titers         0
 filter         -       -    0.0%  (score == utility)

This DB went through the following:

  1. Copied data from another DB into it (strictly in lexicographic order) through batches, where each batch was ~100 KB. During the copying, I manually attempted to run a full DB compaction a few times, which always crashed due to running out of disk space (it had roughly ~80 GB of available space before compaction). That's when ~95% of the data was written.
  2. It then wrote randomly distributed data through batches, where each batch was ~100 KB.
  3. It stalled on step 2 after a few days.

Subsequent attempts to restart it didn't include step 1.

All the DBs are stored on a ZFS file system.

uprendis commented 1 year ago

Metrics after it ran for 3 hours and wrote ~12 GB of data. It stalled a few seconds after these metrics were captured:

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1     0 B       -     0 B       -       -       -       -     0 B       -       -       -     0.0
      0   1215802   1.1 T    4.65     0 B     0 B       0     0 B       0     0 B       0     0 B      18     0.0
      1      8609    33 G    4.67    10 G     0 B       0     0 B       0    10 G   2.7 K    10 G       1     1.0
      2     16628    63 G    4.66     0 B     0 B       0    10 G   2.8 K     0 B       0     0 B       1     0.0
      3     32045   122 G    4.66     0 B     0 B       0    10 G   2.7 K     0 B       0     0 B       1     0.0
      4     64624   236 G    4.66     0 B     0 B       0   9.7 G   2.5 K     0 B       0     0 B       1     0.0
      5    227151   455 G    1.94     0 B     0 B       0   7.1 G   1.9 K     0 B       0     0 B       1     0.0
      6   1071546   2.1 T       -     0 B     0 B       0     0 B       0     0 B       0     0 B       1     0.0
  total   2636405   4.0 T       -     0 B     0 B       0    38 G   9.9 K    10 G   2.7 K    10 G      24     0.0
  flush         0
compact     10026    10 T    19 M       3          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
  ctype       105       0       0    9921       0       0  (default, delete, elision, move, read, rewrite)
 memtbl         1   256 K
zmemtbl         0     0 B
   ztbl         0     0 B
 bcache     217 K   329 M   69.2%  (score == hit-rate)
 tcache     362 K   235 M   85.6%  (score == hit-rate)
  snaps         0       -       0  (score == earliest seq num)
 titers         0
 filter         -       -    0.0%  (score == utility)

jbowens commented 1 year ago

Thanks, this was helpful.

I manually attempted to run full DB compaction a few times, which always led to crashing due to running out of disk space (it had roughly ~80 GB of available space before compaction).

A full database compaction requires writing all the data to new files before removing the old files, so in this case you would need ~4 T of disk capacity to make that work. You could try dividing the keyspace into 100 portions and manually compacting each portion one after the other.
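
A minimal sketch of that approach, splitting by leading key byte (256 slices) rather than exactly 100; the right boundaries depend on your key encoding, and older Pebble versions take Compact(start, end) without the trailing parallelize argument:

    package main

    import (
        "log"

        "github.com/cockroachdb/pebble"
    )

    func main() {
        db, err := pebble.Open("demo-db", &pebble.Options{})
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Compact the keyspace slice by slice so the temporary extra disk
        // space is bounded by one slice rather than the whole LSM.
        for b := 0; b < 256; b++ {
            start := []byte{byte(b)}
            var end []byte
            if b < 255 {
                end = []byte{byte(b + 1)}
            } else {
                // Illustrative upper bound for the last slice.
                end = []byte{0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff}
            }
            if err := db.Compact(start, end, false /* parallelize */); err != nil {
                log.Fatal(err)
            }
        }
    }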

  ctype       105       0       0    9921       0       0  (default, delete, elision, move, read, rewrite)

I think the tiny files are staying around because of move compactions. There were 9,921 move compactions, which move a file to another level without rewriting it. If the file was tiny before the move compaction, it's still tiny after the move compaction.

I'm still not sure where the tiny files originated.

uprendis commented 1 year ago

I can upload the DB to your server if it'll help.

jbowens commented 1 year ago

Hi @uprendis, we recently merged caa5fcde4edd1ac135a7c6390e0486721464bc29, which adapts our manifest rollover logic to gracefully handle very large manifests like the one you encountered here. You may find it worthwhile to try the latest Pebble.

uprendis commented 1 year ago

Sounds great!