uprendis opened this issue 2 years ago
I restarted the program, and after roughly 12 GB of written data the issue repeated. I checked CPU and disk utilization, and it turned out that the process is actively using the CPU and writing to disk. Maybe it isn't exactly a deadlock but rather an excessively long compaction.
Hi @uprendis,
It looks like the mutex was held by a completing compaction that was rotating the manifest. It would probably be useful to look at the metrics printout from db.Metrics().String() to see the number of files in the LSM. Also, was CPU saturated at the time?
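For example, something along these lines would print them (a minimal sketch; db stands for your application's already-open *pebble.DB, and the function name is just for illustration):

package main

import (
	"fmt"

	"github.com/cockroachdb/pebble"
)

// logLSMMetrics prints db.Metrics().String(), which includes the per-level
// file counts and sizes of the LSM.
func logLSMMetrics(db *pebble.DB) {
	fmt.Println(db.Metrics().String())
}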
goroutine 46792096 [runnable]:
github.com/cockroachdb/pebble/internal/manifest.(*LevelIterator).Next(0xc1230335e8)
/home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/internal/manifest/level_metadata.go:376 +0x170
github.com/cockroachdb/pebble.(*versionSet).createManifest(0xc057adae00, {0xc067ff7800, 0xc02efef5c0}, 0xc08c9767a0, 0x344c10, 0x36f378)
/home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/version_set.go:596 +0x372
github.com/cockroachdb/pebble.(*versionSet).logAndApply.func1(0xc057adae00, 0xc104bf4900, 0x2, 0xc123033ab0, 0xc123033a90, 0x36f377, 0xc11175a900, 0x0, 0x57d46)
/home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/version_set.go:425 +0x1e5
github.com/cockroachdb/pebble.(*versionSet).logAndApply(0xc057adae00, 0x2377974cd1bab, 0xc0a3087b80, 0xc0b80598254aa70b, 0x0, 0xc123033d18)
/home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/version_set.go:470 +0x2d3
github.com/cockroachdb/pebble.(*DB).compact1(0xc0001c6d80, 0xc0302b8dc0, 0x0)
/home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/compaction.go:1893 +0x410
github.com/cockroachdb/pebble.(*DB).compact.func1({0x18a8900, 0xc11b8df3b0})
/home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/compaction.go:1849 +0xad
runtime/pprof.Do({0x18a8890, 0xc0000400b0}, {{0xc0002bc8a0, 0xc1177e7680, 0xc09dbc88a0}}, 0xc072164f88)
/usr/local/go/src/runtime/pprof/runtime.go:40 +0xa3
github.com/cockroachdb/pebble.(*DB).compact(0x0, 0x0, 0xc07785b2c0)
/home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/compaction.go:1846 +0x6b
created by github.com/cockroachdb/pebble.(*DB).maybeScheduleCompactionPicker
/home/egor/go/pkg/mod/github.com/cockroachdb/pebble@v0.0.0-20220524133354-f30672e7240b/compaction.go:1640 +0x5c5
Metrics of the DB which stalled previously, retrieved shortly after restart while it's still processing data:
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
WAL 1 179 B - 146 B - - - - 179 B - - - 1.2
0 1229992 1.1 T 4.69 0 B 0 B 0 0 B 0 0 B 0 0 B 18 0.0
1 8650 33 G 4.70 125 M 0 B 0 0 B 0 124 M 33 125 M 1 1.0
2 16602 63 G 4.70 0 B 0 B 0 137 M 35 0 B 0 0 B 1 0.0
3 31752 121 G 4.70 0 B 0 B 0 145 M 39 0 B 0 0 B 1 0.0
4 63669 232 G 4.70 0 B 0 B 0 126 M 32 0 B 0 0 B 1 0.0
5 224487 445 G 1.90 0 B 0 B 0 76 M 19 0 B 0 0 B 1 0.0
6 1071546 2.1 T - 0 B 0 B 0 0 B 0 0 B 0 0 B 1 0.0
total 2646698 4.0 T - 179 B 0 B 0 484 M 125 124 M 33 125 M 24 729563.6
flush 0
compact 127 10 T 0 B 2 (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
ctype 2 0 0 125 0 0 (default, delete, elision, move, read, rewrite)
memtbl 1 256 K
zmemtbl 0 0 B
ztbl 0 0 B
bcache 406 K 138 M 0.4% (score == hit-rate)
tcache 202 K 131 M 0.3% (score == hit-rate)
snaps 0 - 0 (score == earliest seq num)
titers 0
filter - - 0.0% (score == utility)
The process utilizes ~2.5 cores out of 32.
It looks like you have some very tiny files (e.g., avg ~2 MB in L6). Are you using the default options or have you configured them? I'm especially interested in the values of MemTableSize and FlushSplitBytes, and the TargetFileSize on the per-level LevelOptions.
Such tiny files are a problem because compaction picking needs to iterate through all the files in a level, and rotating the manifest needs to iterate through all the files in the LSM.
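For reference, all of these knobs live on pebble.Options; a rough sketch of where they would be set (the values below are placeholders, not tuning advice):

opts := &pebble.Options{
	// MemTableSize bounds how large a memtable grows before it is flushed.
	MemTableSize: 64 << 20,
	// FlushSplitBytes controls how eagerly a flush is split into multiple
	// output sstables.
	FlushSplitBytes: 4 << 20,
}
// TargetFileSize is configured per level; Pebble's defaults start at 2 MB
// for L0 and double at each deeper level.
opts.Levels = make([]pebble.LevelOptions, 7)
for i := range opts.Levels {
	opts.Levels[i].TargetFileSize = (2 << 20) << uint(i) // 2 MB, 4 MB, ..., 128 MB
	opts.Levels[i].EnsureDefaults()
}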
Here are the options for this DB:
db, err := pebble.Open(path, &pebble.Options{
	Cache:                    pebble.NewCache(345490132), // default 8 MB
	MemTableSize:             172745066,                  // default 4 MB
	MaxOpenFiles:             181254,                     // default 1000
	WALBytesPerSync:          0,                          // default 0 (matches RocksDB = no background syncing)
	MaxConcurrentCompactions: 3,                          // default 1, important for big imports performance
})
MaxOpenFiles is adjusted based on the process FD limit. Should it be capped?
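Roughly like this (a simplified sketch of the approach, not the actual code; the headroom and fallback values are arbitrary):

import "golang.org/x/sys/unix"

// maxOpenFilesFromRlimit derives a MaxOpenFiles value from the process
// RLIMIT_NOFILE, leaving some headroom for sockets and other descriptors.
func maxOpenFilesFromRlimit() int {
	var lim unix.Rlimit
	if err := unix.Getrlimit(unix.RLIMIT_NOFILE, &lim); err != nil {
		return 1000 // fall back to Pebble's default
	}
	const headroom = 1024
	if lim.Cur <= headroom {
		return 1000
	}
	return int(lim.Cur - headroom)
}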
The MaxOpenFiles value should be fine. If you're able to provide it, I'd be interested in the metrics string after the database has been running for a while.
Can you tell me anything about your workload? Is it, for example, mostly randomly distributed writes? Or sequentially ordered writes? Are you writing to the database through Batch application or through Ingest?
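To make the distinction concrete, here is a rough sketch of the two write paths (not your code; the keys, values, and sstable path are made up):

// writePaths illustrates the two ways records typically enter Pebble.
func writePaths(db *pebble.DB) error {
	// Batch application: writes go through the WAL and the memtable, and are
	// flushed into L0 later.
	b := db.NewBatch()
	if err := b.Set([]byte("some-key"), []byte("some-value"), nil); err != nil {
		return err
	}
	if err := db.Apply(b, pebble.Sync); err != nil {
		return err
	}

	// Ingestion: pre-built sstables are linked directly into the LSM,
	// bypassing the WAL and the memtable.
	return db.Ingest([]string{"/tmp/prebuilt.sst"})
}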
Here are the metrics after it worked for 32 minutes and wrote ~3.7 GB:
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
WAL 1 0 B - 0 B - - - - 0 B - - - 0.0
0 1227370 1.1 T 4.69 0 B 0 B 0 0 B 0 0 B 0 0 B 18 0.0
1 8641 33 G 4.69 2.4 G 0 B 0 0 B 0 2.4 G 645 2.4 G 1 1.0
2 16608 63 G 4.69 0 B 0 B 0 2.5 G 648 0 B 0 0 B 1 0.0
3 31808 121 G 4.69 0 B 0 B 0 2.4 G 640 0 B 0 0 B 1 0.0
4 63847 232 G 4.69 0 B 0 B 0 2.2 G 588 0 B 0 0 B 1 0.0
5 224996 447 G 1.91 0 B 0 B 0 1.6 G 431 0 B 0 0 B 1 0.0
6 1071546 2.1 T - 0 B 0 B 0 0 B 0 0 B 0 0 B 1 0.0
total 2644816 4.0 T - 0 B 0 B 0 8.8 G 2.3 K 2.4 G 645 2.4 G 24 0.0
flush 0
compact 2336 10 T 0 B 3 (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
ctype 29 0 0 2307 0 0 (default, delete, elision, move, read, rewrite)
memtbl 1 256 K
zmemtbl 0 0 B
ztbl 0 0 B
bcache 339 K 329 M 60.5% (score == hit-rate)
tcache 362 K 235 M 71.5% (score == hit-rate)
snaps 0 - 0 (score == earliest seq num)
titers 0
filter - - 0.0% (score == utility)
This DB went through the following:
1. …
2. … after a few days

Subsequent attempts to restart it didn't include step 1.
All the DBs are stored on a ZFS file system.
Metrics after it worked for 3 hours and wrote ~12 GB of data. It stalled a few seconds after this snapshot was taken:
__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
WAL 1 0 B - 0 B - - - - 0 B - - - 0.0
0 1215802 1.1 T 4.65 0 B 0 B 0 0 B 0 0 B 0 0 B 18 0.0
1 8609 33 G 4.67 10 G 0 B 0 0 B 0 10 G 2.7 K 10 G 1 1.0
2 16628 63 G 4.66 0 B 0 B 0 10 G 2.8 K 0 B 0 0 B 1 0.0
3 32045 122 G 4.66 0 B 0 B 0 10 G 2.7 K 0 B 0 0 B 1 0.0
4 64624 236 G 4.66 0 B 0 B 0 9.7 G 2.5 K 0 B 0 0 B 1 0.0
5 227151 455 G 1.94 0 B 0 B 0 7.1 G 1.9 K 0 B 0 0 B 1 0.0
6 1071546 2.1 T - 0 B 0 B 0 0 B 0 0 B 0 0 B 1 0.0
total 2636405 4.0 T - 0 B 0 B 0 38 G 9.9 K 10 G 2.7 K 10 G 24 0.0
flush 0
compact 10026 10 T 19 M 3 (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
ctype 105 0 0 9921 0 0 (default, delete, elision, move, read, rewrite)
memtbl 1 256 K
zmemtbl 0 0 B
ztbl 0 0 B
bcache 217 K 329 M 69.2% (score == hit-rate)
tcache 362 K 235 M 85.6% (score == hit-rate)
snaps 0 - 0 (score == earliest seq num)
titers 0
filter - - 0.0% (score == utility)
Thanks, this was helpful.
I manually attempted to run a full DB compaction a few times, which always led to a crash due to running out of disk space (it had roughly 80 GB of available space before the compaction).
A full database compaction requires writing all the data to new files before removing the old files, so in this case you would need ~4 T of disk capacity to make that work. You could try dividing the keyspace into 100 portions and manually compacting each portion one after the other.
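Something along these lines (a rough sketch only: it splits on the first key byte into 256 portions rather than 100, the upper bound is an assumed sentinel, and the exact Compact signature differs between Pebble versions, with older ones taking only start/end and newer ones also taking a context):

// compactInPortions manually compacts the keyspace one slice at a time so
// that obsolete files can be deleted between steps, keeping peak disk usage
// bounded.
func compactInPortions(db *pebble.DB) error {
	for b := 0; b < 256; b++ {
		start := []byte{byte(b)}
		end := []byte{0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff} // assumed max key
		if b < 255 {
			end = []byte{byte(b + 1)}
		}
		if err := db.Compact(start, end, false /* parallelize */); err != nil {
			return err
		}
	}
	return nil
}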
ctype 105 0 0 9921 0 0 (default, delete, elision, move, read, rewrite)
I think the tiny files are staying around because of move compactions. There were 9,921 move compactions, which move a file to another level without rewriting it. If the file was tiny before the move compaction, it's still tiny after the move compaction.
I'm still not sure where the tiny files originated.
I can upload the DB to your server if it'll help
Hi @uprendis, we recently merged caa5fcde4edd1ac135a7c6390e0486721464bc29 which adapts our manifest rollover logic to gracefully handle very large manifests, like the one you encountered here. You may find it worthwhile to try the latest Pebble.
Sounds great!
Encountered deadlock on pebble.DB.Apply while running commit f30672e7240b.

The deadlock was encountered only once, on a program that was actively updating the DBs for a week. The program had 4 opened Pebble DBs plus one LevelDB. pebble.DB.Apply was stuck for ~34h, after which the program was forcefully killed. The whole stacktrace is included in the file: https://github.com/uprendis/go-opera/releases/download/stacktrace/st-pebble-deadlock.txt

Jira issue: PEBBLE-154