cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go
BSD 3-Clause "New" or "Revised" License
4.81k stars 448 forks source link

db: ingest sstable regression #4001

Open jeffswenson opened 4 hours ago

jeffswenson commented 4 hours ago

The DR team maintains a roachtest that measures how long it takes to complete the link phase of an online restore. The roachtest recently started failing due to a regression in remote sst ingestion performance (https://github.com/cockroachdb/cockroach/issues/129924).

I dug through goroutine stacks of an impacted cluster and realized all of the ingests were blocked waiting for the DB mutex. In the stack dumps, I see db.EstimateDiskUsageByBackingType holding the db mutex while making a remote HTTP call to fetch the size of an ingested SST.

...
8   github.com/cockroachdb/pebble.(*DB).EstimateDiskUsageByBackingType com_github_cockroachdb_pebble/db.go:2305
9   github.com/cockroachdb/pebble/internal/manifest.(*Annotator[go.shape.uint64]).VersionRangeAnnotation manifest/annotator.go:293
10  github.com/cockroachdb/pebble/internal/manifest.(*Annotator[go.shape.uint64]).VersionRangeAnnotation.func1 manifest/annotator.go:286
11  github.com/cockroachdb/pebble/internal/manifest.(*Annotator[go.shape.uint64]).accumulateRangeAnnotation manifest/annotator.go:205
12  github.com/cockroachdb/pebble/internal/manifest.(*Annotator[go.shape.uint64]).accumulateRangeAnnotation manifest/annotator.go:205
13  github.com/cockroachdb/pebble/internal/manifest.(*Annotator[go.shape.uint64]).accumulateRangeAnnotation manifest/annotator.go:180
14  github.com/cockroachdb/pebble/internal/manifest.(*SumAggregator).AccumulatePartialOverlap ./<autogenerated>:1
15  github.com/cockroachdb/pebble/internal/manifest.SumAggregator.AccumulatePartialOverlap manifest/annotator.go:345
16  github.com/cockroachdb/pebble.Open.(*DB).makeFileSizeAnnotator.func17 com_github_cockroachdb_pebble/db.go:2252
17  github.com/cockroachdb/pebble.(*tableCacheContainer).estimateSize com_github_cockroachdb_pebble/table_cache.go:213
18  github.com/cockroachdb/pebble.(*tableCacheContainer).withCommonReader com_github_cockroachdb_pebble/table_cache.go:237
19  github.com/cockroachdb/pebble.(*tableCacheShard).findNode com_github_cockroachdb_pebble/table_cache.go:791
20  github.com/cockroachdb/pebble.(*tableCacheShard).findNodeInternal com_github_cockroachdb_pebble/table_cache.go:878
21  runtime/pprof.Do pprof/runtime.go:51
22  github.com/cockroachdb/pebble.(*tableCacheShard).findNodeInternal.func2 com_github_cockroachdb_pebble/table_cache.go:879
23  github.com/cockroachdb/pebble.(*tableCacheValue).load com_github_cockroachdb_pebble/table_cache.go:1116
24  github.com/cockroachdb/pebble/objstorage/objstorageprovider.(*provider).OpenForReading objstorageprovider/provider.go:306
25  github.com/cockroachdb/pebble/objstorage/objstorageprovider.(*provider).remoteOpenForReading objstorageprovider/remote.go:344
26  github.com/cockroachdb/cockroach/pkg/storage.(*externalStorageWrapper).ReadObject storage/shared_storage.go:116
27  github.com/cockroachdb/cockroach/pkg/cloud.(*esWrapper).Size ./<autogenerated>:1
28  github.com/cockroachdb/cockroach/pkg/cloud/gcp.(*gcsStorage).Size gcp/gcs_storage.go:376
29  github.com/cockroachdb/cockroach/pkg/util/timeutil.RunWithTimeout timeutil/timeout.go:33
30  github.com/cockroachdb/cockroach/pkg/cloud/gcp.(*gcsStorage).Size.func1 gcp/gcs_storage.go:380
31  cloud.google.com/go/storage.(*ObjectHandle).NewReader com_google_cloud_go_storage/reader.go:76
32  cloud.google.com/go/storage.(*ObjectHandle).NewRangeReader com_google_cloud_go_storage/reader.go:118
33  cloud.google.com/go/storage.(*httpStorageClient).NewRangeReader com_google_cloud_go_storage/http_client.go:887
34  cloud.google.com/go/storage.(*httpStorageClient).NewRangeReader.func2 com_google_cloud_go_storage/http_client.go:833
35  cloud.google.com/go/storage.run com_google_cloud_go_storage/invoke.go:65
36  cloud.google.com/go/internal.Retry internal/retry.go:33
37  cloud.google.com/go/internal.retry internal/retry.go:40
38  cloud.google.com/go/storage.run.func1 com_google_cloud_go_storage/invoke.go:67
39  cloud.google.com/go/storage.(*httpStorageClient).NewRangeReader.func2.1 com_google_cloud_go_storage/http_client.go:834
40  net/http.(*Client).Do http/client.go:590
...

This points to https://github.com/cockroachdb/pebble/pull/3848 as the source of the regression. I built a version of cockroach with #3848 reverted and confirmed the test passes. The relevant change in the PR is this block of code that calulate stats while holding the DB mutex:

image

https://cockroachlabs.slack.com/archives/C03JCUUSCD6/p1727968615845939 contains some internal discussion on the issue.

Jira issue: PEBBLE-269

itsbilal commented 4 hours ago

That part of the code should be holding a ref to the readState already, so I don't see why we also need to hold the db mutex. Nice find! Definitely worth addressing soon (and possibly also backporting the fix to 24.2/24.3, seeing as even for local sstables it's undesirable to do IO while holding the db mutex).

jbowens commented 3 hours ago

That part of the code should be holding a ref to the readState already, so I don't see why we also need to hold the db mutex.

Unfortunately the DB.mu is currently used to provide mutual exclusion for the cached annotation values, so we'll need to account for that.