cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go
BSD 3-Clause "New" or "Revised" License
4.81k stars 449 forks source link

remove potential os.Exit and enforce via linter #2061

Open srosenberg opened 1 year ago

srosenberg commented 1 year ago

In light of a recent exit 1 bug [1], [2], it seems prudent to revisit the contract wrt os.Exit vs. panic. Since pebble is consumed by CRDB as a library, exit should be eschewed. In CRDB, os.Exit is forbidden; this is enforced by the TestOsExit linter [3]. Running the linter against pebble reveals a number of potential os.Exit cases,

git grep -nEw "os\.Exit" -- "*.go" ':!*_test.go' ':!tool'
internal/cache/clockpro.go:407:                 os.Exit(1)
internal/cache/clockpro.go:412:                 os.Exit(1)
internal/cache/clockpro.go:417:                 os.Exit(1)
internal/cache/clockpro.go:439:                         os.Exit(1)
internal/cache/clockpro.go:453:                 os.Exit(1)
internal/cache/clockpro.go:680:                 os.Exit(1)
internal/cache/entry_invariants.go:31:                  os.Exit(1)
internal/cache/robin_hood.go:142:                       os.Exit(1)
internal/cache/robin_hood.go:299:                               os.Exit(1)
internal/cache/robin_hood.go:304:                               os.Exit(1)
internal/cache/value_invariants.go:38:                  os.Exit(1)
logger.go:32:   os.Exit(1)
mem_table.go:105:               os.Exit(1)
open.go:507:                    os.Exit(1)
sstable/reader.go:237:          os.Exit(1)
sstable/reader.go:241:          os.Exit(1)
sstable/reader.go:249:          os.Exit(1)
sstable/reader.go:253:          os.Exit(1)

[1] https://github.com/cockroachdb/pebble/pull/2020 [2] https://github.com/cockroachdb/cockroach/pull/90478 [3] https://github.com/cockroachdb/cockroach/blob/master/pkg/testutils/lint/lint_test.go#L1239

Jira issue: PEBBLE-136

jbowens commented 1 year ago

Do we have a mechanism of ensuring panics are NOT recovered? I know in many instances we'll recover panics and, for example, report to Sentry. In most of these instances, we do want Cockroach to process exit because it's fundamentally unsafe to continue running.

srosenberg commented 1 year ago

Do we have a mechanism of ensuring panics are NOT recovered?

Not to my knowledge. Calls to pebble (from crdb) are typically executed via Stopper.RunXXX which ensures any panic is reported (logcrash.ReportPanicWithGlobalSettings) first [1], and then propagated all the way to start.go which will handle panic in RecoverAndReportPanic [2], eventually leaving it to the runtime to crash the process. E.g., the stacktrace in [3] has pebble's SetMinVersion invoked from grpc's interceptor, using Stopper.RunTaskWithErr.

I know in many instances we'll recover panics and, for example, report to Sentry.

Could you give a couple of examples. In the majority of cases that I've seen, panics from pebble will not be recovered.

In most of these instances, we do want Cockroach to process exit because it's fundamentally unsafe to continue running.

Right, this is the case where we absolutely must crash the process at this moment in order to preserve data integrity. What about runtime's throw? I see it's already used by pebble [4].

[1] https://github.com/cockroachdb/cockroach/blob/02ff871a4b4e16f7082298d7e098bfed70f528a9/pkg/util/stop/stopper.go#L338 [2] https://github.com/cockroachdb/cockroach/blob/02ff871a4b4e16f7082298d7e098bfed70f528a9/pkg/cli/start.go#L658 [3] https://github.com/cockroachdb/pebble/issues/2019#issue-1416668702 [4] https://github.com/cockroachdb/pebble/blob/b53472d0ec8663867ebb480ec64214ecf15e5f86/internal/manual/manual.go#L45

nicktrav commented 1 year ago

Thoughts from Jackson - we channel all of these "assertion" exits into a common codepath that does the exit. That way we have these in the one place.

And then we add a linter over top that ensures that we don't call exit other than this new codepath.