cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.11k stars 3.81k forks source link

testing: runtime assertions #94986

Open tbg opened 1 year ago

tbg commented 1 year ago

Describe the problem

We don't have great guidelines/infra for runtime assertions. Some of us have, in previous jobs, worked on products that relied heavily on runtime assertions (that would be compiled out in the production builds) and have found this very useful.

Currently, there are a few issues:

  1. there isn't a great way to even write assertions succinctly. A require-style API would be great, but instead you have to do something like
if buildutils.CrdbTestBuild {
  if fooCondition != barCondition {
    log.Fatalf(ctx, "%s", errors.AssertionFailedf("Halp! fooCondition was %v!", barCondition)
  }
}

That's pretty noisy and tends to pollute the code, and also it's just long so people don't type it when they should.

  1. they're not enabled during nightly roachtests. This is because we only have this one class of assertions, and lots of very slow stuff might be gated behind it. We don't have a good separation between "cheap" and "expensive" assertions, and we don't have a clear distinction between "performance" and "correctness" roachtests at the moment.

https://github.com/cockroachdb/cockroach/issues/94979 has some links to ideas.

Jira issue: CRDB-23260

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/test-eng

erikgrinaker commented 1 year ago

There's been some prior internal discussion on this here: https://cockroachlabs.slack.com/archives/C023S0V4YEB/p1658242296301019.

srosenberg commented 1 year ago

For 1), we do need to determine a require-style API that just works. Do you think we could just rely on Fatalf everywhere? I might be paranoid, but I'd like to be certain that an assert will do the right thing and terminate the process. Go's runtime.fatal seems ideal for that. E.g.,

cat assert.go
package main

import (
    _ "unsafe"
)

//go:linkname unrecoverableFatal runtime.fatal
func unrecoverableFatal(s string)

func assert(cond bool, msg string) {
    if !cond {
        unrecoverableFatal(msg)
    }
}

func main() {
    assert(true == false, "just maybe")
}
GOTRACEBACK=crash go run assert.go 2>&1|head -5
fatal error: just maybe

goroutine 1 [running]:
runtime.fatal({0x102fb52f5?, 0x140000021a0?})
    /usr/local/go/src/runtime/panic.go:1066 +0x40 fp=0x1400006ef50 sp=0x1400006ef20 pc=0x102f8bf10
srosenberg commented 1 year ago

For 2), we already have the instrumented binary as one of the roachtest artifacts [1]. (Its path is written into testImpl.cockroachShort.) We just need a clean way to opt in (or out), ideally via TestSpec.

There is also [2] which is related to 2).

[1] https://github.com/cockroachdb/cockroach/blob/a3d53668ff52c2306f37fcad8e9212ff665ca115/build/teamcity/cockroach/nightlies/roachtest_nightly_impl.sh#L27 [2] https://github.com/cockroachdb/cockroach/issues/86678

tbg commented 1 year ago

Do you think we could just rely on Fatalf everywhere?

I think so, since that also does redaction, sentry reporting, etc, which we may want if we deploy these builds to long-lived clusters and perhaps even at customer testing sites (who knows).

Fatalf already attempts to make very sure it terminates the process even if disks stall, etc. I think there are some holes there but they can be fixed.

tbg commented 1 year ago

Here's something that I think I for one would be happy using: https://github.com/cockroachdb/cockroach/pull/95727

(Not planning to push this over the finish line but maybe someone will? And if nothing else it's food for thought).

arulajmani commented 1 year ago

I'm probably missing something, but why do we not like logcrash.ReportOrPanic for doing runtime assertions?

erikgrinaker commented 1 year ago

I think it mostly does what we want. I'm not sure if it's enabled in roachtests/roachprod and for local dev builds, we should make sure it is.

My main gripe with it is that I want to add assertions in performance-critical code paths, where the assertion check itself is too expensive to enable in production. For example in the MVCC iterator hot path. So we need a mechanism that will entirely compile out the assertions in production builds without any performance impact. We'll likely want three different variants:

  1. Report in production, crash in dev.
  2. Don't check in production, crash in dev.
  3. Always crash.

Also, we should have some syntactic sugar for it that's terser and easier to use. Maybe something like:

crash.If(1 != 1, "uh oh")
crash.Assert(1 == 1, "uh oh")
crash.AssertExpensive(computeMVCCStats(db) == ms, "recomputed stats differ from in-memory stats")
crash.AssertAlways(1 == 1, "uh oh")
crash.Now("uh oh")
srosenberg commented 1 year ago

I wonder if we could piggyback off of log.V and the whole vmodule thing? It might be nice to dynamically and selectively turn on some assertions. I realize this might be at odds with the (compiler) optimizer, e.g., large assertion may disable inlining or other optimizations... but, is it that different from code paths which rely on conditional, verbose logging?

srosenberg commented 1 year ago

On second thought, maybe we're overthinking it? Has everyone seen [1]? Rather than cook up fancy interfaces, why don't we go back to basics? E.g., does one really need to pass an error message into assert? Why not put a comment right above it; as long as the line number is captured in the stack trace, you have the rest of the context in the source code; e.g., https://github.com/sqlite/sqlite/blob/41ce47c4f4bcae3882fdccec18a6100a85f4bba5/src/btmutex.c#L71-L97

[1] https://www.sqlite.org/assert.html

erikgrinaker commented 1 year ago

Rather than cook up fancy interfaces, why don't we go back to basics? E.g., does one really need to pass an error message into assert?

Yes, we need a message. For assertions enabled in production builds, cluster operators need a human-readable message if they hit it, and we also want them in Sentry reports. They'll frequently also contain useful context via fmt args, e.g. the specific values that violate the assertion.

I wonder if we could piggyback off of log.V and the whole vmodule thing? It might be nice to dynamically and selectively turn on some assertions.

We often turn on vmodule for debugging production incidents. I wouldn't want that to also arm a bunch of fatal assertions.

I don't think any of this is particularly hard, someone just needs to set aside a day or two and get it done. I was hoping to do something over breather week, but got bogged down in GA blockers and test flake.

srosenberg commented 1 year ago

Yes, we need a message. For assertions enabled in production builds...

Is there a consensus that we'd want to enable assertions in production builds? I am open to both approaches, but note that having them in production builds goes against the conventional wisdom.

erikgrinaker commented 1 year ago

I think there are some assertions where we know the node is broken and we want to kill it to prevent further damage to the node/app/cluster. Stuff like SST checksum mismatches or consistency checker failures that would serve incorrect data to clients. Unclear if we'd consider these assertions as such though, but probably if we want to send them via Sentry (which I think we do).

erikgrinaker commented 1 year ago

Also, many assertions would send a Sentry report in production builds, but not fatal the node. We'd want messages in those Sentry reports.

erikgrinaker commented 1 year ago

I'm gonna pick this up as a Friday project.

erikgrinaker commented 1 year ago

Submitted a runtime assertions API in #106508.

erikgrinaker commented 1 year ago

I'm reverting must due to performance concerns in https://github.com/cockroachdb/cockroach/pull/108272, will revisit a different API.

erikgrinaker commented 1 year ago

Let's try this again: #111991