Open tbg opened 1 year ago
cc @cockroachdb/test-eng
There's been some prior internal discussion on this here: https://cockroachlabs.slack.com/archives/C023S0V4YEB/p1658242296301019.
For 1), we do need to determine a require-style API that just works. Do you think we could just rely on Fatalf
everywhere? I might be paranoid, but I'd like to be certain that an assert
will do the right thing and terminate the process. Go's runtime.fatal
seems ideal for that. E.g.,
cat assert.go
package main
import (
_ "unsafe"
)
//go:linkname unrecoverableFatal runtime.fatal
func unrecoverableFatal(s string)
func assert(cond bool, msg string) {
if !cond {
unrecoverableFatal(msg)
}
}
func main() {
assert(true == false, "just maybe")
}
GOTRACEBACK=crash go run assert.go 2>&1|head -5
fatal error: just maybe
goroutine 1 [running]:
runtime.fatal({0x102fb52f5?, 0x140000021a0?})
/usr/local/go/src/runtime/panic.go:1066 +0x40 fp=0x1400006ef50 sp=0x1400006ef20 pc=0x102f8bf10
For 2), we already have the instrumented binary as one of the roachtest artifacts [1]. (Its path is written into testImpl.cockroachShort
.) We just need a clean way to opt in (or out), ideally via TestSpec
.
There is also [2] which is related to 2).
[1] https://github.com/cockroachdb/cockroach/blob/a3d53668ff52c2306f37fcad8e9212ff665ca115/build/teamcity/cockroach/nightlies/roachtest_nightly_impl.sh#L27 [2] https://github.com/cockroachdb/cockroach/issues/86678
Do you think we could just rely on Fatalf everywhere?
I think so, since that also does redaction, sentry reporting, etc, which we may want if we deploy these builds to long-lived clusters and perhaps even at customer testing sites (who knows).
Fatalf
already attempts to make very sure it terminates the process even if disks stall, etc. I think there are some holes there but they can be fixed.
Here's something that I think I for one would be happy using: https://github.com/cockroachdb/cockroach/pull/95727
(Not planning to push this over the finish line but maybe someone will? And if nothing else it's food for thought).
I'm probably missing something, but why do we not like logcrash.ReportOrPanic
for doing runtime assertions?
I think it mostly does what we want. I'm not sure if it's enabled in roachtests/roachprod and for local dev builds, we should make sure it is.
My main gripe with it is that I want to add assertions in performance-critical code paths, where the assertion check itself is too expensive to enable in production. For example in the MVCC iterator hot path. So we need a mechanism that will entirely compile out the assertions in production builds without any performance impact. We'll likely want three different variants:
Also, we should have some syntactic sugar for it that's terser and easier to use. Maybe something like:
crash.If(1 != 1, "uh oh")
crash.Assert(1 == 1, "uh oh")
crash.AssertExpensive(computeMVCCStats(db) == ms, "recomputed stats differ from in-memory stats")
crash.AssertAlways(1 == 1, "uh oh")
crash.Now("uh oh")
I wonder if we could piggyback off of log.V
and the whole vmodule
thing? It might be nice to dynamically and selectively turn on some assertions. I realize this might be at odds with the (compiler) optimizer, e.g., large assertion may disable inlining or other optimizations... but, is it that different from code paths which rely on conditional, verbose logging?
On second thought, maybe we're overthinking it? Has everyone seen [1]? Rather than cook up fancy interfaces, why don't we go back to basics? E.g., does one really need to pass an error message into assert
? Why not put a comment right above it; as long as the line number is captured in the stack trace, you have the rest of the context in the source code; e.g., https://github.com/sqlite/sqlite/blob/41ce47c4f4bcae3882fdccec18a6100a85f4bba5/src/btmutex.c#L71-L97
Rather than cook up fancy interfaces, why don't we go back to basics? E.g., does one really need to pass an error message into assert?
Yes, we need a message. For assertions enabled in production builds, cluster operators need a human-readable message if they hit it, and we also want them in Sentry reports. They'll frequently also contain useful context via fmt args, e.g. the specific values that violate the assertion.
I wonder if we could piggyback off of log.V and the whole vmodule thing? It might be nice to dynamically and selectively turn on some assertions.
We often turn on vmodule for debugging production incidents. I wouldn't want that to also arm a bunch of fatal assertions.
I don't think any of this is particularly hard, someone just needs to set aside a day or two and get it done. I was hoping to do something over breather week, but got bogged down in GA blockers and test flake.
Yes, we need a message. For assertions enabled in production builds...
Is there a consensus that we'd want to enable assertions in production builds? I am open to both approaches, but note that having them in production builds goes against the conventional wisdom.
I think there are some assertions where we know the node is broken and we want to kill it to prevent further damage to the node/app/cluster. Stuff like SST checksum mismatches or consistency checker failures that would serve incorrect data to clients. Unclear if we'd consider these assertions as such though, but probably if we want to send them via Sentry (which I think we do).
Also, many assertions would send a Sentry report in production builds, but not fatal the node. We'd want messages in those Sentry reports.
I'm gonna pick this up as a Friday project.
Submitted a runtime assertions API in #106508.
I'm reverting must
due to performance concerns in https://github.com/cockroachdb/cockroach/pull/108272, will revisit a different API.
Let's try this again: #111991
Describe the problem
We don't have great guidelines/infra for runtime assertions. Some of us have, in previous jobs, worked on products that relied heavily on runtime assertions (that would be compiled out in the production builds) and have found this very useful.
Currently, there are a few issues:
require
-style API would be great, but instead you have to do something likeThat's pretty noisy and tends to pollute the code, and also it's just long so people don't type it when they should.
https://github.com/cockroachdb/cockroach/issues/94979 has some links to ideas.
Jira issue: CRDB-23260