Open davepacheco opened 2 years ago
Hello, I am Blathers. I am here to help you get the issue triaged.
Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.
I have CC'd a few people who may be able to assist you:
If we have not gotten back to your issue within a few business days, you can try the following:
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
cc @cockroachdb/replication
That last part of your comment, about what the go GC might or might not think of uninitialized memory we allocate via cgo, looks like a technical problem where @petermattis would be the best expert.
I dug a little further and found that the manual allocation path appears to work around this problem: https://github.com/cockroachdb/pebble/blob/936e011bb911a82b8711196b7c042c7d57b7c59f/internal/manual/manual.go#L28-L40
I'm still not sure the Pebble code is really safe? The Go guidelines don't say that the only way this can fail is if there's no other reference to the object then it might get collected. It just says "Go code may not store a Go pointer in C memory...programs that break these rules are likely to fail in unexpected and unpredictable ways."
These threads make me wonder what might happen if you're running in a configuration with non-Go-managed memory for the cache, and the memory is uninitialized, and it happens to contain Go pointers. Might the Go GC incorrectly follow these pointers? Could that not cause all kinds of problems, including cases where GC finds an apparently-referenced object that's actually free? (context: I'm seeing a bunch of crashes from the Go runtime memory allocator that say exactly that.)
I'm pretty sure that the Go GC never steps through non-Go memory. That is, it only traces memory references through the Go heap, Go globals, and Go stacks. It is true that Pebble violates the cgo rules about storing a Go pointer in C allocated memory. This is discussed at https://github.com/cockroachdb/pebble/blob/936e011bb911a82b8711196b7c042c7d57b7c59f/internal/cache/entry.go#L27-L41:
// entry holds the metadata for a cache entry. The memory for an entry is
// allocated from manually managed memory.
//
// Using manual memory management for entries is technically a volation of the
// Cgo pointer rules:
//
// https://golang.org/cmd/cgo/#hdr-Passing_pointers
//
// Specifically, Go pointers should not be stored in C allocated memory. The
// reason for this rule is that the Go GC will not look at C allocated memory
// to find pointers to Go objects. If the only reference to a Go object is
// stored in C allocated memory, the object will be reclaimed. The shard field
// of the entry struct points to a Go allocated object, thus the
// violation. What makes this "safe" is that the Cache guarantees that there
// are other pointers to the shard which will keep it alive.
Due to the violation of the Cgo pointer rules, running cockroach
with GODEBUG=cgocheck=2
isn't going to work. That said, we've never seen evidence that this violation of the Cgo pointer rules actually causes problems when you're not running with cgocheck=2
. As the comment explains, this violation appears to be copacetic as long as Go does not have a moving GC.
As Peter already explained, the side-effect of enabling cgocheck=2
is a write barrier which crashes the program whenever a Go pointer is attempted to be written into non-Go memory [1].
Right. It's an unfortunate side effect of this behavior that we cannot use the cgocheck=2 diagnostic to debug other issues like #82958 because of the false positives from this code path. But it doesn't seem like there's anything we can do about that. Thanks for the update.
@davepacheco I don't know if it's worth your while, but it should be possible to patch pebble to avoid allocating manually. Basically changing Free
to be a no-op and making New
just allocate using Go around here.
I'm not sure if that might have other unintended side effects, certainly wouldn't do it in Prod but it sounded like you were mostly using CRDB for running test suites so at least for figuring out whether cgocheck passes this could be worth a try.
@tbg Pebble already does that if you disable cgo (e.g. via setting CGO_ENABLED=0
during compilation). @davepacheco If you want to see how that is done, grep for cgo
in the internal/{cache,manual,rawalloc}
directories. Note that performance is a lot worse in this mode, but it may be adequate for small testing usage.
Describe the problem
While trying to track down #82958 and a raft of similar problems, I stumbled on GODEBUG=cgocheck=2 and I found that
cockroach
seems to crash with this set. Note that while most of my issues have been on illumos, this is easily reproduced for me on Linux.I know the code thinks its behavior is safe despite violating Go's rules here (which would make this a false positive, I guess). I'm not sure that's true -- see below.
To Reproduce
Then I started the single-node cluster again with the same arguments. This time it quickly bails out:
Here's the stderr:
Here's the highlight:
Environment:
See above for AMI info -- it's Ubuntu 22.04 LTS.
Additional context
Caveat: I'm not very experienced with Go myself. I gather that this failure means Go thinks this code is violating the rules for passing pointers. I see that the Pebble code is aware of the constraints here and it's believed safe because "the Cache guarantees that there are other pointers to the shard which will keep it alive". That seems to imply that the only risk here is that references to Go objects in C memory would be missed by the GC and potentially freed while still in use. But the rules also say:
There's some more information in this thread and this issue. These threads make me wonder what might happen if you're running in a configuration with non-Go-managed memory for the cache, and the memory is uninitialized, and it happens to contain Go pointers. Might the Go GC incorrectly follow these pointers? Could that not cause all kinds of problems, including cases where GC finds an apparently-referenced object that's actually free? (context: I'm seeing a bunch of crashes from the Go runtime memory allocator that say exactly that.)
Jira issue: CRDB-21719