cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

build: allow for building with asan / msan #97246

Open nicktrav opened 1 year ago

nicktrav commented 1 year ago

Is your feature request related to a problem? Please describe.

Memory and address sanitization are useful for identifying subtle issues that manifest only at runtime. The Go toolchain supports both via the -msan and -asan build flags.

Describe the solution you'd like

Consider adding support for building Cockroach binaries with these features enabled. Supporting this will require some changes to the build system.

Given there is a performance penalty for running with these runtime features enabled, production binaries should not be built with -msan / -asan. Instead, we could consider offering a debug binary and container image. Short of this, we already have the ability to produce one-off builds via TeamCity. The existing pipelines should have some way of enabling ASAN / MSAN in the resultant artifacts.

Additional context

More context can be found in this issue (internal). Address and memory sanitization could possibly have helped there to narrow down the cause of some memory corruption encountered in the wild.

Jira issue: CRDB-24594

srosenberg commented 1 year ago

So far my initial attempt to brute force it via mkrelease amd64-linux-gnu has run into a wall.

[1] https://github.com/golang/go/blob/9f834a559c9ed6cdf883e29b36e21e5f956df74f/src/cmd/go/internal/work/init.go#L439

srosenberg commented 1 year ago

Not sure why overflow tracking of globals has to be disabled [1], but the binary does appear to work with it disabled:

ASAN_OPTIONS=report_globals=0 ./cockroach-linux-2.6.32-gnu-amd64 version
Build Tag:        v22.2.3-dirty
Build Time:       2023/02/17 04:42:08
Distribution:     CCL
Platform:         linux amd64 (x86_64-unknown-linux-gnu)
Go Version:       go1.19.1
C Compiler:       gcc 9.4.0
Build Commit ID:  81a114c2bc2ef8ef76fe3809d0469319c7e82635
Build Type:       release

My primitive efforts to cause memory corruption via strace:

strace -f -e inject=mmap:retval=232849384973 -p $pid

confirm that ASan is running, although in this case it offers less insight than the runtime :)

I230217 06:27:56.124414 1 util/log/file_sync_buffer.go:238 ⋮ [config]   file created at: 2023/02/17 06:27:56
I230217 06:27:56.124448 1 util/log/file_sync_buffer.go:238 ⋮ [config]   running on machine: ‹gceworker-srosenberg-2›
I230217 06:27:56.124472 1 util/log/file_sync_buffer.go:238 ⋮ [config]   binary: CockroachDB CCL v22.2.3-dirty (x86_64-unknown-linux-gnu, built 2023/02/17 04:42:08, go1.19.1)
I230217 06:27:56.124495 1 util/log/file_sync_buffer.go:238 ⋮ [config]   arguments: [‹./cockroach-linux-2.6.32-gnu-amd64› ‹start-single-node› ‹--insecure›]
I230217 06:27:56.124528 1 util/log/file_sync_buffer.go:238 ⋮ [config]   log format (utf8=✓): crdb-v2
I230217 06:27:56.124543 1 util/log/file_sync_buffer.go:238 ⋮ [config]   line format: [IWEF]yymmdd hh:mm:ss.uuuuuu goid [chan@]file:line redactionmark \[tags\] [counter] msg
I230217 06:27:56.124179 1 util/log/flags.go:211  [-] 1  stderr capture started
==4042800==AddressSanitizer CHECK failed: ../../../../src/libsanitizer/sanitizer_common/sanitizer_allocator_primary64.h:636 "((beg)) == ((mapped))" (0x623000030000, 0x3636e7a60d)
AddressSanitizer:DEADLYSIGNAL
=================================================================
runtime: mmap(0x10c005c00000, 4194304) returned 0x3636e7a60d, 0
fatal error: runtime: cannot map pages in arena address space

runtime stack:
runtime.throw({0x8637c22?, 0x2430?})
        /usr/local/go/src/runtime/panic.go:1047 +0x5d fp=0x7fe319a44858 sp=0x7fe319a44828 pc=0x48dadd
runtime.sysMapOS(0x10c005c00000, 0x400000?)
        /usr/local/go/src/runtime/mem_linux.go:191 +0x10a fp=0x7fe319a448a0 sp=0x7fe319a44858 pc=0x46c38a
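
The address in the failure output can be tied back to the injected value: the decimal `retval` given to strace, formatted as hex, is the `0x3636e7a60d` that both ASan and the Go runtime report as the result of mmap. A minimal sketch of that check follows; the helper name `toHex` is made up for illustration.

```go
package main

import "fmt"

// toHex formats an address-sized value in the 0x... form used by the
// Go runtime and ASan messages above. (Hypothetical helper name.)
func toHex(v uint64) string {
	return fmt.Sprintf("0x%x", v)
}

func main() {
	// Value passed to strace as retval= in the command above.
	const injected uint64 = 232849384973
	fmt.Println(toHex(injected)) // compare against the address in the log
}
```

Since the value is not page-aligned, it could never be a genuine mmap result, which makes it easy to spot in logs.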

Note that the following technique is actually fairly effective.

Start single-node

ASAN_OPTIONS=report_globals=0 ./cockroach-linux-2.6.32-gnu-amd64 start-single-node --insecure

Find some memory in the address space

pmap `pgrep cockroach` |grep rw

E.g., we could pick anon which is likely to be zeroed (i.e., high probability of nil-dereference) or a page that belongs to libnss_files-2.31.

00007f04894fa000   8448K rw---   [ anon ]
00007f0489d47000      4K rw--- libnss_files-2.31.so
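
pmap prints start addresses in hex, while the strace commands in this thread pass `retval` as a decimal number. The sketch below parses lines in the format shown above and prints each start address in decimal. The `region` type and `parsePmapLine` are hypothetical names, and the parser assumes exactly the column layout of the two sample lines.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// region is one line of pmap output as shown above.
// (Hypothetical type, for illustration only.)
type region struct {
	Start   uint64 // start address of the mapping
	SizeKiB uint64 // size column with the trailing "K" removed
	Perms   string // e.g. "rw---"
	Name    string // e.g. "[ anon ]" or a library name
}

// parsePmapLine assumes the layout "<hex addr> <size>K <perms> <name...>".
func parsePmapLine(line string) (region, error) {
	f := strings.Fields(line)
	if len(f) < 4 {
		return region{}, fmt.Errorf("unexpected pmap line: %q", line)
	}
	start, err := strconv.ParseUint(f[0], 16, 64)
	if err != nil {
		return region{}, fmt.Errorf("bad address %q: %w", f[0], err)
	}
	size, err := strconv.ParseUint(strings.TrimSuffix(f[1], "K"), 10, 64)
	if err != nil {
		return region{}, fmt.Errorf("bad size %q: %w", f[1], err)
	}
	// The name may contain spaces ("[ anon ]"), so rejoin the remaining fields.
	return region{start, size, f[2], strings.Join(f[3:], " ")}, nil
}

func main() {
	lines := []string{
		"00007f04894fa000   8448K rw---   [ anon ]",
		"00007f0489d47000      4K rw--- libnss_files-2.31.so",
	}
	for _, l := range lines {
		r, err := parsePmapLine(l)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		// Print the start address in decimal, the form used for retval= here.
		fmt.Printf("%-22s perms=%s start=%d\n", r.Name, r.Perms, r.Start)
	}
}
```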

Run TPCC workload

It will result in calls to mmap.

./cockroach-linux-2.6.32-gnu-amd64 workload init tpcc

Inject failure

Tampers with the first call to mmap and every 10th call after it (calls 1, 11, 21, ...), returning the memory address we picked above instead of the real result.

strace -f -e inject=mmap:retval=139657314369536:when=1+10 -p $pid
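
To make the `when=1+10` schedule concrete, the sketch below models which mmap invocations are tampered with under a `when=first+step` expression. It is only an illustration of the counting rule, not strace's implementation, and the function name `injected` is made up.

```go
package main

import "fmt"

// injected reports whether the n-th matching syscall (counting from 1)
// is tampered with under when=first+step, assuming step > 0.
// (Hypothetical helper modelling the schedule only.)
func injected(n, first, step int) bool {
	if n < first || step <= 0 {
		return false
	}
	return (n-first)%step == 0
}

func main() {
	// when=1+10 from the command above: first=1, step=10.
	for n := 1; n <= 30; n++ {
		if injected(n, 1, 10) {
			fmt.Println("mmap call", n, "returns the injected address")
		}
	}
}
```

With one call in ten failing, the process keeps running long enough for the TPCC workload to exercise both the Go runtime and the ASan allocator.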

The above will trigger runtime failures as well as ASan failures. E.g.,

==44210==AddressSanitizer CHECK failed: ../../../../src/libsanitizer/sanitizer_common/sanitizer_allocator_primary64.h:636 "((beg)) == ((mapped))" (0x62d0019a0000, 0x7f68426d5008)
==44210==ERROR: AddressSanitizer failed to deallocate 0x1000 (4096) bytes at address 0x7f68426d5008
==44210==AddressSanitizer CHECK failed: ../../../../src/libsanitizer/sanitizer_common/sanitizer_posix.cc:60 "(("unable to unmap" && 0)) != (0)" (0x0, 0x0)

[1] https://github.com/google/sanitizers/wiki/AddressSanitizerFlags

srosenberg commented 1 year ago

I did manage to build a docker container with the ASan instrumented binary. However, I took several shortcuts,