cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: kv/contention/nodes=4 failed #39262

Closed: cockroach-teamcity closed this issue 5 years ago

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/175c5ada040fd0cbbf178636b1c551d5c2229ec4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1417597&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190802-1417597/kv/contention/nodes=4/run_1
    cluster.go:1726,kv.go:197,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564726582-33-n5cpu4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
        stderr:

        stdout:
            47s        0          553.3          515.9     11.0   1040.2   2281.7  15569.3 write
             48s        0          203.0          509.4     10.5    637.5   2550.1   4831.8 write
             49s        0          319.9          505.5      8.4   1409.3   4295.0  24696.1 write
             50s        0           64.0          496.7      7.6   1946.2   2281.7  16643.0 write
             51s        0          131.0          489.5     10.5   2080.4   3221.2  15569.3 write
             52s        0          265.0          485.2      7.1   2147.5   7247.8  13421.8 write
             53s        0          127.0          478.4      6.6   3355.4   3355.4   8053.1 write
             54s        0          121.0          471.8     12.1   3489.7  20401.1  24696.1 write
             55s        0          787.7          477.6     15.7   2013.3   5368.7  21474.8 write
             56s        0          363.9          475.5     11.0    604.0   3087.0  36507.2 write
        Error: ERROR: result is ambiguous (error=unable to dial n4: breaker open [exhausted]) (SQLSTATE 40003)
        Error:  exit status 1
        : exit status 1
    cluster.go:2090,kv.go:207,test_runner.go:691: unexpected node event: 4: dead
nvanbenschoten commented 5 years ago
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xd9d4e2]

goroutine 173295 [running]:
panic(0x38b2420, 0x6798bc0)
    /usr/local/go/src/runtime/panic.go:565 +0x2c5 fp=0xc003e53d18 sp=0xc003e53c88 pc=0x787af5
runtime.panicmem(...)
    /usr/local/go/src/runtime/panic.go:82
runtime.sigpanic()
    /usr/local/go/src/runtime/signal_unix.go:390 +0x411 fp=0xc003e53d48 sp=0xc003e53d18 pc=0x79d651
github.com/cockroachdb/cockroach/vendor/golang.org/x/net/trace.(*traceSet).Remove(0x0, 0xc0074e5380)
    /go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/net/trace/trace.go:478 +0x22 fp=0xc003e53d70 sp=0xc003e53d48 pc=0xd9d4e2
github.com/cockroachdb/cockroach/vendor/golang.org/x/net/trace.(*trace).Finish(0xc0074e5380)
    /go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/net/trace/trace.go:412 +0x137 fp=0xc003e53e40 sp=0xc003e53d70 pc=0xd9d097
github.com/cockroachdb/cockroach/pkg/util/tracing.(*span).FinishWithOptions(0xc003869ce0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/tracer_span.go:316 +0x131 fp=0xc003e53ea8 sp=0xc003e53e40 pc=0xecdf91
github.com/cockroachdb/cockroach/pkg/util/tracing.(*span).Finish(0xc003869ce0)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/tracer_span.go:300 +0x67 fp=0xc003e53f08 sp=0xc003e53ea8 pc=0xecde47
github.com/cockroachdb/cockroach/pkg/util/tracing.FinishSpan(0x45de8e0, 0xc003869ce0)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/tracer.go:625 +0x40 fp=0xc003e53f20 sp=0xc003e53f08 pc=0xecc410
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*samplerProcessor).Run(0xc000674a80, 0x458cfe0, 0xc007fa6440)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sampler.go:186 +0x13e fp=0xc003e53fa0 sp=0xc003e53f20 pc=0x25ca07e
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).startInternal.func1(0xc00b40ed20, 0xc00387ac60, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:563 +0x68 fp=0xc003e53fc8 sp=0xc003e53fa0 pc=0x25f5708
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc003e53fd0 sp=0xc003e53fc8 pc=0x7b7891
created by github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).startInternal
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:562 +0x35e

We see a crash due to a use-after-finish in a samplerProcessor. We saw this twice yesterday in #39239 and in #39190. We had hoped it was fixed by #39203, but it doesn't look like it. Luckily, it looks easy to reproduce.

nvanbenschoten commented 5 years ago

I can reproduce this fairly easily by running the roachtest with a cockroach binary built from master. However, I'm having a little trouble reproducing it with DebugUseAfterFinish set to true, which would help track down where the use-after-finish is coming from.
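
(For reference, DebugUseAfterFinish is a package-level variable in golang.org/x/net/trace. A minimal sketch of flipping it on, with the placement purely illustrative and assuming it runs before any traces are created:)

package main

import "golang.org/x/net/trace"

func main() {
	// Per the package docs, this records the Finish call stack so that
	// later uses of a finished trace can be attributed; it also slows
	// the program down, so it is for debugging only.
	trace.DebugUseAfterFinish = true

	// ... start the node / run the workload under test ...
}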

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/3b9a95bd7eb2cfa6d544fe7217852a85ec3b76f4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1422703&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190805-1422703/kv/contention/nodes=4/run_1
    cluster.go:1726,kv.go:197,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564984076-36-n5cpu4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
        stderr:

        stdout:
         write
             54s        0            0.0          287.2      0.0      0.0      0.0      0.0 write
             55s        0          128.0          284.3      9.4   7516.2   8321.5  34359.7 write
             56s        0            0.0          279.2      0.0      0.0      0.0      0.0 write
             57s        0          100.0          276.1     11.0   9663.7  10737.4  10737.4 write
             58s        0          288.0          276.3     10.0   4563.4  10737.4  32212.3 write
             59s        0          140.0          274.0      8.1   1744.8  12348.0  26843.5 write
            1m0s        0          116.0          271.4      7.6   6710.9   6710.9  24696.1 write
        _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
            1m1s        0          128.0          269.0      9.4   5905.6   9663.7  51539.6 write
            1m2s        0          116.0          266.5      8.9   4563.4   8589.9  47244.6 write
            1m3s        0           60.0          263.3      9.4   5637.1  10200.5  40802.2 write
        Error: EOF
        Error:  exit status 1
        : exit status 1
    cluster.go:2090,kv.go:207,test_runner.go:691: unexpected node event: 2: dead
tbg commented 5 years ago

@nvanbenschoten are you still running this? If not, happy to take over.

nvanbenschoten commented 5 years ago

I was looking at it last Friday but haven't picked it back up yet this week after failing to track it down. I'm happy to hand it off if you're interested! It's very easy to hit if you run this roachtest.

tbg commented 5 years ago

Ok, it's mine now. Stay tuned.

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/3db89b230b0c41e399354cbeb78c1e82c8e30004

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1424320&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190806-1424320/kv/contention/nodes=4/run_1
    cluster.go:1726,kv.go:197,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1565072334-33-n5cpu4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
        stderr:

        stdout:

             47s        0          588.4          360.4     16.8   1275.1   2080.4  38654.7 write
             48s        0          105.0          355.1      8.9    805.3   1275.1   2550.1 write
             49s        0          539.9          358.8     14.7   2147.5   3355.4  18253.6 write
             50s        0           84.0          353.3     10.0   2281.7   2281.7   3355.4 write
             51s        0          144.0          349.2      7.6   3355.4  10737.4  13421.8 write
             52s        0          344.0          349.1      9.4   2684.4   6174.0  18253.6 write
             53s        0          503.9          352.1      9.4   1342.2   4026.5  28991.0 write
             54s        0           32.0          346.1      6.0   1208.0  42949.7  42949.7 write
             55s        0            0.0          339.8      0.0      0.0      0.0      0.0 write
             56s        0           12.0          334.0     13.1   6174.0  16643.0  16643.0 write
             57s        0           88.0          329.7      7.6   5100.3   7784.6  10737.4 write
        Error: EOF
        Error:  exit status 1
        : exit status 1
    cluster.go:2090,kv.go:207,test_runner.go:691: unexpected node event: 4: dead
tbg commented 5 years ago

Looking at this now. The panic usually occurs very soon after the test starts, so it really is quite easy to repro (haven't tried DebugUseAfterFinish yet).

I'm afraid that I'm missing something, though. The stack traces I see always panic on the deferred call to Finish

https://github.com/cockroachdb/vendored/blob/740197b6070d682e67078b4306761b8bc4ff862c/github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc/client.go#L55

and the panic is in net/trace's traceSet.Remove; the traceSet looks to be nil here because the panic comes from trying to lock the traceSet's mutex in m.Remove():

https://github.com/cockroachdb/vendored/blob/740197b6070d682e67078b4306761b8bc4ff862c/golang.org/x/net/trace/trace.go#L409-L412

This indicates that the family we're looking up simply isn't populated in activeTraces. Long story short, I think it's more likely that we're double-finishing the trace. The first time we do that, we reset the Family to the empty string:

https://github.com/cockroachdb/vendored/blob/master/golang.org/x/net/trace/trace.go#L741

so that the second call to Finish would run into the problem above.
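
To make the failure mode concrete, here is a heavily simplified sketch of the internals described above (stand-in types that only loosely mirror golang.org/x/net/trace): the first Finish removes the trace from its family's set and resets the Family, so a second Finish looks up an unknown family, gets a nil *traceSet, and panics on the mutex lock, just like the stack at the top of this issue. Running this sketch panics on the second Finish.

package main

import "sync"

// Simplified stand-ins for the x/net/trace internals; not the real code.
type trace struct{ Family string }

type traceSet struct {
	mu sync.Mutex
	m  map[*trace]bool
}

func (ts *traceSet) Remove(tr *trace) {
	ts.mu.Lock() // nil pointer dereference when ts == nil
	delete(ts.m, tr)
	ts.mu.Unlock()
}

// activeTraces maps a family name to its set of live traces.
var activeTraces = map[string]*traceSet{}

func (tr *trace) Finish() {
	activeTraces[tr.Family].Remove(tr) // nil *traceSet if the family is unknown
	tr.Family = ""                     // reset() clears the family for reuse
}

func main() {
	tr := &trace{Family: "sql.distsql"}
	activeTraces[tr.Family] = &traceSet{m: map[*trace]bool{tr: true}}

	tr.Finish() // ok: removes tr from its family's set, then resets Family
	tr.Finish() // activeTraces[""] == nil -> panic in Remove, as in the crash above
}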

A genuine use-after-finish would be expected to panic when the finished trace is used to log messages into. This also explains why @nvanbenschoten wasn't able to repro with DebugUseAfterFinish: with that flag set, we never reset the trace (note how we don't call reset() below):

https://github.com/cockroachdb/vendored/blob/master/golang.org/x/net/trace/trace.go#L924

With that in mind, I think I can just instrument net/trace temporarily to retain the stack of the first finish and tell us about it when we crash.
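
For the record, the instrumentation described here could look roughly like the following (a sketch only; doubleFinishGuard and onFinish are made-up names, and the actual change lives in a pinned cockroachdb/vendored branch): remember the stack of the first Finish and report both stacks if a second one arrives.

package trace

import (
	"fmt"
	"runtime/debug"
)

// doubleFinishGuard is a hypothetical helper: embed one in a trace and call
// onFinish at the top of Finish() to catch and attribute double-finishes.
type doubleFinishGuard struct {
	firstFinish []byte // stack captured at the first Finish; nil until then
}

func (g *doubleFinishGuard) onFinish() {
	if g.firstFinish == nil {
		g.firstFinish = debug.Stack()
		return
	}
	panic(fmt.Sprintf("trace finished twice\nfirst Finish at:\n%s\nsecond Finish at:\n%s",
		g.firstFinish, debug.Stack()))
}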

tbg commented 5 years ago

Well, that's disappointing. I made the change so that DebugUseAfterFinish would still call tr.reset() and added a nice error message to be printed prior to the crash. Guess what: it's not crashing any more. I sure hope the free list isn't important here.

tbg commented 5 years ago

^- I wasn't getting crashes because my vendor code wasn't picked up (https://github.com/cockroachdb/cockroach/issues/39392). Now it's crashing 100% of the time, alerting me that Finish is being called more than once, always from the SQL subsystem.

A net/trace span's Finish() method is definitely not idempotent (and can't be made idempotent since the trace goes right back on the free list), so it's a problem if we're calling it more than once.

Here's a condensed dump of the calls to Finish: https://gist.github.com/tbg/0fe09c77f0a39c07df445600867b0ef5

The TL;DR is that the first call is always distsqlrun.(*ProcessorBase).InternalClose on top of a *sampleAggregator or *samplerProcessor, for example

[...]
github.com/cockroachdb/cockroach/pkg/util/tracing.FinishSpan(...)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/tracer.go:625
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).InternalClose(0xc003e92a80, 0x2581b36)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:891 +0x140
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*samplerProcessor).close(0xc003e92a80)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sampler.go:364 +0x31
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.newSamplerProcessor.func1(0x4586000, 0xc00773a330, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sampler.go:157 +0x2a
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).moveToTrailingMeta(0xc003e92a80)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:755 +0x136
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).MoveToDraining(0xc003e92a80, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:657 +0x179
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*samplerProcessor).Run(0xc003e92a80, 0x4585f40, 0xc008c9ea40)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sampler.go:185 +0x13d
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).startInternal.func1(0xc00327ed20, 0xc0087de800, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:563 +0x68
created by github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).startInternal
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:562 +0x35e

The second call is very often

github.com/cockroachdb/cockroach/pkg/util/tracing.FinishSpan(0x45d7760, 0xc007eaf500)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/tracer.go:625 +0x40
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*samplerProcessor).Run(0xc000fc1500, 0x4585f40, 0xc0097d63c0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sampler.go:186 +0x13e

though there's also

github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*sampleAggregator).Run(0xc000580000, 0x4586000, 0xc003980780)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sample_aggregator.go:149 +0x13e
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).Run(0xc00825ed20, 0x4586000, 0xc003980780, 0x3e5b2a8, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:623 +0x1f2
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run(0xc0001bd8c0, 0xc006dc5020, 0xc003762bd0, 0xc0038633b8, 0xc0073c0000, 0xc007ff9bd8, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:324 +0x368
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).planAndRunCreateStats(0xc0001bd8c0, 0x4586000, 0xc007d16600, 0xc007ff9bd8, 0xc006dc5020, 0xc003762bd0, 0xc00667e1c0, 0xc009996140, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_plan_stats.go:254 +0x26d

The first call (from sampler.go:157) is new via 63765fc0fb36e6464147336c64d1b7ce24174afb on July 28th, so chances are the problem was introduced there, but possibly it's not new at all.

I am noticing that InternalClose is careful to avoid double-finishing the span:

https://github.com/cockroachdb/cockroach/blob/0bc68c191c7d63ca60ca4e9d84af4ad0e67f1125/pkg/sql/distsqlrun/processors.go#L881-L899

but the call sites responsible for the second Finish aren't as careful:

https://github.com/cockroachdb/cockroach/blob/63765fc0fb36e6464147336c64d1b7ce24174afb/pkg/sql/distsqlrun/sampler.go#L175

^- this one binds s.span early, so even if something sets it to nil, the defer will still use it

https://github.com/cockroachdb/cockroach/blob/91f48bdaecf8940ef6449b36c41ae3ede6a483c2/pkg/sql/distsqlrun/sample_aggregator.go#L138

^- ditto

It's quite possible that there is a systematic problem across the distsql code here that just hasn't cropped up so far.
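
As a sketch of the pitfall and a defensive alternative (hypothetical names; the real code goes through ProcessorBase, tracing.FinishSpan, and opentracing spans): deferring FinishSpan(p.span) evaluates p.span at the defer statement, so InternalClose setting the field to nil later cannot stop the deferred call, whereas deferring a closure that re-reads the field can.

package main

type span struct{ finished bool }

func (s *span) Finish() {
	if s.finished {
		panic("span finished twice") // the double-finish seen in this issue
	}
	s.finished = true
}

// processor stands in for a distsqlrun processor that owns a trace span.
type processor struct{ span *span }

// internalClose mirrors the guarded pattern in ProcessorBase.InternalClose:
// finish the span at most once, then clear the field.
func (p *processor) internalClose() {
	if p.span != nil {
		p.span.Finish()
		p.span = nil
	}
}

func run(p *processor) {
	// Buggy pattern: the argument is evaluated now, so the original pointer
	// is captured even if internalClose later sets p.span to nil:
	//   defer FinishSpan(p.span)
	//
	// Defensive pattern: re-read the field when the deferred func runs.
	defer func() {
		if p.span != nil {
			p.span.Finish()
			p.span = nil
		}
	}()

	p.internalClose() // reached via MoveToDraining -> moveToTrailingMeta in the real code
}

func main() {
	run(&processor{span: &span{}})
}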

My theory for why this has gone unnoticed for a while is that the problem needs net/trace enabled, which is specific to the contention test:

https://github.com/cockroachdb/cockroach/blob/5082052d02a5ab5f17a8f253f502ea2be52405fa/pkg/cmd/roachtest/kv.go#L174

and even then, traces aren't always aggressively recycled. When net/trace holds on to a trace span for display in the UI, it obviously can't be reused. Additionally, the free list has a bounded size, so we don't actually reset all spans. Plus, calling Finish() twice may appear to work (not correctly, but at least without crashing) if we happen to create the empty-string family before hitting the m.Lock() on a nil m. That is, most of the time we probably reuse a reset trace for a new trace (before the old incarnation finishes the second time), which initializes the empty family and prevents at least the specific crash we see here.

All in all, this looks like a SQL (distsql) bug to me now, so I hope @jordanlewis can take a look and route accordingly. To repro, you just make a branch and pin vendored to https://github.com/cockroachdb/vendored/tree/delme-debug-tbg, and then run

git submodule update
./build/builder.sh mkrelease
./bin/roachtest run --count 5 --cockroach cockroach-linux-2.6.32-gnu-amd64 kv/contention/nodes=4

which should fail 100% of the time with total test durations of ~150s.

nvanbenschoten commented 5 years ago

Very nice debugging!

> My theory for why this has gone unnoticed for a while is that the problem needs net/trace enabled, which is specific to the contention test

This was the missing piece that we were looking for. The behavior of the failure makes a lot more sense now.

Do you have suggestions for ways that we could make this kind of bug louder in the future? Should we enable DebugUseAfterFinish in race builds and upstream your patch in https://github.com/cockroachdb/vendored/commit/d545b62e79e32303d8b2e261a146bd54c5e3a1e8?
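
One hedged way to do the race-build part (a sketch only; the package and placement are made up) is an init gated on the race build tag, which the go tool sets for -race builds:

//go:build race

package util // hypothetical placement; any package linked into the server would do

import "golang.org/x/net/trace"

func init() {
	// Only in -race builds: record the finishing stack and keep finished
	// traces around so double-finishes and use-after-finish are loud.
	trace.DebugUseAfterFinish = true
}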

jordanlewis commented 5 years ago

I haven't taken the time to understand this yet, but the two concrete issues you identify live in the sampler processors, so @rytaft would you be able to take this on for those two?

rytaft commented 5 years ago

Taking a look -- thanks!