cockroach-teamcity closed this issue 5 years ago
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xd9d4e2]
goroutine 173295 [running]:
panic(0x38b2420, 0x6798bc0)
/usr/local/go/src/runtime/panic.go:565 +0x2c5 fp=0xc003e53d18 sp=0xc003e53c88 pc=0x787af5
runtime.panicmem(...)
/usr/local/go/src/runtime/panic.go:82
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:390 +0x411 fp=0xc003e53d48 sp=0xc003e53d18 pc=0x79d651
github.com/cockroachdb/cockroach/vendor/golang.org/x/net/trace.(*traceSet).Remove(0x0, 0xc0074e5380)
/go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/net/trace/trace.go:478 +0x22 fp=0xc003e53d70 sp=0xc003e53d48 pc=0xd9d4e2
github.com/cockroachdb/cockroach/vendor/golang.org/x/net/trace.(*trace).Finish(0xc0074e5380)
/go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/net/trace/trace.go:412 +0x137 fp=0xc003e53e40 sp=0xc003e53d70 pc=0xd9d097
github.com/cockroachdb/cockroach/pkg/util/tracing.(*span).FinishWithOptions(0xc003869ce0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/tracer_span.go:316 +0x131 fp=0xc003e53ea8 sp=0xc003e53e40 pc=0xecdf91
github.com/cockroachdb/cockroach/pkg/util/tracing.(*span).Finish(0xc003869ce0)
/go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/tracer_span.go:300 +0x67 fp=0xc003e53f08 sp=0xc003e53ea8 pc=0xecde47
github.com/cockroachdb/cockroach/pkg/util/tracing.FinishSpan(0x45de8e0, 0xc003869ce0)
/go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/tracer.go:625 +0x40 fp=0xc003e53f20 sp=0xc003e53f08 pc=0xecc410
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*samplerProcessor).Run(0xc000674a80, 0x458cfe0, 0xc007fa6440)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sampler.go:186 +0x13e fp=0xc003e53fa0 sp=0xc003e53f20 pc=0x25ca07e
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).startInternal.func1(0xc00b40ed20, 0xc00387ac60, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:563 +0x68 fp=0xc003e53fc8 sp=0xc003e53fa0 pc=0x25f5708
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc003e53fd0 sp=0xc003e53fc8 pc=0x7b7891
created by github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).startInternal
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:562 +0x35e
We see a crash due to a use-after-finish in a samplerProcessor. We saw this twice yesterday, in #39239 and in #39190. We had hoped it was fixed by #39203, but it doesn't look like it. Luckily, it looks easy to reproduce.
I can reproduce this fairly easily by running the roachtest with a cockroach binary built from master. However, I'm having a little trouble reproducing it with DebugUseAfterFinish set to true, which would help track down where the use-after-finish is coming from.
SHA: https://github.com/cockroachdb/cockroach/commits/3b9a95bd7eb2cfa6d544fe7217852a85ec3b76f4
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1422703&tab=buildLog
The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190805-1422703/kv/contention/nodes=4/run_1
cluster.go:1726,kv.go:197,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1564984076-36-n5cpu4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
stderr:
stdout:
write
54s 0 0.0 287.2 0.0 0.0 0.0 0.0 write
55s 0 128.0 284.3 9.4 7516.2 8321.5 34359.7 write
56s 0 0.0 279.2 0.0 0.0 0.0 0.0 write
57s 0 100.0 276.1 11.0 9663.7 10737.4 10737.4 write
58s 0 288.0 276.3 10.0 4563.4 10737.4 32212.3 write
59s 0 140.0 274.0 8.1 1744.8 12348.0 26843.5 write
1m0s 0 116.0 271.4 7.6 6710.9 6710.9 24696.1 write
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
1m1s 0 128.0 269.0 9.4 5905.6 9663.7 51539.6 write
1m2s 0 116.0 266.5 8.9 4563.4 8589.9 47244.6 write
1m3s 0 60.0 263.3 9.4 5637.1 10200.5 40802.2 write
Error: EOF
Error: exit status 1
: exit status 1
cluster.go:2090,kv.go:207,test_runner.go:691: unexpected node event: 2: dead
@nvanbenschoten are you still running this? If not, happy to take over.
I was looking at it last Friday but haven't picked it back up yet this week after failing to track it down. I'm happy to hand it off if you're interested! It's very easy to hit if you run this roachtest.
Ok, it's mine now. Stay tuned.
SHA: https://github.com/cockroachdb/cockroach/commits/3db89b230b0c41e399354cbeb78c1e82c8e30004
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=kv/contention/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1424320&tab=buildLog
The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190806-1424320/kv/contention/nodes=4/run_1
cluster.go:1726,kv.go:197,cluster.go:2069,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1565072334-33-n5cpu4:5 -- ./workload run kv --init --secondary-index --duration=1h0m0s --cycle-length=512 --concurrency=128 --batch=4 --splits=4 {pgurl:1-4} returned:
stderr:
stdout:
47s 0 588.4 360.4 16.8 1275.1 2080.4 38654.7 write
48s 0 105.0 355.1 8.9 805.3 1275.1 2550.1 write
49s 0 539.9 358.8 14.7 2147.5 3355.4 18253.6 write
50s 0 84.0 353.3 10.0 2281.7 2281.7 3355.4 write
51s 0 144.0 349.2 7.6 3355.4 10737.4 13421.8 write
52s 0 344.0 349.1 9.4 2684.4 6174.0 18253.6 write
53s 0 503.9 352.1 9.4 1342.2 4026.5 28991.0 write
54s 0 32.0 346.1 6.0 1208.0 42949.7 42949.7 write
55s 0 0.0 339.8 0.0 0.0 0.0 0.0 write
56s 0 12.0 334.0 13.1 6174.0 16643.0 16643.0 write
57s 0 88.0 329.7 7.6 5100.3 7784.6 10737.4 write
Error: EOF
Error: exit status 1
: exit status 1
cluster.go:2090,kv.go:207,test_runner.go:691: unexpected node event: 4: dead
Looking at this now. The panic usually occurs very soon after the test starts, so it really is quite easy to repro (haven't tried DebugUseAfterFinish yet).
I'm afraid that I'm missing something, though. The stack traces I see always panic on the deferred call to Finish, and the panic is in net/trace's traceSet.Remove. It looks like the traceSet is nil here, because the panic comes from trying to lock the traceSet's mutex in m.Remove().
This indicates that the family we're looking up simply isn't populated in activeTraces. Long story short, I think we're more likely double-finishing the trace. The first time we do that, we reset the Family to the empty string:
https://github.com/cockroachdb/vendored/blob/master/golang.org/x/net/trace/trace.go#L741
so that the second call to Finish runs into the problem above.
A use-after-finish would be expected to panic when the trace is used to log messages into it. This also explains why @nvanbenschoten wasn't able to repro with DebugUseAfterFinish: with that flag set, we never reset the trace (note how we don't call reset() below):
https://github.com/cockroachdb/vendored/blob/master/golang.org/x/net/trace/trace.go#L924
With that in mind, I think I can just instrument net/trace temporarily to retain the stack of the first finish and tell us about it when we crash.
Well, that's disappointing. I made the change so that DebugUseAfterFinish would still call tr.reset(), and added a nice error message that would be printed prior to the crash. Guess what: it's not crashing any more. I sure hope the free list isn't important here.
^- I wasn't getting crashes because my vendor code wasn't picked up (https://github.com/cockroachdb/cockroach/issues/39392). Now it's crashing 100% of the time, alerting me to Finish being called more than once, and always from the sql subsystem.
A net/trace span's Finish() method is definitely not idempotent (and can't be made idempotent, since the trace goes right back on the free list), so it's a problem if we're calling it more than once.
Here's a condensed dump of the calls to Finish: https://gist.github.com/tbg/0fe09c77f0a39c07df445600867b0ef5
The TL;DR is that the first call is always distsqlrun.(*ProcessorBase).InternalClose on top of a *sampleAggregator or *samplerProcessor, for example:
[...]
github.com/cockroachdb/cockroach/pkg/util/tracing.FinishSpan(...)
/go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/tracer.go:625
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).InternalClose(0xc003e92a80, 0x2581b36)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:891 +0x140
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*samplerProcessor).close(0xc003e92a80)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sampler.go:364 +0x31
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.newSamplerProcessor.func1(0x4586000, 0xc00773a330, 0x0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sampler.go:157 +0x2a
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).moveToTrailingMeta(0xc003e92a80)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:755 +0x136
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).MoveToDraining(0xc003e92a80, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:657 +0x179
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*samplerProcessor).Run(0xc003e92a80, 0x4585f40, 0xc008c9ea40)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sampler.go:185 +0x13d
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).startInternal.func1(0xc00327ed20, 0xc0087de800, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:563 +0x68
created by github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).startInternal
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:562 +0x35e
The second call is very often
github.com/cockroachdb/cockroach/pkg/util/tracing.FinishSpan(0x45d7760, 0xc007eaf500)
/go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/tracer.go:625 +0x40
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*samplerProcessor).Run(0xc000fc1500, 0x4585f40, 0xc0097d63c0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sampler.go:186 +0x13e
though there's also
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*sampleAggregator).Run(0xc000580000, 0x4586000, 0xc003980780)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sample_aggregator.go:149 +0x13e
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).Run(0xc00825ed20, 0x4586000, 0xc003980780, 0x3e5b2a8, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:623 +0x1f2
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run(0xc0001bd8c0, 0xc006dc5020, 0xc003762bd0, 0xc0038633b8, 0xc0073c0000, 0xc007ff9bd8, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:324 +0x368
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).planAndRunCreateStats(0xc0001bd8c0, 0x4586000, 0xc007d16600, 0xc007ff9bd8, 0xc006dc5020, 0xc003762bd0, 0xc00667e1c0, 0xc009996140, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_plan_stats.go:254 +0x26d
The first call (from sampler.go:157) is new as of 63765fc0fb36e6464147336c64d1b7ce24174afb on July 28th, so chances are the problem was introduced there, though possibly it isn't new at all.
I am noticing that InternalClose is careful to avoid double-finishing the span, but the second callers aren't:
^- this one binds s.span early, so even if something sets it to nil, the defer will still use it
^- ditto
It's quite possible that there's a systematic problem across the distsql code here that just hasn't cropped up so far.
My theory for why this has gone unnoticed for a while is that a) the problem needs net/trace enabled, which is specific to the contention test, and b) even then, traces aren't always aggressively recycled. When net/trace holds on to a trace span for display in the UI, it obviously can't be reused. Additionally, the free list has a bounded size, so we don't actually reset all spans. Plus, doubly calling Finish() may work (not correctly, but at least without crashing) if we somehow create the empty-string family before hitting the m.Lock() on a nil m. That is, most of the time we probably reuse a family that has been reset in a new trace (before the old incarnation finishes the second time), which initializes the empty family and prevents at least the specific crash we see here.
All in all, this looks like a SQL (distsql) bug to me now, so I hope @jordanlewis can take a look and route accordingly. To repro, just make a branch, pin vendored to https://github.com/cockroachdb/vendored/tree/delme-debug-tbg, and then run
git submodule update
./build/builder.sh mkrelease
./bin/roachtest run --count 5 --cockroach cockroach-linux-2.6.32-gnu-amd64 kv/contention/nodes=4
which should fail 100% of the time, with total test durations of ~150s.
Very nice debugging!
My theory for why this has gone unnoticed for a while is that a) the problem needs net/trace enabled which is specific to the contention test
This was the missing piece that we were looking for. The behavior of the failure makes a lot more sense now.
Do you have suggestions for ways that we could make this kind of bug louder in the future? Should we enable DebugUseAfterFinish in race builds and upstream your patch in https://github.com/cockroachdb/vendored/commit/d545b62e79e32303d8b2e261a146bd54c5e3a1e8?
I haven't taken the time to understand this yet, but the two concrete issues you identify live in the sampler processors, so @rytaft would you be able to take this on for those two?
Taking a look -- thanks!
SHA: https://github.com/cockroachdb/cockroach/commits/175c5ada040fd0cbbf178636b1c551d5c2229ec4
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1417597&tab=buildLog