| Error: driver: bad connection
Moving to SQL-Queries for the same reasons as the master sibling; see https://github.com/cockroachdb/cockroach/issues/69545#issuecomment-918114578
This does not seem like a blocker, but I'm also not sure this is a SQL Queries issue. @rafiss you seem to have the most familiarity with this issue -- can you comment on this?
Sure - we have seen the same type of failure in #68574. That was (supposed to be) fixed by #68665, which addresses a lib/pq issue. I don't yet know how it could still be happening, but I'll look more closely again.
@rafiss did you get a chance to take a look at this at all?
I did look for a few hours and couldn't repro or find anything helpful yet.
Have you tried reproducing via something like GCE_PROJECT=andrei-jepsen COUNT=50 SHORT=y LOCAL=n pkg/cmd/roachtest/roachstress.sh while on 99a4816fc272228a63df20dae3cc41d235e705f3? Anecdotally this seems to happen every ~10-20 days, so 50 iterations should give you something (or suggest that somehow, the problem has disappeared).
I ran with roachstress locally and with a smaller count; I didn't know I could use a GCE project. I'll try that
I spent a couple hours poring over the artifacts but couldn't find anything. The only idea I had was to come up with some way for the client to signal all the nodes in the cluster to panic at once when an unexpected error like this occurs, with the understanding that panic will dump a core when GOTRACEBACK=crash is specified.
FYI using this test as a guinea pig for stressing a roachtest in CI: https://github.com/cockroachdb/cockroach/pull/70435#issuecomment-923062021
Will let you know if that turns up a repro.
@cucaroach when the roachtest times out, the test harness sends SIGSEGV to ~everything to get stack traces, and we already collect the cores (though not as artifacts; with --debug you'll find them in /mnt/data1/cores or thereabouts).
Seems difficult to trigger this at the right time though; won't the connection already be gone by the time you realize something's amiss? Maybe running workload with pretty intense debugging can help? Anyway, the first order of the day is to see if this even repros any more, I guess.
@tbg, thanks, I didn't know that. I have some changes I'm playing with to re-enable core gathering and enable gzip compression of cores in order to streamline OOM investigations (#69317). Maybe --debug could drive core gathering as well if we don't want to do it all the time? They usually compress very well. I'm assuming that in cases like this, even though we got a "bad connection", the node is still running, since there doesn't appear to be anything amiss in the logs. But yeah, any hints about why the connection died could be long gone by the time the core is generated; who knows, we might get lucky.
With GCE_PROJECT=andrei-jepsen COUNT=50 SHORT=y LOCAL=n pkg/cmd/roachtest/roachstress.sh I finally got a repro on the second-to-last test run (out of 50 total). So @tbg's estimate of the flake rate is pretty spot on. Will try to look at the artifacts again to see if I can figure anything out.
roachtest.ycsb/F/nodes=3 failed with artifacts on release-21.2 @ c46f5a5a098577b936e56f03d20c97300b4cce61:
| 113.0s 0 3490.9 2734.8 1.8 39.8 113.2 218.1 read
| 113.0s 0 3525.9 2737.1 15.2 41.9 75.5 285.2 readModifyWrite
| 114.0s 0 3548.0 2741.9 1.7 31.5 121.6 192.9 read
| 114.0s 0 3646.0 2745.1 14.2 41.9 100.7 243.3 readModifyWrite
| 115.0s 0 3466.5 2748.2 2.0 31.5 121.6 184.5 read
| 115.0s 0 3471.5 2751.4 16.8 41.9 79.7 243.3 readModifyWrite
| 116.0s 0 3219.9 2752.3 2.2 39.8 96.5 151.0 read
| 116.0s 0 3255.9 2755.7 18.9 48.2 71.3 243.3 readModifyWrite
| 117.0s 0 3455.5 2758.3 1.8 29.4 134.2 251.7 read
| 117.0s 0 3517.5 2762.2 15.7 41.9 88.1 352.3 readModifyWrite
| 118.0s 0 3765.9 2766.8 1.9 25.2 100.7 218.1 read
| 118.0s 0 3700.9 2770.2 16.3 41.9 75.5 251.7 readModifyWrite
| 119.0s 0 3563.3 2773.5 1.8 39.8 125.8 226.5 read
| 119.0s 0 3655.3 2777.6 15.2 41.9 83.9 218.1 readModifyWrite
| 120.0s 0 3635.6 2780.7 2.0 35.7 92.3 151.0 read
| 120.0s 0 3704.6 2785.4 16.8 44.0 65.0 184.5 readModifyWrite
Wraps: (4) exit status 20
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError
monitor.go:128,ycsb.go:73,ycsb.go:90,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
(1) attached stack trace
-- stack trace:
| main.(*monitorImpl).WaitE
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
| main.(*monitorImpl).Wait
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerYCSB.func1
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/ycsb.go:73
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerYCSB.func2
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/ycsb.go:90
| main.(*testRunner).runTest.func2
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
Wraps: (2) monitor failure
Wraps: (3) attached stack trace
-- stack trace:
| main.(*monitorImpl).wait.func2
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
Wraps: (4) monitor task failed
Wraps: (5) attached stack trace
-- stack trace:
| main.init
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
| runtime.doInit
| /usr/local/go/src/runtime/proc.go:6309
| runtime.main
| /usr/local/go/src/runtime/proc.go:208
| runtime.goexit
| /usr/local/go/src/runtime/asm_amd64.s:1371
Wraps: (6) t.Fatal() was called
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)
- #69545 roachtest: ycsb/F/nodes=3/cpu=32 failed [bad connection] [similar to 69523] [C-test-failure O-roachtest O-robot T-sql-experience T-sql-queries branch-master]
- #69523 roachtest: ycsb/F/nodes=3 failed [bad connection] [C-test-failure O-roachtest O-robot branch-master]
/cc @cockroachdb/kv-triage
Something I noticed the other day while using the debugger on some local unit test was that if you pause the process for a while, you can also get a bad connection from pgx (I think). So I wonder if there's some timeout in pgx that, if it passes, abandons the connection.
Worth looking into -- though to clarify, our ycsb workload does not use pgx. Only database/sql + lib/pq.
roachtest.ycsb/F/nodes=3 failed with artifacts on release-21.2 @ 24021ba163e4ac438b169d575cf1527a4aae394d:
| 113.0s 0 4498.9 3766.9 2.0 15.2 35.7 60.8 read
| 113.0s 0 4608.9 3757.8 14.7 35.7 46.1 96.5 readModifyWrite
| 114.0s 0 4275.2 3771.3 2.1 14.7 30.4 79.7 read
| 114.0s 0 4286.2 3762.5 16.3 39.8 52.4 121.6 readModifyWrite
| 115.0s 0 4659.6 3779.0 2.2 18.9 44.0 83.9 read
| 115.0s 0 4606.6 3769.8 14.2 33.6 48.2 79.7 readModifyWrite
| 116.0s 0 4673.0 3786.7 2.2 17.8 39.8 71.3 read
| 116.0s 0 4769.0 3778.4 14.2 32.5 44.0 75.5 readModifyWrite
| 117.0s 0 4760.2 3795.1 2.1 17.8 39.8 71.3 read
| 117.0s 0 4729.2 3786.6 14.2 33.6 46.1 71.3 readModifyWrite
| 118.0s 0 4603.0 3801.9 2.2 16.8 32.5 65.0 read
| 118.0s 0 4775.9 3794.9 14.7 33.6 46.1 67.1 readModifyWrite
| 119.0s 0 4253.1 3805.7 2.4 17.8 39.8 83.9 read
| 119.0s 0 4334.1 3799.5 15.7 39.8 56.6 79.7 readModifyWrite
| 120.0s 0 4339.4 3810.2 2.4 16.3 37.7 79.7 read
| 120.0s 0 4280.5 3803.5 16.8 39.8 50.3 75.5 readModifyWrite
Wraps: (4) exit status 20
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError
monitor.go:128,ycsb.go:73,ycsb.go:90,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
(1) attached stack trace
-- stack trace:
| main.(*monitorImpl).WaitE
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
| main.(*monitorImpl).Wait
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerYCSB.func1
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/ycsb.go:73
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerYCSB.func2
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/ycsb.go:90
| main.(*testRunner).runTest.func2
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
Wraps: (2) monitor failure
Wraps: (3) attached stack trace
-- stack trace:
| main.(*monitorImpl).wait.func2
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
Wraps: (4) monitor task failed
Wraps: (5) attached stack trace
-- stack trace:
| main.init
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
| runtime.doInit
| /usr/local/go/src/runtime/proc.go:6309
| runtime.main
| /usr/local/go/src/runtime/proc.go:208
| runtime.goexit
| /usr/local/go/src/runtime/asm_amd64.s:1371
Wraps: (6) t.Fatal() was called
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)
- #69545 roachtest: ycsb/F/nodes=3/cpu=32 failed [bad connection] [similar to 69523] [C-test-failure O-roachtest O-robot T-sql-experience T-sql-queries branch-master]
- #69523 roachtest: ycsb/F/nodes=3 failed [bad connection] [C-test-failure O-roachtest O-robot branch-master]
/cc @cockroachdb/kv-triage
Any updates to report for the triage meeting?
I'm wondering if it still could be https://github.com/lib/pq/pull/1000
I previously made https://github.com/cockroachdb/cockroach/pull/68665 to handle this, but maybe the
if ctx.Err() != nil && (errors.Is(err, ctx.Err()) || errors.Is(err, driver.ErrBadConn))
check is too strict. Perhaps it should be
if (ctx.Err() != nil && errors.Is(err, ctx.Err())) || errors.Is(err, driver.ErrBadConn)
The reason it could still be that issue is that the error always happens at 120.0s after the test starts -- which is how long the ramp-up period for this test is. After the ramp period, the context is cancelled and a new context is created. See https://github.com/cockroachdb/cockroach/blob/b50be15e0d7aa15da242d0e3949e14d65b798010/pkg/workload/cli/run.go#L443-L477
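To make that concrete, here is a rough sketch of how the relaxed check could sit in the worker loop. The structure (workFn, errCh) is simplified and only loosely based on pkg/workload/cli/run.go; treat it as an illustration of the proposed condition, not the actual source.

```go
package workloadsketch

import (
	"context"
	"database/sql/driver"
	"errors"
)

// workerRunSketch loosely mirrors the shape of pkg/workload/cli.workerRun.
// workFn and errCh are simplified stand-ins for the real arguments.
func workerRunSketch(ctx context.Context, errCh chan<- error, workFn func(context.Context) error) {
	for {
		if ctx.Err() != nil {
			return
		}
		if err := workFn(ctx); err != nil {
			// Current check (possibly too strict): only swallow the error when
			// this worker's own context is already canceled.
			//   if ctx.Err() != nil && (errors.Is(err, ctx.Err()) || errors.Is(err, driver.ErrBadConn)) { return }
			//
			// Relaxed check proposed above: also swallow ErrBadConn when it
			// leaks out of a connection poisoned by some other (ramp-up)
			// context being canceled.
			if (ctx.Err() != nil && errors.Is(err, ctx.Err())) || errors.Is(err, driver.ErrBadConn) {
				return
			}
			errCh <- err
			continue
		}
	}
}
```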
So you're suggesting we ignore ErrBadConn "always"? That seems wrong; won't this be the error you get if the node is down, etc.? If I understand the issue correctly, it also seems like the best we can do, right? When the soon-to-be-canceled connection gets returned to the connection pool, whoever picks it up will not have a canceled context, so we're essentially checking the wrong context for ctx.Err(). Silly question: why can't we pick up the PR you linked in lib/pq and be done with it?
So you're suggesting we ignore ErrBadConn "always"?
Yeah, my thinking was that there might be a weird case where the client hasn't seen the context cancelled but the connection has been closed already. I wasn't able to repro that in any smaller examples though.
That seems wrong, won't this be the error you get if the node is down etc?
ErrBadConn is the error you get if the connection is unexpectedly broken, which maybe could include the node being down. But I'm pretty sure you'd see a timeout first.
Silly question, why can't we pick up the PR you linked in lib/pq and be done with it?
We have already picked up the linked PR. https://github.com/lib/pq/pull/1000 first appears in v1.9.0 and we are now using v1.10.2. To be clear, https://github.com/lib/pq/pull/1000 is the PR that caused the behavior of ErrBadConn getting returned when the context has been canceled and you try to use a connection (see https://github.com/lib/pq/pull/1000#issuecomment-738920726).
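For illustration only, here is a minimal sketch (not from this thread or the gist below) of the kind of sequence that behavior enables: a query interrupted by context cancellation can leave the underlying connection marked bad, and reusing that same connection under a healthy context can then surface driver.ErrBadConn. The connection URL is a placeholder and the outcome is timing-dependent.

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	// Placeholder URL; any reachable CRDB/Postgres will do.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Pin a single connection so the second query is forced onto the same
	// underlying conn instead of letting the pool hand out a fresh one.
	conn, err := db.Conn(context.Background())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Cancel the context while a query is in flight, mimicking the end of the
	// ramp-up period.
	rampCtx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(10 * time.Millisecond)
		cancel()
	}()
	_, err = conn.ExecContext(rampCtx, "SELECT pg_sleep(1)")
	fmt.Printf("first query: %v\n", err) // typically a context canceled error

	// Reuse the same pinned connection with a healthy context.
	_, err = conn.ExecContext(context.Background(), "SELECT 1")
	fmt.Printf("second query: %v (ErrBadConn: %t)\n", err, errors.Is(err, driver.ErrBadConn))
}
```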
I added the following diff so I could get a stack trace of which worker was actually getting ErrBadConn:
diff --git a/pkg/workload/cli/run.go b/pkg/workload/cli/run.go
index c1d034acb1..6c842f19ce 100644
--- a/pkg/workload/cli/run.go
+++ b/pkg/workload/cli/run.go
@@ -277,6 +277,8 @@ func workerRun(
// that has been canceled. See https://github.com/lib/pq/pull/1000
return
}
+ log.Infof(ctx, "err: %+v\nctxErr: %+v", err, ctx.Err())
+ panic(err)
errCh <- err
continue
}
After running roachstress.sh for a while:
I210928 19:30:48.157155 1821 workload/cli/run.go:280 [-] 6 err: driver: bad connection
I210928 19:30:48.157155 1821 workload/cli/run.go:280 [-] 6 +(1) driver: bad connection
I210928 19:30:48.157155 1821 workload/cli/run.go:280 [-] 6 +Error types: (1) *errors.errorString
I210928 19:30:48.157155 1821 workload/cli/run.go:280 [-] 6 +ctxErr: <nil>
panic: driver: bad connection
goroutine 1821 [running]:
panic(0x20d24e0, 0xc0000ca270)
/usr/local/go/src/runtime/panic.go:1065 +0x565 fp=0xc000363eb0 sp=0xc000363de8 pc=0x462d45
github.com/cockroachdb/cockroach/pkg/workload/cli.workerRun(0x4a6f8b0, 0xc001110740, 0xc0004e6240, 0xc00073fb00, 0x0, 0xc000e36620)
/go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:281 +0x3f0 fp=0xc000363f58 sp=0xc000363eb0 pc=0x18683d0
github.com/cockroachdb/cockroach/pkg/workload/cli.runRun.func3.1(0xc00012a010, 0xc000996040, 0xc0004e6240, 0x0, 0x4a6f8b0, 0xc001110740, 0xc00073fb00, 0x74, 0xc000e36620)
/go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:477 +0xec fp=0xc000363f98 sp=0xc000363f58 pc=0x186c9ec
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc000363fa0 sp=0xc000363f98 pc=0x49d281
created by github.com/cockroachdb/cockroach/pkg/workload/cli.runRun.func3
/go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:467 +0x111
That corresponds to this line: https://github.com/cockroachdb/cockroach/blob/bdb4c1ac404a99cb679643d734f292b81487e15a/pkg/workload/cli/run.go#L475
The notable thing here is that it's coming from the non-cancelled workersCtx, so it makes sense that ctx.Err() would be nil. It seems like a bad connection remained in the pool, and for some reason the worker that picked it up didn't get retried on a different connection right away. (As per the sql/driver docs, the sql package is supposed to retry on a new connection when it sees ErrBadConn.)
I'm going to try to see where exactly ErrBadConn is coming from...
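As a caller-side illustration of that documented contract (not the fix being pursued here), a hedged sketch of retrying a bounded number of times when driver.ErrBadConn still leaks out of the pool could look like this; execWithRetry is a hypothetical helper, not workload code.

```go
package workloadretry

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"errors"
)

// execWithRetry retries an Exec a few times if driver.ErrBadConn escapes the
// pool's own retry logic (which is what the panic above shows happening).
// Hypothetical helper for illustration only.
func execWithRetry(ctx context.Context, db *sql.DB, query string, args ...interface{}) error {
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if _, err = db.ExecContext(ctx, query, args...); err == nil || !errors.Is(err, driver.ErrBadConn) {
			return err
		}
		// Saw ErrBadConn anyway; loop and try again, which should be served by
		// a different connection from the pool.
	}
	return err
}
```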
I wrote this quick test based on what workload is doing: https://gist.github.com/rafiss/698d0bdbeacfd225437db631d1a95e85
It has non-deterministic behavior. Sometimes the test passes, but sometimes it gets a transaction rolled back error where it's expecting ErrBadConn. (It changes when I enable debugging and add breakpoints.) Also, if I run it with make stress, there's a different failure entirely (it gets a "read timed out" error).
I have a non-YCSB repro and was able to reproduce the same behavior on 21.1. So this is almost certainly a workload-side problem, and even if it isn't, it is definitely not a new CRDB v21.2 bug.
It seems like a race condition in lib/pq. The test below fails somewhere around 50% of the time. It fails much less frequently when I run against a local CRDB single-node cluster, but it fails more reliably if I run against a 3-node roachprod cluster. If I add a time.Sleep after the workerRun(rampCtx, errCh, nil /* wg */, nil, workFn, debugFn) line, then I'm not able to get it to fail at all.
So I'll remove the GA-blocker tag since it shouldn't block any release. But I'll keep working on this to prevent it from flaking and will backport whatever workload fix I make to release-21.2.
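The repro test itself is not included in this thread (presumably it is the gist linked above). Purely as an illustration of the ramp-then-main-phase shape being described, a hedged sketch might look like the following; worker counts, the query, and the connection URL are all placeholders.

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"errors"
	"log"
	"sync"
	"time"

	_ "github.com/lib/pq"
)

// worker loops on a trivial query until its context is canceled, reporting any
// error that is not explained by that cancellation.
func worker(ctx context.Context, db *sql.DB, wg *sync.WaitGroup, errCh chan<- error) {
	defer wg.Done()
	for ctx.Err() == nil {
		if _, err := db.ExecContext(ctx, "SELECT 1"); err != nil {
			if ctx.Err() != nil {
				return // expected once this phase's context is canceled
			}
			errCh <- err // unexpected, e.g. driver.ErrBadConn
			return
		}
	}
}

func runPhase(ctx context.Context, db *sql.DB, errCh chan<- error, workers int) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go worker(ctx, db, &wg, errCh)
	}
	wg.Wait()
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	errCh := make(chan error, 64)

	// Ramp phase: cancel mid-flight, as the workload does at the end of ramp-up.
	rampCtx, cancelRamp := context.WithCancel(context.Background())
	go func() {
		time.Sleep(500 * time.Millisecond)
		cancelRamp()
	}()
	runPhase(rampCtx, db, errCh, 16)

	// Main phase starts immediately on the same pool. (Per the comment above,
	// inserting a time.Sleep here made the failure disappear.)
	mainCtx, cancelMain := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancelMain()
	runPhase(mainCtx, db, errCh, 16)
	close(errCh)

	for err := range errCh {
		log.Printf("worker error: %v (ErrBadConn: %t)", err, errors.Is(err, driver.ErrBadConn))
	}
}
```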
roachtest.ycsb/F/nodes=3/cpu=32 failed with artifacts on release-21.2 @ 99a4816fc272228a63df20dae3cc41d235e705f3:
Reproduce
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)
Same failure on other branches
- #69545 roachtest: ycsb/F/nodes=3/cpu=32 failed [bad connection] [similar to 69523] [C-test-failure O-roachtest O-robot branch-master]
/cc @cockroachdb/kv-triage