| Error: driver: bad connection
Moving to SQL-Queries for the same reasons as the master sibling; see https://github.com/cockroachdb/cockroach/issues/69545#issuecomment-918114578
This does not seem like a blocker, but I'm also not sure this is a SQL Queries issue. @rafiss you seem to have the most familiarity with this issue -- can you comment on this?
Sure - we have seen the same type of failure in #68574. That was (supposed to be) fixed by #68665, which addresses a lib/pq issue. I don't yet know how it could still be happening, but I'll look more closely again.
@rafiss did you get a chance to take a look at this at all?
I did look for a few hours and couldn't repro or find anything helpful yet.
Have you tried reproducing via something like GCE_PROJECT=andrei-jepsen COUNT=50 SHORT=y LOCAL=n pkg/cmd/roachtest/roachstress.sh while on 99a4816fc272228a63df20dae3cc41d235e705f3? Anecdotally this seems to happen every ~10-20 days, so 50 iterations should give you something (or suggest that somehow, the problem has disappeared).
I ran with roachstress locally and with a smaller count; I didn't know I could use a GCE project. I'll try that
I spent a couple hours poring over the artifacts but couldn't find anything. The only idea I had was to come up with some way for the client to signal all the nodes in the cluster to panic at once when an unexpected error like this occurs, with the understanding that panic will dump a core when GOTRACEBACK=crash is specified.
FYI using this test as a guinea pig for stressing a roachtest in CI: https://github.com/cockroachdb/cockroach/pull/70435#issuecomment-923062021
Will let you know if that turns up a repro.
@cucaroach when the roachtest times out, the test harness sends SIGSEGV to ~everything to get stack traces, and we already collect the cores (though not as artifacts; with --debug you'll find them in /mnt/data1/cores or thereabouts).
Seems difficult to trigger this at the right time though; won't the connection already be gone by the time you realize something's amiss? Maybe running workload with pretty intense debugging can help? Anyway, the first order of the day is to see if this even repros any more, I guess.
@tbg, thanks, I didn't know that. I have some changes I'm playing with to re-enable core gathering and enable gzip compression of cores in order to streamline OOM investigations (#69317). Maybe --debug could drive core gathering as well if we don't want to do it all the time? They usually compress very well. I'm assuming that in cases like this, even though we got a "bad connection", the node is still running, since there doesn't appear to be anything amiss in the logs. But yeah, any hints about why the connection died could be long gone by the time the core is generated; who knows, we might get lucky.
With GCE_PROJECT=andrei-jepsen COUNT=50 SHORT=y LOCAL=n pkg/cmd/roachtest/roachstress.sh I finally got a repro on the second-to-last test run (out of 50 total). So @tbg's estimate of the flake rate is pretty spot on. Will try to look at the artifacts again to see if I can figure anything out.
roachtest.ycsb/F/nodes=3 failed with artifacts on release-21.2 @ c46f5a5a098577b936e56f03d20c97300b4cce61:
| 113.0s 0 3490.9 2734.8 1.8 39.8 113.2 218.1 read
| 113.0s 0 3525.9 2737.1 15.2 41.9 75.5 285.2 readModifyWrite
| 114.0s 0 3548.0 2741.9 1.7 31.5 121.6 192.9 read
| 114.0s 0 3646.0 2745.1 14.2 41.9 100.7 243.3 readModifyWrite
| 115.0s 0 3466.5 2748.2 2.0 31.5 121.6 184.5 read
| 115.0s 0 3471.5 2751.4 16.8 41.9 79.7 243.3 readModifyWrite
| 116.0s 0 3219.9 2752.3 2.2 39.8 96.5 151.0 read
| 116.0s 0 3255.9 2755.7 18.9 48.2 71.3 243.3 readModifyWrite
| 117.0s 0 3455.5 2758.3 1.8 29.4 134.2 251.7 read
| 117.0s 0 3517.5 2762.2 15.7 41.9 88.1 352.3 readModifyWrite
| 118.0s 0 3765.9 2766.8 1.9 25.2 100.7 218.1 read
| 118.0s 0 3700.9 2770.2 16.3 41.9 75.5 251.7 readModifyWrite
| 119.0s 0 3563.3 2773.5 1.8 39.8 125.8 226.5 read
| 119.0s 0 3655.3 2777.6 15.2 41.9 83.9 218.1 readModifyWrite
| 120.0s 0 3635.6 2780.7 2.0 35.7 92.3 151.0 read
| 120.0s 0 3704.6 2785.4 16.8 44.0 65.0 184.5 readModifyWrite
Wraps: (4) exit status 20
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError
monitor.go:128,ycsb.go:73,ycsb.go:90,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
(1) attached stack trace
-- stack trace:
| main.(*monitorImpl).WaitE
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
| main.(*monitorImpl).Wait
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerYCSB.func1
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/ycsb.go:73
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerYCSB.func2
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/ycsb.go:90
| main.(*testRunner).runTest.func2
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
Wraps: (2) monitor failure
Wraps: (3) attached stack trace
-- stack trace:
| main.(*monitorImpl).wait.func2
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
Wraps: (4) monitor task failed
Wraps: (5) attached stack trace
-- stack trace:
| main.init
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
| runtime.doInit
| /usr/local/go/src/runtime/proc.go:6309
| runtime.main
| /usr/local/go/src/runtime/proc.go:208
| runtime.goexit
| /usr/local/go/src/runtime/asm_amd64.s:1371
Wraps: (6) t.Fatal() was called
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)
- #69545 roachtest: ycsb/F/nodes=3/cpu=32 failed [bad connection] [similar to 69523] [C-test-failure O-roachtest O-robot T-sql-experience T-sql-queries branch-master]
- #69523 roachtest: ycsb/F/nodes=3 failed [bad connection] [C-test-failure O-roachtest O-robot branch-master]
/cc @cockroachdb/kv-triage
Something I noticed the other day while using the debugger on some local unit test was that if you pause the process for a while, you can also get a bad connection from pgx (I think). So I wonder if there's some timeout in pgx that, if it passes, abandons the connection.
Worth looking into -- though to clarify, our ycsb workload does not use pgx. Only database/sql + lib/pq.
roachtest.ycsb/F/nodes=3 failed with artifacts on release-21.2 @ 24021ba163e4ac438b169d575cf1527a4aae394d:
| 113.0s 0 4498.9 3766.9 2.0 15.2 35.7 60.8 read
| 113.0s 0 4608.9 3757.8 14.7 35.7 46.1 96.5 readModifyWrite
| 114.0s 0 4275.2 3771.3 2.1 14.7 30.4 79.7 read
| 114.0s 0 4286.2 3762.5 16.3 39.8 52.4 121.6 readModifyWrite
| 115.0s 0 4659.6 3779.0 2.2 18.9 44.0 83.9 read
| 115.0s 0 4606.6 3769.8 14.2 33.6 48.2 79.7 readModifyWrite
| 116.0s 0 4673.0 3786.7 2.2 17.8 39.8 71.3 read
| 116.0s 0 4769.0 3778.4 14.2 32.5 44.0 75.5 readModifyWrite
| 117.0s 0 4760.2 3795.1 2.1 17.8 39.8 71.3 read
| 117.0s 0 4729.2 3786.6 14.2 33.6 46.1 71.3 readModifyWrite
| 118.0s 0 4603.0 3801.9 2.2 16.8 32.5 65.0 read
| 118.0s 0 4775.9 3794.9 14.7 33.6 46.1 67.1 readModifyWrite
| 119.0s 0 4253.1 3805.7 2.4 17.8 39.8 83.9 read
| 119.0s 0 4334.1 3799.5 15.7 39.8 56.6 79.7 readModifyWrite
| 120.0s 0 4339.4 3810.2 2.4 16.3 37.7 79.7 read
| 120.0s 0 4280.5 3803.5 16.8 39.8 50.3 75.5 readModifyWrite
Wraps: (4) exit status 20
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *exec.ExitError
monitor.go:128,ycsb.go:73,ycsb.go:90,test_runner.go:777: monitor failure: monitor task failed: t.Fatal() was called
(1) attached stack trace
-- stack trace:
| main.(*monitorImpl).WaitE
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
| main.(*monitorImpl).Wait
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerYCSB.func1
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/ycsb.go:73
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerYCSB.func2
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/ycsb.go:90
| main.(*testRunner).runTest.func2
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
Wraps: (2) monitor failure
Wraps: (3) attached stack trace
-- stack trace:
| main.(*monitorImpl).wait.func2
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:172
Wraps: (4) monitor task failed
Wraps: (5) attached stack trace
-- stack trace:
| main.init
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:81
| runtime.doInit
| /usr/local/go/src/runtime/proc.go:6309
| runtime.main
| /usr/local/go/src/runtime/proc.go:208
| runtime.goexit
| /usr/local/go/src/runtime/asm_amd64.s:1371
Wraps: (6) t.Fatal() was called
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)
- #69545 roachtest: ycsb/F/nodes=3/cpu=32 failed [bad connection] [similar to 69523] [C-test-failure O-roachtest O-robot T-sql-experience T-sql-queries branch-master]
- #69523 roachtest: ycsb/F/nodes=3 failed [bad connection] [C-test-failure O-roachtest O-robot branch-master]
/cc @cockroachdb/kv-triage
Any updates to report for the triage meeting?
I'm wondering if it still could be https://github.com/lib/pq/pull/1000
I previously made https://github.com/cockroachdb/cockroach/pull/68665 to handle this, but maybe the
if ctx.Err() != nil && (errors.Is(err, ctx.Err()) || errors.Is(err, driver.ErrBadConn))
check is too strict. Perhaps it should be
if (ctx.Err() != nil && errors.Is(err, ctx.Err())) || errors.Is(err, driver.ErrBadConn)
The reason it could still be that issue is that the error always happens at 120.0s after the test starts -- which is how long the ramp-up period for this test is. After the ramp period, the context is cancelled and a new context is created. See https://github.com/cockroachdb/cockroach/blob/b50be15e0d7aa15da242d0e3949e14d65b798010/pkg/workload/cli/run.go#L443-L477
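To make that concrete, here is a rough sketch of how the relaxed check could sit in the worker loop. The structure (workFn, errCh) is simplified and only loosely based on pkg/workload/cli/run.go; treat it as an illustration of the proposed condition, not the actual source.

```go
package workloadsketch

import (
	"context"
	"database/sql/driver"
	"errors"
)

// workerRunSketch loosely mirrors the shape of pkg/workload/cli.workerRun.
// workFn and errCh are simplified stand-ins for the real arguments.
func workerRunSketch(ctx context.Context, errCh chan<- error, workFn func(context.Context) error) {
	for {
		if ctx.Err() != nil {
			return
		}
		if err := workFn(ctx); err != nil {
			// Current check (possibly too strict): only swallow the error when
			// this worker's own context is already canceled.
			//   if ctx.Err() != nil && (errors.Is(err, ctx.Err()) || errors.Is(err, driver.ErrBadConn)) { return }
			//
			// Relaxed check proposed above: also swallow ErrBadConn when it
			// leaks out of a connection poisoned by some other (ramp-up)
			// context being canceled.
			if (ctx.Err() != nil && errors.Is(err, ctx.Err())) || errors.Is(err, driver.ErrBadConn) {
				return
			}
			errCh <- err
			continue
		}
	}
}
```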
So you're suggesting we ignore ErrBadConn "always"? That seems wrong; won't this be the error you get if the node is down, etc.? If I understand the issue correctly, it also seems like the best we can do, right? When the soon-to-be-canceled connection gets returned to the connection pool, whoever picks it up will not have a canceled context, so we're essentially checking the wrong context for ctx.Err(). Silly question: why can't we pick up the PR you linked in lib/pq and be done with it?
So you're suggesting we ignore ErrBadConn "always"?
Yeah, my thinking was that there might be a weird case where the client hasn't seen the context cancelled but the connection has been closed already. I wasn't able to repro that in any smaller examples though.
That seems wrong, won't this be the error you get if the node is down etc?
ErrBadConn is the error you get if the connection is unexpectedly broken, which maybe could include the node being down. But I'm pretty sure you'd see a timeout first.
Silly question, why can't we pick up the PR you linked in lib/pq and be done with it?
We have already picked up the linked PR. https://github.com/lib/pq/pull/1000 first appears in v1.9.0 and we are now using v1.10.2. To be clear, https://github.com/lib/pq/pull/1000 is the PR that caused the behavior of ErrBadConn getting returned when the context has been canceled and you try to use a connection (see https://github.com/lib/pq/pull/1000#issuecomment-738920726).
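For illustration only, here is a minimal sketch (not from this thread or the gist below) of the kind of sequence that behavior enables: a query interrupted by context cancellation can leave the underlying connection marked bad, and reusing that same connection under a healthy context can then surface driver.ErrBadConn. The connection URL is a placeholder and the outcome is timing-dependent.

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	// Placeholder URL; any reachable CRDB/Postgres will do.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Pin a single connection so the second query is forced onto the same
	// underlying conn instead of letting the pool hand out a fresh one.
	conn, err := db.Conn(context.Background())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Cancel the context while a query is in flight, mimicking the end of the
	// ramp-up period.
	rampCtx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(10 * time.Millisecond)
		cancel()
	}()
	_, err = conn.ExecContext(rampCtx, "SELECT pg_sleep(1)")
	fmt.Printf("first query: %v\n", err) // typically a context canceled error

	// Reuse the same pinned connection with a healthy context.
	_, err = conn.ExecContext(context.Background(), "SELECT 1")
	fmt.Printf("second query: %v (ErrBadConn: %t)\n", err, errors.Is(err, driver.ErrBadConn))
}
```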
I added the following diff so I could get a stack trace of which worker was actually getting ErrBadConn:
diff --git a/pkg/workload/cli/run.go b/pkg/workload/cli/run.go
index c1d034acb1..6c842f19ce 100644
--- a/pkg/workload/cli/run.go
+++ b/pkg/workload/cli/run.go
@@ -277,6 +277,8 @@ func workerRun(
// that has been canceled. See https://github.com/lib/pq/pull/1000
return
}
+ log.Infof(ctx, "err: %+v\nctxErr: %+v", err, ctx.Err())
+ panic(err)
errCh <- err
continue
}
After running roachstress.sh for a while:
I210928 19:30:48.157155 1821 workload/cli/run.go:280 [-] 6 err: driver: bad connection
I210928 19:30:48.157155 1821 workload/cli/run.go:280 [-] 6 +(1) driver: bad connection
I210928 19:30:48.157155 1821 workload/cli/run.go:280 [-] 6 +Error types: (1) *errors.errorString
I210928 19:30:48.157155 1821 workload/cli/run.go:280 [-] 6 +ctxErr: <nil>
panic: driver: bad connection
goroutine 1821 [running]:
panic(0x20d24e0, 0xc0000ca270)
/usr/local/go/src/runtime/panic.go:1065 +0x565 fp=0xc000363eb0 sp=0xc000363de8 pc=0x462d45
github.com/cockroachdb/cockroach/pkg/workload/cli.workerRun(0x4a6f8b0, 0xc001110740, 0xc0004e6240, 0xc00073fb00, 0x0, 0xc000e36620)
/go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:281 +0x3f0 fp=0xc000363f58 sp=0xc000363eb0 pc=0x18683d0
github.com/cockroachdb/cockroach/pkg/workload/cli.runRun.func3.1(0xc00012a010, 0xc000996040, 0xc0004e6240, 0x0, 0x4a6f8b0, 0xc001110740, 0xc00073fb00, 0x74, 0xc000e36620)
/go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:477 +0xec fp=0xc000363f98 sp=0xc000363f58 pc=0x186c9ec
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc000363fa0 sp=0xc000363f98 pc=0x49d281
created by github.com/cockroachdb/cockroach/pkg/workload/cli.runRun.func3
/go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:467 +0x111
That corresponds to this line: https://github.com/cockroachdb/cockroach/blob/bdb4c1ac404a99cb679643d734f292b81487e15a/pkg/workload/cli/run.go#L475
The notable thing here is that it's coming from the non-cancelled workersCtx, so it makes sense that ctx.Err() would be nil. It seems like a bad connection remained in the pool, and for some reason the worker that picked it up didn't get retried on a different connection right away. (As per the sql/driver docs, the sql package is supposed to retry on a new connection when it sees ErrBadConn.)
I'm going to try to see where exactly ErrBadConn is coming from...
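As a caller-side illustration of that documented contract (not the fix being pursued here), a hedged sketch of retrying a bounded number of times when driver.ErrBadConn still leaks out of the pool could look like this; execWithRetry is a hypothetical helper, not workload code.

```go
package workloadretry

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"errors"
)

// execWithRetry retries an Exec a few times if driver.ErrBadConn escapes the
// pool's own retry logic (which is what the panic above shows happening).
// Hypothetical helper for illustration only.
func execWithRetry(ctx context.Context, db *sql.DB, query string, args ...interface{}) error {
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if _, err = db.ExecContext(ctx, query, args...); err == nil || !errors.Is(err, driver.ErrBadConn) {
			return err
		}
		// Saw ErrBadConn anyway; loop and try again, which should be served by
		// a different connection from the pool.
	}
	return err
}
```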
I wrote this quick test based on what workload is doing: https://gist.github.com/rafiss/698d0bdbeacfd225437db631d1a95e85
It has non-deterministic behavior. Sometimes the test passes, but sometimes it gets a transaction rolled back error where it's expecting ErrBadConn. (It changes when I enable debugging and add breakpoints.) Also, if I run it with make stress, there's a different failure entirely (it gets a "read timed out" error).
I have a non-YCSB repro and was able to reproduce the same behavior on 21.1. So this is almost certainly a workload-side problem, and even if it isn't, it is definitely not a new CRDB v21.2 bug.
It seems like a race condition in lib/pq. The test below fails somewhere around 50% of the time. It fails much less frequently when I run against a local CRDB single-node cluster, but it fails more reliably if I run against a 3-node roachprod cluster. If I add a time.Sleep after the workerRun(rampCtx, errCh, nil /* wg */, nil, workFn, debugFn) line, then I'm not able to get it to fail at all.
So I'll remove the GA-blocker tag since it shouldn't block any release. But I'll keep working on this to prevent it from flaking and will backport whatever workload fix I make to release-21.2.
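The repro test itself is not included in this thread (presumably it is the gist linked above). Purely as an illustration of the ramp-then-main-phase shape being described, a hedged sketch might look like the following; worker counts, the query, and the connection URL are all placeholders.

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"errors"
	"log"
	"sync"
	"time"

	_ "github.com/lib/pq"
)

// worker loops on a trivial query until its context is canceled, reporting any
// error that is not explained by that cancellation.
func worker(ctx context.Context, db *sql.DB, wg *sync.WaitGroup, errCh chan<- error) {
	defer wg.Done()
	for ctx.Err() == nil {
		if _, err := db.ExecContext(ctx, "SELECT 1"); err != nil {
			if ctx.Err() != nil {
				return // expected once this phase's context is canceled
			}
			errCh <- err // unexpected, e.g. driver.ErrBadConn
			return
		}
	}
}

func runPhase(ctx context.Context, db *sql.DB, errCh chan<- error, workers int) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go worker(ctx, db, &wg, errCh)
	}
	wg.Wait()
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	errCh := make(chan error, 64)

	// Ramp phase: cancel mid-flight, as the workload does at the end of ramp-up.
	rampCtx, cancelRamp := context.WithCancel(context.Background())
	go func() {
		time.Sleep(500 * time.Millisecond)
		cancelRamp()
	}()
	runPhase(rampCtx, db, errCh, 16)

	// Main phase starts immediately on the same pool. (Per the comment above,
	// inserting a time.Sleep here made the failure disappear.)
	mainCtx, cancelMain := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancelMain()
	runPhase(mainCtx, db, errCh, 16)
	close(errCh)

	for err := range errCh {
		log.Printf("worker error: %v (ErrBadConn: %t)", err, errors.Is(err, driver.ErrBadConn))
	}
}
```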
roachtest.ycsb/F/nodes=3/cpu=32 failed with artifacts on release-21.2 @ 99a4816fc272228a63df20dae3cc41d235e705f3:
Reproduce
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)
Same failure on other branches
- #69545 roachtest: ycsb/F/nodes=3/cpu=32 failed [bad connection] [similar to 69523] [C-test-failure O-roachtest O-robot branch-master]
/cc @cockroachdb/kv-triage