cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: clearrange/zfs/checks=true failed #68303

Closed: cockroach-teamcity closed this issue 2 years ago

cockroach-teamcity commented 3 years ago

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 701b177d8f4b81d8654dfb4090a2cd3cf82e63a7:

The test failed on branch=master, cloud=gce:
test timed out (see artifacts for details)
Reproduce

See: [roachtest README](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/roachtest)

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

tbg commented 2 years ago

We drop the bank table with a 20-minute TTL at 9:30, so we expect the ClearRange at 9:50, which does seem to happen:

image

The disk usage pattern that I see on n8 (the one that runs out of disk here) is exactly the same as for other nodes:

n8

image

n3

image

so whatever is happening, it seems to be happening everywhere.

@nicktrav isn't it surprising that we're using more disk space here as we use ClearRange (I suppose we still do) to nuke the ranges? I thought these range tombstones would reclaim space very efficiently.

I should get myself the artifacts from a good run (the tsdump isn't stored when the test passes) to see if the space build-up is also present on the "good" build. I would expect it not to be there, and that should be a clue.

tbg commented 2 years ago

Added support for the roachtest run ... --debug parameter to the stress job and kicked it off on the good SHA here: https://teamcity.cockroachdb.com/viewLog.html?buildId=4169735&

tbg commented 2 years ago

Ah, it had been too long. The --debug flag has no effect if the test actually passes. Ok, will do this stuff on my gceworker then.

cockroach-teamcity commented 2 years ago

roachtest.clearrange/checks=true failed with artifacts on master @ 58ceac139a7e83052171121b28026a7366f16f7e:

          | I220121 10:29:58.439007 323 workload/pgx_helpers.go:72  [-] 33  pgx logger [error]: Exec logParams=map[args:[3876645296904794892 b3] err:unexpected EOF sql:kv-2]
          | W220121 10:29:58.459437 322 workload/pgx_helpers.go:116  [-] 34  error preparing statement. name=kv-1 sql=SELECT k, v FROM kv WHERE k IN ($1) unexpected EOF
          | I220121 10:29:58.439040 308 workload/pgx_helpers.go:72  [-] 35  pgx logger [error]: Exec logParams=map[args:[-6212334672448520100 73] err:unexpected EOF sql:kv-2]
          | I220121 10:29:58.439067 143 workload/pgx_helpers.go:72  [-] 36  pgx logger [error]: Exec logParams=map[args:[-8305204117490163461 8c] err:unexpected EOF sql:kv-2]
          | Error: unexpected EOF
          | COMMAND_PROBLEM: exit status 1
          |    6: 
          | UNCLASSIFIED_PROBLEM: context canceled
          |    7: 
          | UNCLASSIFIED_PROBLEM: context canceled
          |    8: 
          | UNCLASSIFIED_PROBLEM: context canceled
          |    9: 
          | UNCLASSIFIED_PROBLEM: context canceled
          |   10: 
          | UNCLASSIFIED_PROBLEM: context canceled
        Wraps: (4) secondary error attachment
          | COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 5. Command with error:
          |   | ``````
          |   | ./cockroach workload run kv --concurrency=32 --duration=1h
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
        Wraps: (5) context canceled
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

    monitor.go:127,clearrange.go:207,clearrange.go:39,test_runner.go:780: monitor failure: monitor command failure: unexpected node event: 5: dead (exit status 137)
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:123
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runClearRange
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:207
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClearRange.func1
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:39
          | [...repeated from below...]
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:202
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (4) monitor command failure
        Wraps: (5) unexpected node event: 5: dead (exit status 137)
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString
Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

Same failure on other branches

- #73013 roachtest: clearrange/checks=true failed [C-test-failure O-roachtest O-robot T-storage branch-release-21.1] - #70306 roachtest: clearrange/zfs/checks=true failed [C-test-failure O-roachtest O-robot T-storage branch-release-21.2]

This test on roachdash | Improve this report!

tbg commented 2 years ago

Got a good and a bad run with both clusters still around. In the bad cluster, four out of 10 nodes ran out of disk and died. Luckily, I can remove the emergency ballast and start them again! That way, I can poke at the UI a bit. (I probably also have the tsdump in the artifacts, but still.)

First of all, can confirm that on the healthy run disk usage doesn't explode when the ClearRange part starts:

image

The bad run didn't even really get to that part:

image

so I don't think there's anything particular about the ClearRange here. Look how much steeper the line is for the failing test.

Poking at the store directory shows that basically all of the usage is in SSTs.

Comparing the raft specific graphs (under advanced debug) is helpful. Stark differences!

Good:

image

Bad:

image

So things really fly off the rails here, I think when we head into the ClearRange phase. You can also see this below: the good cluster pushes 124,000 appends per second, but the bad one only 612... something is very broken.


Good:

image

Bad:

image
tbg commented 2 years ago

I scrutinized the commit over and over and can't find a behavior change (other than a few allocs). But this graph is pretty damning, comparing pretty much any node in the good cluster to n1 on the bad cluster (which crashed first):

image

Something is changing the behavior here, and since the bad run crashed on the import, i.e. never got past this point

https://github.com/cockroachdb/cockroach/blob/21a286f2b2978b1cc117596865cd680dd268cf74/pkg/cmd/roachtest/tests/clearrange.go#L62-L73

this isn't anything particular about this test, I don't think. I'm going to try backing out parts of my change until things work again. Not my favorite strategy, but I don't have anything better.

tbg commented 2 years ago

I wish this were more conclusive. I ran four variations:

cb-good and cb-bad passed (sad). In both cases, the usage looked like this, i.e. nodes had easily >130GB to spare:

image

cb-bad-rem1 on the other hand went up in flames:

image

cb-bad-rem2 is still running, but is looking great and is unlikely to run out of space now.

We can "conclude" that the problem is still present in cb-bad-rem1, i.e. the changes in refreshProposalsLocked are not to blame (since they have been removed here). The diff between cb-bad-rem1 and cb-bad-rem2 is small, and hopefully I can collect more evidence (via additional runs) that it makes the difference.

diff --git a/pkg/kv/kvserver/replica_send.go b/pkg/kv/kvserver/replica_send.go
index ad67decb08..eced34ac05 100644
--- a/pkg/kv/kvserver/replica_send.go
+++ b/pkg/kv/kvserver/replica_send.go
@@ -176,7 +176,7 @@ func maybeAdjustWithBreakerError(pErr *roachpb.Error, brErr error) *roachpb.Erro
 // github.com/cockroachdb/cockroach/pkg/storage.(*Replica).sendWithRangeID(0xc420d1a000, 0x64bfb80, 0xc421564b10, 0x15, 0x153fd4634aeb0193, 0x0, 0x100000001, 0x1, 0x15, 0x0, ...)
 func (r *Replica) sendWithRangeID(
    ctx context.Context, _forStacks roachpb.RangeID, ba *roachpb.BatchRequest,
-) (_ *roachpb.BatchResponse, rErr *roachpb.Error) {
+) (*roachpb.BatchResponse, *roachpb.Error) {
    var br *roachpb.BatchResponse
    if r.leaseholderStats != nil && ba.Header.GatewayNodeID != 0 {
        r.leaseholderStats.record(ba.Header.GatewayNodeID)
@@ -193,17 +193,6 @@ func (r *Replica) sendWithRangeID(
        return nil, roachpb.NewError(err)
    }

-   // Circuit breaker handling.
-   ctx, cancel := context.WithCancel(ctx)
-   brSig, err := r.checkCircuitBreaker(ctx, cancel)
-   if err != nil {
-       return nil, roachpb.NewError(err)
-   }
-   defer func() {
-       rErr = maybeAdjustWithBreakerError(rErr, brSig.Err())
-       cancel()
-   }()
-
    if err := r.maybeBackpressureBatch(ctx, ba); err != nil {
        return nil, roachpb.NewError(err)
    }
@@ -215,7 +204,7 @@ func (r *Replica) sendWithRangeID(
    }

    // NB: must be performed before collecting request spans.
-   ba, err = maybeStripInFlightWrites(ba)
+   ba, err := maybeStripInFlightWrites(ba)
    if err != nil {
        return nil, roachpb.NewError(err)
    }
nicktrav commented 2 years ago

Thanks for all your work here @tbg!

so I don't think there's anything particular about the ClearRange here

Just wanted to confirm this was also the hunch when we started looking into this. ClearRange just happened to be a clean reproducer of the failure mode, which made the bisect simpler.

There was also a comment from @dt internally that links to #73331 (looks like you commented there already). That test doesn't appear to have failed again in the last few days.

Thanks again.

tbg commented 2 years ago

Running three of each now. Going to write down how to open the UIs, hoping that I'll never need it again:

for ip in $(roachprod list -q --json 2>/dev/null | jq '.clusters[] | select(.name | startswith("tobias")) | .vms[0]["public_ip"]' | sed 's/"//g'); do firefox "http://$ip:26258/#/metrics/hardware/cluster"; done
tbg commented 2 years ago

It's not as clear-cut as running the full clearrange, but if I short-circuit the test after the import, I have yet to see an out-of-disk or even anything "bad looking" with cb-bad-rem2. I do, however, get occasional runs (maybe two in five) on cb-bad-rem1 where capacity drops much lower than on the other runs, and in some cases runs straight into the ground, such as here:

image

Staring at this I can only guess that introducing a cancelable context in sendWithRangeID somehow causes all of these behavior changes. I have a hard time convincing myself that this can be the case, but let's see what happens if I take cb-good and add a cancelable context in there (this is now running as the cb-good-addcancel branch).

tbg commented 2 years ago

Wow. I went back to running the full clearrange test (rather than just the import), and I took the good commit (ad59351e4b8a2581d6ce53e113b4e242ac7ebc33) and added the following diff:

index 86956fb884..4b7801a508 100644
--- a/pkg/kv/kvserver/replica_send.go
+++ b/pkg/kv/kvserver/replica_send.go
@@ -109,7 +109,7 @@ func (r *Replica) Send(
 // github.com/cockroachdb/cockroach/pkg/storage.(*Replica).sendWithRangeID(0xc420d1a000, 0x64bfb80, 0xc421564b10, 0x15, 0x153fd4634aeb0193, 0x0, 0x100000001, 0x1, 0x15, 0x0, ...)
 func (r *Replica) sendWithRangeID(
        ctx context.Context, _forStacks roachpb.RangeID, ba *roachpb.BatchRequest,
-) (*roachpb.BatchResponse, *roachpb.Error) {
+) (_ *roachpb.BatchResponse, rErr *roachpb.Error) {
        var br *roachpb.BatchResponse
        if r.leaseholderStats != nil && ba.Header.GatewayNodeID != 0 {
                r.leaseholderStats.record(ba.Header.GatewayNodeID)
@@ -126,6 +126,12 @@ func (r *Replica) sendWithRangeID(
                return nil, roachpb.NewError(err)
        }

+       ctx, cancel := context.WithCancel(ctx)
+       defer func() {
+               _ = rErr
+               cancel()
+       }()
+
        if err := r.maybeBackpressureBatch(ctx, ba); err != nil {
                return nil, roachpb.NewError(err)
        }

All five reproductions went up in flames, with multiple nodes running out of disk, like this one:

image

That gives me a way to quick-fix this at least, and I'm looking forward to understanding exactly how this innocuous change can have such an outsized effect. Has to be something exotic like this causing some overhead that then ... I don't know, starves pebble compaction goroutines?

cockroach-teamcity commented 2 years ago

roachtest.clearrange/checks=true failed with artifacts on master @ dc07599dc9db1acd5afa3a6537297815f25c1fca:

          | Error: unexpected EOF
          | COMMAND_PROBLEM: exit status 1
          |    4: 
          | UNCLASSIFIED_PROBLEM: context canceled
          |    5: 
          | UNCLASSIFIED_PROBLEM: context canceled
          |    6: 
          | UNCLASSIFIED_PROBLEM: context canceled
          |    7: 
          | UNCLASSIFIED_PROBLEM: context canceled
          |    8: 
          | UNCLASSIFIED_PROBLEM: context canceled
          |    9: 
          | UNCLASSIFIED_PROBLEM: context canceled
          |   10: 
          | UNCLASSIFIED_PROBLEM: context canceled
        Wraps: (4) secondary error attachment
          | COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 3. Command with error:
          |   | ``````
          |   | ./cockroach workload run kv --concurrency=32 --duration=1h
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
        Wraps: (5) context canceled
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

    monitor.go:127,clearrange.go:207,clearrange.go:39,test_runner.go:780: monitor failure: monitor command failure: unexpected node event: 3: dead (exit status 10)
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:123
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runClearRange
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:207
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClearRange.func1
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:39
          | [...repeated from below...]
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:202
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (4) monitor command failure
        Wraps: (5) unexpected node event: 3: dead (exit status 10)
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString
Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

Same failure on other branches

- #70306 roachtest: clearrange/zfs/checks=true failed [C-test-failure O-roachtest O-robot T-storage branch-release-21.2]

This test on roachdash | Improve this report!

erikgrinaker commented 2 years ago

Has to be something exotic like this causing some overhead that then ... I don't know, starves pebble compaction goroutines?

Could it be the context cancellation itself, e.g. if there's some sort of async work that's using the caller's context and getting cancelled before it completes? I've seen this happen before with e.g. transaction record cleanup. What happens if you don't cancel the context (or if the context leak is problematic, cancel it asynchronously after a long delay)?
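
Something like the following, as a throwaway sketch (handleRequest just stands in for sendWithRangeID, and the delay is arbitrary):

```go
package experiment

import (
	"context"
	"time"
)

// handleRequest stands in for sendWithRangeID; everything except the context
// plumbing is elided. Instead of `defer cancel()`, the cancellation fires long
// after the request has returned. Leaking the context for ten minutes is fine
// for an experiment, not for production.
func handleRequest(parent context.Context) error {
	ctx, cancel := context.WithCancel(parent)
	defer time.AfterFunc(10*time.Minute, cancel) // delay is arbitrary

	_ = ctx // request processing elided
	return nil
}
```

If the disk blow-up disappears with that variant, it's the cancellation itself (not the extra allocation or the derived context) that changes behavior.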

cockroach-teamcity commented 2 years ago

roachtest.clearrange/checks=true failed with artifacts on master @ e1068d77afbd39b162978281c9da7cbea49c1c3a:

          |
          | stdout:
          | <... some data truncated by circular buffer; go to artifacts for details ...>
          | go:79  [-] 7  pgx logger [error]: Exec logParams=map[args:[-3842322598880802372 be] err:ERROR: result is ambiguous (error=rpc error: code = Unavailable desc = error reading from server: read tcp 10.142.0.39:43232->10.142.0.157:26257: read: connection reset by peer [propagate]) (SQLSTATE 40003) sql:kv-2]
          | I220123 10:05:49.755999 339 workload/pgx_helpers.go:79  [-] 8  pgx logger [error]: Exec logParams=map[args:[-6823504649527054705 1e] err:ERROR: result is ambiguous (error=rpc error: code = Unavailable desc = error reading from server: read tcp 10.142.0.39:43232->10.142.0.157:26257: read: connection reset by peer [propagate]) (SQLSTATE 40003) sql:kv-2]
          | I220123 10:05:49.753494 332 workload/pgx_helpers.go:79  [-] 9  pgx logger [error]: Exec logParams=map[args:[-3953251592303995929 c2] err:ERROR: result is ambiguous (error=rpc error: code = Unavailable desc = error reading from server: read tcp 10.142.0.39:43232->10.142.0.157:26257: read: connection reset by peer [propagate]) (SQLSTATE 40003) sql:kv-2]
          | I220123 10:05:49.753449 330 workload/pgx_helpers.go:79  [-] 4  pgx logger [error]: Exec logParams=map[args:[-3652645442218805946 23] err:ERROR: result is ambiguous (error=rpc error: code = Unavailable desc = error reading from server: read tcp 10.142.0.39:43232->10.142.0.157:26257: read: connection reset by peer [propagate]) (SQLSTATE 40003) sql:kv-2]
          | I220123 10:05:49.753579 350 workload/pgx_helpers.go:79  [-] 11  pgx logger [error]: Exec logParams=map[args:[-3718566393622470452 ef] err:ERROR: result is ambiguous (error=rpc error: code = Unavailable desc = error reading from server: read tcp 10.142.0.39:43232->10.142.0.157:26257: read: connection reset by peer [propagate]) (SQLSTATE 40003) sql:kv-2]
          | I220123 10:05:49.755994 338 workload/pgx_helpers.go:79  [-] 10  pgx logger [error]: Exec logParams=map[args:[-6636874609350470789 3c] err:ERROR: result is ambiguous (error=rpc error: code = Unavailable desc = error reading from server: read tcp 10.142.0.39:43232->10.142.0.157:26257: read: connection reset by peer [propagate]) (SQLSTATE 40003) sql:kv-2]
          | I220123 10:05:49.753449 340 workload/pgx_helpers.go:79  [-] 12  pgx logger [error]: Exec logParams=map[args:[-6649699782819420343 48] err:ERROR: result is ambiguous (error=rpc error: code = Unavailable desc = error reading from server: read tcp 10.142.0.39:43232->10.142.0.157:26257: read: connection reset by peer [propagate]) (SQLSTATE 40003) sql:kv-2]
          | I220123 10:05:49.753450 335 workload/pgx_helpers.go:79  [-] 13  pgx logger [error]: Exec logParams=map[args:[-3851759784749341330 08] err:ERROR: result is ambiguous (error=rpc error: code = Unavailable desc = error reading from server: read tcp 10.142.0.39:43232->10.142.0.157:26257: read: connection reset by peer [propagate]) (SQLSTATE 40003) sql:kv-2]
          | I220123 10:05:49.753464 348 workload/pgx_helpers.go:79  [-] 15  pgx logger [error]: Exec logParams=map[args:[3418797795528818086 61] err:ERROR: result is ambiguous (error=rpc error: code = Unavailable desc = error reading from server: read tcp 10.142.0.39:43232->10.142.0.157:26257: read: connection reset by peer [propagate]) (SQLSTATE 40003) sql:kv-2]
          | I220123 10:05:49.756028 328 workload/pgx_helpers.go:79  [-] 14  pgx logger [error]: Exec logParams=map[args:[-6282366828672893145 73] err:ERROR: result is ambiguous (error=rpc error: code = Unavailable desc = error reading from server: read tcp 10.142.0.39:43232->10.142.0.157:26257: read: connection reset by peer [propagate]) (SQLSTATE 40003) sql:kv-2]
          | I220123 10:05:49.756030 342 workload/pgx_helpers.go:79  [-] 16  pgx logger [error]: Exec logParams=map[args:[-6833511911441222454 31] err:ERROR: result is ambiguous (error=rpc error: code = Unavailable desc = error reading from server: read tcp 10.142.0.39:43232->10.142.0.157:26257: read: connection reset by peer [propagate]) (SQLSTATE 40003) sql:kv-2]
          | Error: ERROR: result is ambiguous (error=rpc error: code = Unavailable desc = error reading from server: read tcp 10.142.0.39:43232->10.142.0.157:26257: read: connection reset by peer [propagate]) (SQLSTATE 40003)
          | COMMAND_PROBLEM: exit status 1
        Wraps: (4) secondary error attachment
          | COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 1. Command with error:
          |   | ``````
          |   | ./cockroach workload run kv --concurrency=32 --duration=1h
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
        Wraps: (5) context canceled
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

    monitor.go:127,clearrange.go:207,clearrange.go:39,test_runner.go:780: monitor failure: monitor task failed: pq: query execution canceled due to statement timeout
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:123
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runClearRange
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:207
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClearRange.func1
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:39
          | [...repeated from below...]
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:171
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (4) monitor task failed
        Wraps: (5) pq: query execution canceled due to statement timeout
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *pq.Error
Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

Same failure on other branches

- #70306 roachtest: clearrange/zfs/checks=true failed [C-test-failure O-roachtest O-robot T-storage branch-release-21.2]

This test on roachdash | Improve this report!

tbg commented 2 years ago

Good theory. I'm currently testing it out with this diff (basically verifying whether the ctx cancellation is observed by anyone after the fact). It's still running, but it looks like the kvserver test package is going to pass, sadly. Going to try some bulk-related tests after that, and run a few instances of clearrange where I don't actually cancel the context.

cockroach-teamcity commented 2 years ago

roachtest.clearrange/checks=true failed with artifacts on master @ 8cd28089c6c7333615ba3201e841839001d2f0e1:

          |   918.0s        0            0.0         3598.9      0.0      0.0      0.0      0.0 write
          |   919.0s        0            0.0         3594.9      0.0      0.0      0.0      0.0 write
          |   920.0s        0            0.0         3591.0      0.0      0.0      0.0      0.0 write
          | _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
          |   921.0s        0            0.0         3587.1      0.0      0.0      0.0      0.0 write
          |   922.0s        0            0.0         3583.2      0.0      0.0      0.0      0.0 write
          |   923.0s        0            0.0         3579.4      0.0      0.0      0.0      0.0 write
          |   924.0s        0            0.0         3575.5      0.0      0.0      0.0      0.0 write
          | I220124 10:18:58.695178 328 workload/pgx_helpers.go:79  [-] 3  pgx logger [error]: Exec logParams=map[args:[3076834385639398857 b2] err:ERROR: result is ambiguous (error=rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout [propagate]) (SQLSTATE 40003) sql:kv-2]
          | Error: ERROR: result is ambiguous (error=rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout [propagate]) (SQLSTATE 40003)
          | COMMAND_PROBLEM: exit status 1
        Wraps: (4) COMMAND_PROBLEM
        Wraps: (5) Node 1. Command with error:
          | ``````
          | ./cockroach workload run kv --concurrency=32 --duration=1h
          | ``````
        Wraps: (6) exit status 1
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError

    monitor.go:127,clearrange.go:207,clearrange.go:39,test_runner.go:780: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:123
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runClearRange
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:207
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClearRange.func1
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:39
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:780
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:171
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:80
          | runtime.doInit
          |     /usr/local/go/src/runtime/proc.go:6498
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:238
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

Same failure on other branches

- #70306 roachtest: clearrange/zfs/checks=true failed [C-test-failure O-roachtest O-robot T-storage branch-release-21.2]

This test on roachdash | Improve this report!

tbg commented 2 years ago

Was unable to trigger the assertion at all. This suggests that the context cancellation itself is not the problem; rather, the problem is that we're making a cancelable context at all. This doesn't seem to make much sense, so I'm going to verify it again with a suite of clearrange runs.

tbg commented 2 years ago

After 3/3 bad runs on cb-bad-addcancel (https://github.com/tbg/cockroach/compare/cb-good...tbg:cb-good-addcancel?expand=1; both ran the entire clearrange, and cb-good passed all), where the context was made cancelable and replaced the original context, I'm going to run three of these: https://github.com/tbg/cockroach/compare/cb-good...tbg:cb-good-addcancel-2?expand=1 (where we make the same context, but it can't possibly make any difference except a few allocs). I'm also doing three where the cancel() isn't even invoked (https://github.com/tbg/cockroach/compare/cb-good...tbg:cb-good-addcancel-3?expand=1).

tbg commented 2 years ago

Finally getting somewhere here.

The cancelable context broke checksum recomputations:

https://github.com/cockroachdb/cockroach/blob/3a80d0302fb5dab3c5e1f3960d8eca8ef8ab8201/pkg/kv/kvserver/replica_proposal.go#L234-L235

(ctx above is the proposal ctx, i.e. is now canceled once the request comes out of raft).
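
In other words, it's the classic trap of async work inheriting a request-scoped context. A toy sketch of the shape of the problem (not the actual kvserver code):

```go
package experiment

import (
	"context"
	"fmt"
	"time"
)

// serveRequest is a stand-in for a replica serving a request. The checksum
// computation runs asynchronously but holds on to the request's context; once
// the request returns and the deferred cancel() fires, the background work
// observes ctx.Done() and gives up, so no checksum is ever recorded.
func serveRequest(parent context.Context) {
	ctx, cancel := context.WithCancel(parent)
	defer cancel() // added by the offending commit; fires when the request returns

	go func() {
		select {
		case <-time.After(time.Second): // stands in for the slow checksum computation
			fmt.Println("checksum computed")
		case <-ctx.Done():
			fmt.Println("checksum aborted:", ctx.Err()) // what we're actually seeing
		}
	}()
}
```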

This explains these log messages (which I can only interpret with hindsight, but would've also been a clue):

E220123 10:04:44.680112 4697337 kv/kvserver/queue.go:1095 ⋮ [n5,consistencyChecker,s5,r10046/1:‹/Table/59/1/{240-320}›] 6633 computing own checksum: ‹rpc error: code = Unknown desc = no checksum found (ID = 91e98c39-3b18-42d3-9aa2-83ecc9f108d4)›

We probably have a lot of ranges whose stats are estimates that clock in at "0 bytes", due to the bank fixture import. Consistency checks, as a byproduct, trigger RecomputeStatsRequests that fix these up. The consistency checks were not succeeding due to the above, so the stats remained wrong. ClearRange was then likely using write batches to delete 2TB of data, running the nodes out of disk. As of https://github.com/cockroachdb/cockroach/pull/74674 (which landed after the offending commit cb-bad, so it has been absent from all of my repro attempts), ClearRange should always write rangedels in this case, so this reasoning doesn't extend to recent master, where we still see failures of this test.
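
For intuition, here's the difference between the two deletion strategies, sketched against the pebble API (simplified; the real decision is made in the ClearRange evaluation path and goes through our storage layer):

```go
package experiment

import "github.com/cockroachdb/pebble"

// clearSpanPointwise deletes a span via a write batch of point deletes: one
// tombstone per key, i.e. write volume proportional to the amount of data
// being deleted, which is what can run a node out of disk.
func clearSpanPointwise(db *pebble.DB, keys [][]byte) error {
	b := db.NewBatch()
	for _, k := range keys {
		if err := b.Delete(k, nil); err != nil {
			return err
		}
	}
	return b.Commit(pebble.Sync)
}

// clearSpanWithRangeDel deletes the same span with a single small range
// tombstone; compactions can later drop whole sstables underneath it cheaply.
func clearSpanWithRangeDel(db *pebble.DB, start, end []byte) error {
	b := db.NewBatch()
	if err := b.DeleteRange(start, end, nil); err != nil {
		return err
	}
	return b.Commit(pebble.Sync)
}
```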

In fact though, looking at the test failure two up from this comment, we don't see any out-of-disk. The nodes in the cluster are up, but n5 crashed with exit code 8, and it still has disk available (according to diskusage.txt). This suggests that we're perhaps looking at a different failure mode on master, or other places where the ctx cancellation wreaks havoc, which should now be easy to find.

erikgrinaker commented 2 years ago

The cancelable context broke checksum recomputations:

https://github.com/cockroachdb/cockroach/blob/3a80d0302fb5dab3c5e1f3960d8eca8ef8ab8201/pkg/kv/kvserver/replica_proposal.go#L234-L235

(ctx above is the proposal ctx, i.e. is now canceled once the request comes out of raft).

That link points to a fixed version of the code. You probably mean on master, e.g.:

https://github.com/cockroachdb/cockroach/blob/6664d0c34df0fea61de4fff1e97987b7de609b9e/pkg/kv/kvserver/replica_proposal.go#L239

tbg commented 2 years ago

Today's failure:

    | Error: ERROR: result is ambiguous (error=rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout [propagate]) (SQLSTATE 40003)

No nodes crashed there either.

tbg commented 2 years ago

I don't have a great theory for what is wrong on master then, but the failing consistency checks might still have something to do with it. They could lead to a worse distribution (but of what? There is no load). I think it's worth understanding this, even if fixing the ctx for the checksum magically fixes all the issues.

tbg commented 2 years ago

Consistent with the theory above, the various experiments I had running (where we don't actually cancel the send ctx) passed. I'm now trying the bad SHA + the fix, which I also expect to pass.

tbg commented 2 years ago

The "bad commit" passed 3/3 with #75448, which is pretty conclusive given that the previous pass rate was 0% over maybe a dozen runs.

tbg commented 2 years ago

I still need to confirm this, but I don't think that the cancellations of the consistency checker somehow "overloaded" the system. Rather, what we are seeing is how consistent stats are linked to stability, or rather, that wildly incorrect stats are a liability. We have so far not explicitly investigated this link.

The fact that this test also continued to fail on master after the more obvious ClearRange problem (#74674) was fixed means that there is a lingering problem here that will manifest whenever the consistency checker doesn't run at the cadence at which we test (or is outright disabled).

I'm going to see what reproduction rate I can get for the current master failure mode when I intentionally disable the consistency checker. I will then file a separate issue with my findings and consider this issue closed.

tbg commented 2 years ago

Running master @ e8a0b75e227bf2b07207383f1ed7173da8321538 with https://github.com/cockroachdb/cockroach/pull/75448 cherry-picked on top, and both with and without the following diff:

diff --git a/pkg/cmd/roachtest/tests/clearrange.go b/pkg/cmd/roachtest/tests/clearrange.go
index 81d967f253..9f7a11c583 100644
--- a/pkg/cmd/roachtest/tests/clearrange.go
+++ b/pkg/cmd/roachtest/tests/clearrange.go
@@ -66,6 +66,12 @@ func runClearRange(ctx context.Context, t test.Test, c cluster.Cluster, aggressi
        t.Status("restoring fixture")
        c.Start(ctx, t.L(), option.DefaultStartOpts(), install.MakeClusterSettings())

+       {
+               // Disable consistency checker.
+               db := c.Conn(ctx, t.L(), 1)
+               db.Exec(`SET CLUSTER SETTING server.consistency_check.interval = '0s'`)
+       }
+
        // NB: on a 10 node cluster, this should take well below 3h.
        tBegin := timeutil.Now()
        c.Run(ctx, c.Node(1), "./cockroach", "workload", "fixtures", "import", "bank",

My expectation will be that with this diff, we'll sporadically see the failure modes we've had in this issue that were not out-of-disk problems.

cockroach-teamcity commented 2 years ago

roachtest.clearrange/checks=true failed with artifacts on master @ c4c5ca2fdd5a641433a85a28d4dfd3bd4443015d:

          | I220125 10:33:09.997437 306 workload/pgx_helpers.go:79  [-] 23  pgx logger [error]: Exec logParams=map[args:[-2988038615385819967 1d] err:unexpected EOF sql:kv-2]
          | I220125 10:33:09.993355 315 workload/pgx_helpers.go:79  [-] 24  pgx logger [error]: Exec logParams=map[args:[-1172248208485193873 84] err:unexpected EOF sql:kv-2]
          | I220125 10:33:09.997455 321 workload/pgx_helpers.go:79  [-] 25  pgx logger [error]: Exec logParams=map[args:[-3738235711778727438 44] err:unexpected EOF sql:kv-2]
          | I220125 10:33:09.997479 286 workload/pgx_helpers.go:79  [-] 26  pgx logger [error]: Exec logParams=map[args:[-3705764740161759665 56] err:unexpected EOF sql:kv-2]
          | I220125 10:33:09.993411 287 workload/pgx_helpers.go:79  [-] 27  pgx logger [error]: Exec logParams=map[args:[-2963621068703909299 35] err:unexpected EOF sql:kv-2]
          | I220125 10:33:09.997496 318 workload/pgx_helpers.go:79  [-] 28  pgx logger [error]: Exec logParams=map[args:[-2884792128320511837 ca] err:unexpected EOF sql:kv-2]
          | I220125 10:33:09.993211 285 workload/pgx_helpers.go:79  [-] 29  pgx logger [error]: Exec logParams=map[args:[-1352653299656574591 74] err:unexpected EOF sql:kv-2]
          | I220125 10:33:09.997513 325 workload/pgx_helpers.go:79  [-] 30  pgx logger [error]: Exec logParams=map[args:[-3117029817878398178 ea] err:unexpected EOF sql:kv-2]
          | I220125 10:33:09.997530 283 workload/pgx_helpers.go:79  [-] 31  pgx logger [error]: Exec logParams=map[args:[-3233077485044277003 c8] err:unexpected EOF sql:kv-2]
          | I220125 10:33:09.997546 314 workload/pgx_helpers.go:79  [-] 32  pgx logger [error]: Exec logParams=map[args:[-1003769562807427333 60] err:unexpected EOF sql:kv-2]
          | I220125 10:33:09.997563 284 workload/pgx_helpers.go:79  [-] 33  pgx logger [error]: Exec logParams=map[args:[-3449262313979188154 e3] err:unexpected EOF sql:kv-2]
          | I220125 10:33:09.997580 307 workload/pgx_helpers.go:79  [-] 34  pgx logger [error]: Exec logParams=map[args:[-2842308451165463865 90] err:unexpected EOF sql:kv-2]
          | I220125 10:33:09.997596 310 workload/pgx_helpers.go:79  [-] 35  pgx logger [error]: Exec logParams=map[args:[-4107899281630403119 9b] err:unexpected EOF sql:kv-2]
          | I220125 10:33:09.997611 279 workload/pgx_helpers.go:79  [-] 36  pgx logger [error]: Exec logParams=map[args:[-2786884902433063161 3a] err:unexpected EOF sql:kv-2]
          | Error: unexpected EOF
          | COMMAND_PROBLEM: exit status 1
        Wraps: (4) secondary error attachment
          | COMMAND_PROBLEM: exit status 1
          | (1) COMMAND_PROBLEM
          | Wraps: (2) Node 10. Command with error:
          |   | ``````
          |   | ./cockroach workload run kv --concurrency=32 --duration=1h
          |   | ``````
          | Wraps: (3) exit status 1
          | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
        Wraps: (5) context canceled
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

    monitor.go:127,clearrange.go:207,clearrange.go:39,test_runner.go:780: monitor failure: monitor command failure: unexpected node event: 10: dead (exit status 10)
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:115
          | main.(*monitorImpl).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:123
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runClearRange
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:207
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClearRange.func1
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:39
          | [...repeated from below...]
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:202
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1581
        Wraps: (4) monitor command failure
        Wraps: (5) unexpected node event: 10: dead (exit status 10)
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString
Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

Same failure on other branches

- #70306 roachtest: clearrange/zfs/checks=true failed [C-test-failure O-roachtest O-robot T-storage branch-release-21.2]

This test on roachdash | Improve this report!

tbg commented 2 years ago

Foolish of me to state my expectations before the results. Left is consistency checker on, right is consistency checker off. Looks identical and no failures.

image
tbg commented 2 years ago

Failure above is out of disk:

monitor.go:127,clearrange.go:207,clearrange.go:39,test_runner.go:780: monitor failure: monitor command failure: unexpected node event: 10: dead (exit status 10)

image image image

Given that turning the consistency checker off doesn't reproduce the problem, I'll have another look at whether having these canceled consistency-checker runs can somehow impair the system (in addition to preventing the stats fixes).

Edit: reading the code here:

https://github.com/cockroachdb/cockroach/blob/34a8d5d8bcbc36d1663fc900aae4ab3b5197473a/pkg/kv/kvserver/consistency_queue.go#L149-L196

and having convinced myself that context cancellation in computeChecksumPostApply is handled well by the rate limiter, I think the cancellation bug should really end up doing less work for the consistency checks, not more.
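
To illustrate what I mean by "handled well": the internal limiter behaves like golang.org/x/time/rate here, i.e. a canceled context just pops the scan out of its wait. Rough sketch, not the actual consistency-check code:

```go
package experiment

import (
	"context"

	"golang.org/x/time/rate"
)

// scanWithLimiter mimics the shape of the checksum scan: each chunk acquires
// quota before being processed. WaitN returns an error as soon as ctx is
// canceled, so a canceled check gives up early and does strictly less work.
func scanWithLimiter(ctx context.Context, lim *rate.Limiter, chunks [][]byte) error {
	for _, chunk := range chunks {
		// Note: len(chunk) must not exceed the limiter's burst.
		if err := lim.WaitN(ctx, len(chunk)); err != nil {
			return err // cancellation stops the scan here
		}
		_ = chunk // checksum the chunk (elided)
	}
	return nil
}
```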

tbg commented 2 years ago

Just noticing that it seems to be only the checks=true flavor that is failing. In that mode, we have more calls to the consistency checker, which bypass the lastProcessed mechanism. But regardless of whether these checks go through or get canceled, there should be about the same amount of work in the system (actually less on cancellation). Still confused about why it makes a difference.

jbowens commented 2 years ago

Random drive-by comment: With the cancellations, is there anything different about the number of engine snapshots opened and closed? It looks to me like we appropriately close the snapshot if the context is cancelled. The out-of-disk failure mode and the number of range deletions in L6 (almost 2 per sstable) could possibly be explained by not reclaiming disk space due to open snapshots. Range tombstones in L6 can also be explained by sstables ingested directly into L6 during import, but I think those would have 1 range tombstone per sstable.
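
Roughly what I have in mind, sketched against the pebble API (simplified):

```go
package experiment

import "github.com/cockroachdb/pebble"

// scanAtSnapshot: as long as the snapshot (or an iterator reading through it)
// stays open, sstables that compactions have since replaced cannot be deleted
// from disk, so disk usage keeps growing even though the live LSM has shrunk.
func scanAtSnapshot(db *pebble.DB) error {
	snap := db.NewSnapshot()
	// Skipping this Close (e.g. on an early-return or context-cancellation
	// path) pins the old sstables indefinitely.
	defer snap.Close()

	// ... read through snap ...
	return nil
}
```

So the interesting question is whether any cancellation path ends up skipping the Close.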

tbg commented 2 years ago

Hmm. Interesting. I double checked too and it does not look like we're leaking snapshots. Is there a metric on number of open snaps by any chance? Or a way to hack one in?

tbg commented 2 years ago

Looks like (*DB).mu.snapshots is the list of open snapshots. I'm going to hack something that prints that regularly, but perhaps someone from the storage side could file an issue to export that in metrics.

nicktrav commented 2 years ago

perhaps someone from the storage side could file an issue to export that in metrics.

On it.

tbg commented 2 years ago

Going to try something like this

diff --git a/github.com/cockroachdb/pebble/open.go b/github.com/cockroachdb/pebble/open.go
index ce340acac5..e03d5db578 100644
--- a/github.com/cockroachdb/pebble/open.go
+++ b/github.com/cockroachdb/pebble/open.go
@@ -459,6 +459,20 @@ func Open(dirname string, opts *Options) (db *DB, _ error) {
    d.maybeScheduleFlush()
    d.maybeScheduleCompaction()

+   go func() {
+       for {
+           select {
+           case <-time.After(10*time.Second):
+           case <-d.closedCh:
+               return
+           }
+           d.mu.Lock()
+           n := len(d.mu.snapshots.toSlice())
+           d.mu.Unlock()
+           d.opts.Logger.Infof("%d snapshots open", n)
+       }
+   }()
+
    // Note: this is a no-op if invariants are disabled or race is enabled.
    //
    // Setting a finalizer on *DB causes *DB to never be reclaimed and the
jbowens commented 2 years ago

Put up cockroachdb/pebble#1472

tbg commented 2 years ago

This is from the most recent failure above (n10 went out of disk); is there anything of note here? This was printed when the node was already fairly close to going down.

I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 +__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 +    WAL         1   7.1 K       -   1.7 M       -       -       -       -   1.8 M       -       -       -     1.1
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 +      0         0     0 B    0.00   1.8 M    12 K      10     0 B       0   465 K      29     0 B       0     0.2
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 +      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 +      2         0     0 B    0.00   341 K   137 K     113   141 K      13   425 K      20   566 K       0     1.2
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 +      3         0     0 B    0.00   468 K    47 K      39   9.7 K       1   850 K      47   899 K       0     1.8
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 +      4        89   479 M    0.99   464 K   5.9 K       5    20 K      19   270 M      45   271 M       1   595.7
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 +      5      1517    22 G    1.00    47 M   3.4 G     211    11 K      11   644 M      65   898 M       1    13.6
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 +      6     10904   221 G       -    67 G    15 G      60   312 M      35    52 G   1.1 K    78 G       1     0.8
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 +  total     12510   243 G       -    18 G    18 G     438   312 M      79    71 G   1.3 K    79 G       3     3.9
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 +  flush        29
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 +compact       930     0 B   4.4 K       0          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 +  ctype       171      20     660      79       0  (default, delete, elision, move, read)
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 + memtbl         1   1.0 M
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 +zmemtbl        28    27 M
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 +   ztbl      2368    90 G
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 + bcache     246 K   3.5 G   15.2%  (score == hit-rate)
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 + tcache      14 K   9.3 M   97.2%  (score == hit-rate)
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 + titers      1491
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 + filter         -       -   87.1%  (score == utility)
tbg commented 2 years ago

Also just realized that in the same run, clearrange/checks=false also timed out, but after 6h and doing basically nothing:

Cluster successfully initialized
09:34:22 cockroach.go:214: teamcity-4198403-1643096133-31-n10cpu16: setting cluster settings
09:34:23 cockroach.go:220: SET CLUSTER SETTING
SET CLUSTER SETTING
SET CLUSTER SETTING
16:04:00 test_runner.go:797: tearing down after timeout; see teardown.log

Will probably take some more elbow grease to understand all of these different failures.

tbg commented 2 years ago

FWIW here are the test histories:

checks=false

image

checks=true

image

They both show, occasionally, this "timeout" failure mode (the 6h spikes). At least they very reliably fail. I assume that starting tomorrow, they will reliably pass (since I already got 10 passes locally), and we will be left wondering how https://github.com/cockroachdb/cockroach/pull/75448 could have such an impact.

tbg commented 2 years ago

Ah, the 6h+ failure mode is a loss of quorum during the import:

cockroach exited with code 10: Tue Jan 25 10:03:30 UTC 2022

It isn't caught by the test harness since the import is not guarded by a c.Monitor. I'll send a PR to fix that.

nicktrav commented 2 years ago

is there anything of note here?

I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 +__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-
...
I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 +      4        89   479 M    0.99   464 K   5.9 K       5    20 K      19   270 M      45   271 M       1   595.7
...

That's probably the highest write amp I've seen on a DB before, let alone a single level. I'm not confident enough to say that's a definite problem though. @jbowens - thoughts on that?

It would be interesting to see the compaction logs for the node, and we can take our new compaction log tool for a spin.

jbowens commented 2 years ago

I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 + ztbl 2368 90 G

This looks concerning ^. It's saying that there are 90 GB of sstables that are no longer part of the LSM but can't be deleted because there exists an iterator that's reading a version of the LSM that included those files.

tbg commented 2 years ago

The artifacts are from this run: https://github.com/cockroachdb/cockroach/issues/68303#issuecomment-1021041661 The pebble logs should be in there.

tbg commented 2 years ago

This looks concerning ^. It's saying that there are 90 GB of sstables that are no longer part of the LSM but can't be deleted because there exists an iterator that's reading a version of the LSM that included those files.

So, definitely iterator, not snapshot? Maybe we're leaking an iterator when the context cancels; let me do a quick check.

tbg commented 2 years ago

Bummer, it would've been really easy for this code to leak the iter, but nope, we are closing it on error:

https://github.com/cockroachdb/cockroach/blob/eec9dc306b2255c8033966b61bb0787b69018437/pkg/kv/kvserver/replica_consistency.go#L634-L642

jbowens commented 2 years ago

That's probably the highest write amp I've seen on a table before, let alone a single level. I'm not confident enough to say that's a definite problem though.

Yeah, that write amp is really high. It might be an artifact of the fact that the node restarted ~shortly before these metrics were taken.

I220125 10:27:07.367717 445 kv/kvserver/store.go:3183 ⋮ [n10,s10] 733 + total 12510 243 G - 18 G 18 G 438 312 M 79 71 G 1.3 K 79 G 3 3.9

It looks like 18 G was written to the LSM since the process started, but there's 243 G total in the LSM. I suspect L4 was much larger when the process started, and many range tombstones passed through L4, recompacting its data several times until just the 479 M was left. Range tombstones are small (hence the tiny contribution to the `in` column) but can cause a lot of compaction write activity, leading to large w-amp.
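
If I'm reading the table right, the per-level w-amp is bytes written at the level divided by bytes flowing into it, so for L4 that's roughly 270 M / 464 K ≈ 596, matching the 595.7 shown. The denominator is tiny because the incoming data is mostly small range tombstones, while the numerator includes all of the resident L4 data those tombstones forced us to rewrite.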

jbowens commented 2 years ago

So, definitely iterator, not snapshot? Maybe we're leaking an iterator when the context cancels, let me do a quick check.

Yeah, for ztbl it's iterators (including both iterators reading from a snapshot and ones that are not).

jbowens commented 2 years ago

It is possible that ~90 G of zombie tables is not unusual for this roachtest; I'm not sure. It is a very significant proportion of the overall available disk capacity, though.

tbg commented 2 years ago

Anecdotally, when the test passes cleanly (post https://github.com/cockroachdb/cockroach/pull/75448), nodes have well over 100 GiB available at all times.

tbg commented 2 years ago

It seems pretty conclusive based on the ztbl reading and the graph here

https://github.com/cockroachdb/cockroach/issues/68303#issuecomment-1019082248

that we're just leaking an iterator somewhere when the context gets canceled (please correct me if I'm wrong).

Can we start there? Doesn't pebble have a mode where it crashes on Close() if iterators are still open, and prints their stack traces?
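
If not, here's roughly the kind of hack I'd wire in around iterator construction; everything below (iterRegistry, trackedIter) is made up for illustration and not pebble's actual API:

```go
package experiment

import (
	"fmt"
	"runtime/debug"
	"sync"
)

// trackedIter pairs a (hypothetical) iterator with the stack that created it.
type trackedIter struct {
	id    int
	stack []byte
}

// iterRegistry records every open iterator so leaks can be reported when the
// store shuts down.
type iterRegistry struct {
	mu   sync.Mutex
	next int
	open map[int][]byte
}

func newIterRegistry() *iterRegistry {
	return &iterRegistry{open: map[int][]byte{}}
}

// track would be called wherever an iterator is created.
func (r *iterRegistry) track() *trackedIter {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.next++
	it := &trackedIter{id: r.next, stack: debug.Stack()}
	r.open[it.id] = it.stack
	return it
}

// untrack would be called from the iterator's Close.
func (r *iterRegistry) untrack(it *trackedIter) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.open, it.id)
}

// checkLeaks is the "crash and print stack traces on Close" behavior: call it
// when the DB is closed and it panics with the creation stacks of anything
// still open.
func (r *iterRegistry) checkLeaks() {
	r.mu.Lock()
	defer r.mu.Unlock()
	for id, stack := range r.open {
		fmt.Printf("leaked iterator %d, created at:\n%s\n", id, stack)
	}
	if n := len(r.open); n > 0 {
		panic(fmt.Sprintf("%d iterators still open", n))
	}
}
```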