cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.91k stars 3.78k forks source link

kv/kvnemesis: TestKVNemesisMultiNode failed #131005

Open cockroach-teamcity opened 3 days ago

cockroach-teamcity commented 3 days ago

kv/kvnemesis.TestKVNemesisMultiNode failed with artifacts on release-23.1 @ 9d6bfa8c50584f8092dd43c8a139959a584fd72f:

=== RUN   TestKVNemesisMultiNode
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/1f42cf5be2fc021646bf9b2daf5eaef3/logTestKVNemesisMultiNode683581851
    test_log_scope.go:79: use -show-logs to present logs inline
    kvnemesis_test.go:180: seed: 9201876552453284660
    kvnemesis_test.go:124: kvnemesis logging to /artifacts/tmp/_tmp/1f42cf5be2fc021646bf9b2daf5eaef3/kvnemesis3556606767
    kvnemesis.go:165: error applying x.GetForUpdate(ctx, tk(3100587421328275948)) // WriteTooOldError: write for key /Table/100/"2b077fd50350f9ec" at timestamp 1726742753.752932905,0 too old; wrote at 1726742753.795305116,1: {<nil> 0 {0xc001a83320} <nil> 1726742753.934287619,0}
    kvnemesis.go:185: failures(verbose): /artifacts/tmp/_tmp/1f42cf5be2fc021646bf9b2daf5eaef3/kvnemesis3556606767/failures
        repro steps: /artifacts/tmp/_tmp/1f42cf5be2fc021646bf9b2daf5eaef3/kvnemesis3556606767/repro.go
        rangefeed KVs: /artifacts/tmp/_tmp/1f42cf5be2fc021646bf9b2daf5eaef3/kvnemesis3556606767/kvs-rangefeed.txt
        scan KVs: /artifacts/tmp/_tmp/1f42cf5be2fc021646bf9b2daf5eaef3/kvnemesis3556606767/kvs-scan.txt
    kvnemesis_test.go:207: 
            Error Trace:    github.com/cockroachdb/cockroach/pkg/kv/kvnemesis/pkg/kv/kvnemesis/kvnemesis_test.go:207
                                        github.com/cockroachdb/cockroach/pkg/kv/kvnemesis/pkg/kv/kvnemesis/kvnemesis_test.go:160
            Error:          Should be zero, but was 1
            Test:           TestKVNemesisMultiNode
            Messages:       kvnemesis detected failures
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/1f42cf5be2fc021646bf9b2daf5eaef3/logTestKVNemesisMultiNode683581851
--- FAIL: TestKVNemesisMultiNode (18.73s)

Parameters:

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

- #129194 kv/kvnemesis: TestKVNemesisMultiNode failed [C-test-failure O-robot P-3 T-kv X-duplicate branch-master no-test-failure-activity] - #118471 kv/kvnemesis: TestKVNemesisMultiNode failed [A-testing C-bug C-test-failure O-robot P-3 T-kv branch-release-23.2]

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-42333

tbg commented 2 days ago

I have questions! The problem here is that db.GetForUpdate returns a WriteTooOldError:

db1.GetForUpdate(ctx, tk(3100587421328275948)) // WriteTooOldError: write for key /Table/100/"2b077fd50350f9ec" at timestamp 1726742753.752932905,0 too old; wrote at 1726742753.795305116,1
  // ^-- trace in: /artifacts/tmp/_tmp/1f42cf5be2fc021646bf9b2daf5eaef3/kvnemesis3556606767/w3_step21_trace

https://gist.github.com/tbg/62285a4fefe06d2e4fae15d3a944471e

What I don't understand is how non-transactional GetForUpdate is even supposed to work - it was introduced in daf42d6b8c87cbd3c0cf831b0cd8f7b97fafd3f5 alongside a unit test, which is the only meaningful caller (I'm not going to count this). There are probably a few more non-transactional callers that instead call (*Batch).GetForUpdate on a non-transactional batch, which is also what kvnemesis did (even though it prints it as db.GetForUpdate).

Non-txn'al GetForUpdate seems strange - when are these locks released - and I suspect this got introduced by accident and we can just remove it.

I'll hold off on further steps until I've deliberated with @nvanbenschoten. Either way, this isn't going to block any releases, as it's not regressed in any way ever since introduction back in 2021.

@miraradeva kindly pointed me at prior discussion on this on (internal) Slack here.