cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.01k stars 3.79k forks source link

sql/logictest: TestLogic_upsert failing due to hardware overload #119907

Closed cockroach-teamcity closed 6 months ago

cockroach-teamcity commented 7 months ago

pkg/sql/logictest/tests/local-vec-off/local-vec-off_test.TestLogic_upsert failed on master @ bf013ea0a5311726e65d37e8f047ce39ea2d5f10:

=== RUN   TestLogic_upsert
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestLogic_upsert280453658
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:144: cluster virtualization disabled in global scope due to issue: #76378 (expected label: C-bug)
[05:45:05] setting distsql_workmem='90583B';
    logic.go:4307: -- test log scope end --
test logs left over in: outputs.zip/logTestLogic_upsert280453658
--- FAIL: TestLogic_upsert (25.29s)
=== RUN   TestLogic_upsert/regression_35364
[05:45:09] --- progress: /var/lib/engflow/worker/work/1/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local-vec-off/local-vec-off_test_/local-vec-off_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upsert: 249 statements
[05:45:16] --- progress: /var/lib/engflow/worker/work/1/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local-vec-off/local-vec-off_test_/local-vec-off_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upsert: 260 statements
    logic.go:2964: 

        /var/lib/engflow/worker/work/1/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local-vec-off/local-vec-off_test_/local-vec-off_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upsert:1262: SELECT count(*) FROM t54456
        expected success, but found
        (XXUUU) context canceled
[05:45:25] --- progress: /var/lib/engflow/worker/work/1/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local-vec-off/local-vec-off_test_/local-vec-off_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upsert: 264 statements
    logic.go:2206: 
         /var/lib/engflow/worker/work/1/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local-vec-off/local-vec-off_test_/local-vec-off_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upsert:1267: too many errors encountered, skipping the rest of the input
[05:45:25] --- done: /var/lib/engflow/worker/work/1/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local-vec-off/local-vec-off_test_/local-vec-off_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upsert with config local-vec-off: 264 tests, 2 failures
[05:45:30] --- total progress: 264 statements
--- total: 264 tests, 2 failures
    --- FAIL: TestLogic_upsert/regression_35364 (18.62s)

Parameters:

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

Jira issue: CRDB-36379

yuzefovich commented 7 months ago

I'm guessing this should be fixed by #119953

rickystewart commented 7 months ago

Hi @yuzefovich, that PR has nothing to do with this failure. That PR is about race and deadlock tests. This is not one of those.

yuzefovich commented 7 months ago

Oops, misclick.

Still, this failure doesn't really seem actionable to me, here is what we have in logs:

I240305 05:45:18.244124 25810 5@util/log/event_log.go:32 ⋮ [T1,Vsystem,n1,client=127.0.0.1:34706,hostssl,user=root] 673 ={"Timestamp":1709617517308450268,"EventType":"create_table","Statement":"CREATE TABLE ‹test›.public.‹t54456› (‹c› INT8 PRIMARY KEY)","Tag":"CREATE TABLE","User":"root","DescriptorID":146,"TableName":"‹test.public.t54456›"}
I240305 05:45:24.391026 24556 kv/kvserver/queue.go:613 ⋮ [T1,Vsystem,n1,s1,r52/1:‹/Table/5{0-1}›,raft] 674  rate limited in MaybeAdd (merge): throttled on async limiting semaphore
...
E240305 05:45:25.801364 32955 sql/gcjob/table_garbage_collection.go:263 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=948787993275564033›] 676  delete range /Table/145 - /Table/146 failed: ‹context canceled›
E240305 05:45:25.801778 32955 jobs/registry.go:1637 ⋮ [T1,Vsystem,n1] 677  job 948787993275564033: running execution encountered retriable error: non-cancelable: attempted to delete table data: delete range /Table/145 - /Table/146: context canceled
...
E240305 05:45:25.801778 32955 jobs/registry.go:1637 ⋮ [T1,Vsystem,n1] 677 +Wraps: (8) context canceled
E240305 05:45:25.801778 32955 jobs/registry.go:1637 ⋮ [T1,Vsystem,n1] 677 +Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *markers.withMark (4) *withstack.withStack (5) *errutil.withPrefix (6) *withstack.withStack (7) *errutil.withPrefix (8) *errors.errorString
E240305 05:45:25.801968 32955 jobs/adopt.go:461 ⋮ [T1,Vsystem,n1] 678  job 948787993275564033: adoption completed with error non-cancelable: attempted to delete table data: delete range /Table/145 - /Table/146: context canceled
E240305 05:45:25.813995 27994 sql/logictest/logic.go:4431 ⋮ [-] 679 +‹/var/lib/engflow/worker/work/1/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local-vec-off/local-vec-off_test_/local-vec-off_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upsert:1262: SELECT count(*) FROM t54456›
E240305 05:45:25.813995 27994 sql/logictest/logic.go:4431 ⋮ [-] 679 +‹expected success, but found›

Table 145 is dropped right before we create t54456 (which is Table 146). This is local-vec-off config, so we're running only a single node. For some reason, there appear to be an interruption of unknown origin.

I have not seen this failure mode on logic tests in TeamCity, so I'm inclined to assume it's something about EngFlow environment.

DrewKimball commented 7 months ago

"CPUUserPercent":133.9293

DrewKimball commented 7 months ago

[05:45:05] setting distsql_workmem='90583B';

Maybe just due to metamorphic settings making things slow?

mgartner commented 7 months ago

We're seeing a few of these "context canceled" failures in the last few days. So it seems like there's some recent change to code or infrastructure that is causing this.

mgartner commented 6 months ago

It seems like this is a duplicate of #120395, correct?

yuzefovich commented 6 months ago

I think so, closing as a dup of #120395