cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.05k stars 3.8k forks source link

roachtest: engine/switch/nodes=3 failed #45811

Closed cockroach-teamcity closed 4 years ago

cockroach-teamcity commented 4 years ago

(roachtest).engine/switch/nodes=3 failed on master@954fe69d554162aec0fbc001aad1fe5103d8df13:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200306-1790595/engine/switch/nodes=3/run_1
    test_runner.go:756: test timed out (10h0m0s)

    cluster.go:1935,engine_switch.go:129,engine_switch.go:140,test_runner.go:741: error with attached stack trace:
            main.(*cluster).RunE
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2010
            main.(*cluster).Run
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
            main.registerEngineSwitch.func1
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/engine_switch.go:129
            main.registerEngineSwitch.func2
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/engine_switch.go:140
            main.(*testRunner).runTest.func2
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:741
            runtime.goexit
                /usr/local/go/src/runtime/asm_amd64.s:1357
          - error with embedded safe details: output in %s
            -- arg 1: <string>
          - output in run_082446.331_n4_workload_check_tpcc:
          - error with attached stack trace:
            main.execCmd
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
            main.(*cluster).RunL
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
            main.(*cluster).RunE
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
            main.(*cluster).Run
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
            main.registerEngineSwitch.func1
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/engine_switch.go:129
            main.registerEngineSwitch.func2
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/engine_switch.go:140
            main.(*testRunner).runTest.func2
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:741
            runtime.goexit
                /usr/local/go/src/runtime/asm_amd64.s:1357
          - error with embedded safe details: %s returned:
            stderr:
            %s
            stdout:
            %s
            -- arg 1: <string>
            -- arg 2: <string>
            -- arg 3: <string>
          - /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1790595-1583481796-02-n4cpu4:4 -- ./workload check tpcc --warehouses=1 --expensive-checks=true {pgurl:1} returned:
            stderr:

            stdout::
          - context canceled
More

Artifacts: [/engine/switch/nodes=3](https://teamcity.cockroachdb.com/viewLog.html?buildId=1790595&tab=artifacts#/engine/switch/nodes=3) [See this test on roachdash](https://roachdash.crdb.dev/?filter=status%3Aopen+t%3A.%2Aengine%2Fswitch%2Fnodes%3D3.%2A&sort=title&restgroup=false&display=lastcommented+project) powered by [pkg/cmd/internal/issues](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/internal/issues)

cockroach-teamcity commented 4 years ago

(roachtest).engine/switch/nodes=3 failed on master@752dea867f3aeb142e98c22f8d320ce19041aa8d:

            The Cockroach Labs team appreciates your feedback.
            E200307 08:13:16.183568 1 cli/error.go:373  ERROR: exit status 255
            ERROR: exit status 255
            Failed running "start"

            github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func2
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:293
            github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1583
            runtime.goexit
                /usr/local/go/src/runtime/asm_amd64.s:1357: 
            I200307 08:13:16.229079 1 cluster_synced.go:1665  command failed:
          - exit status 1

    cluster.go:2368,engine_switch.go:126,engine_switch.go:140,test_runner.go:741: error with attached stack trace:
            main.(*monitor).WaitE
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2356
            main.(*monitor).Wait
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2364
            main.registerEngineSwitch.func1
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/engine_switch.go:126
            main.registerEngineSwitch.func2
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/engine_switch.go:140
            main.(*testRunner).runTest.func2
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:741
            runtime.goexit
                /usr/local/go/src/runtime/asm_amd64.s:1357
          - monitor failure:
          - error with attached stack trace:
            main.(*monitor).wait.func2
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2412
            runtime.goexit
                /usr/local/go/src/runtime/asm_amd64.s:1357
          - monitor task failed:
          - error with attached stack trace:
            main.init
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2309
            runtime.doInit
                /usr/local/go/src/runtime/proc.go:5222
            runtime.main
                /usr/local/go/src/runtime/proc.go:190
            runtime.goexit
                /usr/local/go/src/runtime/asm_amd64.s:1357
          - Goexit() was called

    cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:778: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1793930-1583568484-12-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
        1: dead
        3: 4177
        2: 4210
        Error:  1: dead
More

Artifacts: [/engine/switch/nodes=3](https://teamcity.cockroachdb.com/viewLog.html?buildId=1793930&tab=artifacts#/engine/switch/nodes=3) [See this test on roachdash](https://roachdash.crdb.dev/?filter=status%3Aopen+t%3A.%2Aengine%2Fswitch%2Fnodes%3D3.%2A&sort=title&restgroup=false&display=lastcommented+project) powered by [pkg/cmd/internal/issues](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/internal/issues)

petermattis commented 4 years ago

@sumeerbhola Another replica consistency issue. I just skimmed the failure and it looks like all of the differences are in timeseries data. 752dea867f3aeb142e98c22f8d320ce19041aa8d does contain your fix though, so this is something different.

cockroach-teamcity commented 4 years ago

(roachtest).engine/switch/nodes=3 failed on master@c473f40078994551cebcbe00fdbf1fa388957658:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200309-1796240/engine/switch/nodes=3/run_1
    test_runner.go:756: test timed out (10h0m0s)

    cluster.go:1935,engine_switch.go:129,engine_switch.go:140,test_runner.go:741: error with attached stack trace:
            main.(*cluster).RunE
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2010
            main.(*cluster).Run
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
            main.registerEngineSwitch.func1
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/engine_switch.go:129
            main.registerEngineSwitch.func2
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/engine_switch.go:140
            main.(*testRunner).runTest.func2
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:741
            runtime.goexit
                /usr/local/go/src/runtime/asm_amd64.s:1357
          - error with embedded safe details: output in %s
            -- arg 1: <string>
          - output in run_074057.459_n4_workload_check_tpcc:
          - error with attached stack trace:
            main.execCmd
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
            main.(*cluster).RunL
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
            main.(*cluster).RunE
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
            main.(*cluster).Run
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1933
            main.registerEngineSwitch.func1
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/engine_switch.go:129
            main.registerEngineSwitch.func2
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/engine_switch.go:140
            main.(*testRunner).runTest.func2
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:741
            runtime.goexit
                /usr/local/go/src/runtime/asm_amd64.s:1357
          - error with embedded safe details: %s returned:
            stderr:
            %s
            stdout:
            %s
            -- arg 1: <string>
            -- arg 2: <string>
            -- arg 3: <string>
          - /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1796240-1583738442-15-n4cpu4:4 -- ./workload check tpcc --warehouses=1 --expensive-checks=true {pgurl:1} returned:
            stderr:

            stdout::
          - context canceled
More

Artifacts: [/engine/switch/nodes=3](https://teamcity.cockroachdb.com/viewLog.html?buildId=1796240&tab=artifacts#/engine/switch/nodes=3) [See this test on roachdash](https://roachdash.crdb.dev/?filter=status%3Aopen+t%3A.%2Aengine%2Fswitch%2Fnodes%3D3.%2A&sort=title&restgroup=false&display=lastcommented+project) powered by [pkg/cmd/internal/issues](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/internal/issues)

sumeerbhola commented 4 years ago

The unmarshaled MVCCMetadata protos are equal. The difference is in the serialized bytes -- the Pebble one has a prefix of 120408001000180020002800. This corresponds to tags 2, 3, 4, 5 in the proto: https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/enginepb/mvcc.proto#L37-L43 These are not being explicitly set by Pebble. I wrote a small test -- serializing an MVCCMetadata in Go where these fields are not being explicitly set does produce bytes for these fields. The C++ serialization must not be producing bytes for these optional fields.

petermattis commented 4 years ago

Huh. I thought Go proto encoding elided optional fields which are the zero value.

petermattis commented 4 years ago

Huh. I thought Go proto encoding elided optional fields which are the zero value.

Or maybe that is only true of primitive fields. Hrmm...

petermattis commented 4 years ago

Or maybe that is only true of primitive fields. Hrmm...

Not even true for primitive fields:

    dAtA[i] = 0x18
    i++
    if m.Deleted {
        dAtA[i] = 1
    } else {
        dAtA[i] = 0
    }

This is all tickling something in the back of my head. I think there might be a difference here with the reflection-based marshaling. Gah! Ensuring we're exactly the same with C++ is quite the headache.

petermattis commented 4 years ago

Ugh, this gets worse. We currently marshal MVCCMetadata below Raft. Which means we can't change how that encoding is done on the Go side without causing other problems. See kv/kvserver/below_raft_protos_test.go.

sumeerbhola commented 4 years ago

One hack would be to change MVCCValueMerger to have special purpose code to marshal an MVCCMetadata proto that only contains the raw_bytes and merge_timestamp fields.

petermattis commented 4 years ago

I suspect some sort of hack will definitely be necessary. What about marshaling of InternalTimeSeriesData? I see that is also done in pebble_merger.go.

petermattis commented 4 years ago

This is also pointing out a hole in our unit testing. I believe we have unit tests for this merge operator, but we should also have a test that compares the byte output of the RocksDB and Pebble versions.

cockroach-teamcity commented 4 years ago

(roachtest).engine/switch/nodes=3 failed on master@72c4a1bd411f2f82bf9aaa22883821a946614148:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200310-1799071/engine/switch/nodes=3/run_1
    cluster.go:2368,engine_switch.go:126,engine_switch.go:140,test_runner.go:747: error with attached stack trace:
            main.(*monitor).WaitE
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2356
            main.(*monitor).Wait
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2364
            main.registerEngineSwitch.func1
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/engine_switch.go:126
            main.registerEngineSwitch.func2
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/engine_switch.go:140
            main.(*testRunner).runTest.func2
                /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:747
            runtime.goexit
                /usr/local/go/src/runtime/asm_amd64.s:1357
          - monitor failure:
          - unexpected node event: 3: dead

    cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:801: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1799071-1583824469-05-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
        2: 5059
        1: 5185
        3: dead
        Error:  3: dead
More

Artifacts: [/engine/switch/nodes=3](https://teamcity.cockroachdb.com/viewLog.html?buildId=1799071&tab=artifacts#/engine/switch/nodes=3) [See this test on roachdash](https://roachdash.crdb.dev/?filter=status%3Aopen+t%3A.%2Aengine%2Fswitch%2Fnodes%3D3.%2A&sort=title&restgroup=false&display=lastcommented+project) powered by [pkg/cmd/internal/issues](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/internal/issues)

sumeerbhola commented 4 years ago

What about marshaling of InternalTimeSeriesData?

The only optional fields are start_timestamp_nanos and sample_duration_nanos, and it seems we always set them.