
roachtest: decommission/mixed-versions failed #54906

Closed by cockroach-teamcity 4 years ago

cockroach-teamcity commented 4 years ago

(roachtest).decommission/mixed-versions failed on master@0c12d8b101a853f441926ad758c4c2882ecd33ee:

          |   | *
          |   | * WARNING: RUNNING IN INSECURE MODE!
          |   | * 
          |   | * - Your cluster is open for any client that can access <all your IP addresses>.
          |   | * - Any user, even root, can log in without providing a password.
          |   | * - Any user, connecting as root, can read or write any data in your cluster.
          |   | * - There is no network encryption nor authentication, and thus no confidentiality.
          |   | * 
          |   | * Check out how to secure your cluster: https://www.cockroachlabs.com/docs/v20.2/secure-a-cluster.html
          |   | *
          |   | *
          |   | * ERROR: ERROR: cockroach server exited with error: store <no-attributes>=/mnt/data1/cockroach, last used with cockroach version v19.1-1, is too old for running version v20.1-21 (which requires data from v20.1 or later)
          |   | *
          |   | ERROR: cockroach server exited with error: store <no-attributes>=/mnt/data1/cockroach, last used with cockroach version v19.1-1, is too old for running version v20.1-21 (which requires data from v20.1 or later)
          |   | Failed running "start"
          |   | E200929 06:22:50.736037 1 cli/error.go:398  ERROR: exit status 1
          |   | ERROR: exit status 1
          |   | Failed running "start"
          | Wraps: (3) exit status 1
          | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *exec.ExitError: 
          | I200929 06:22:50.776718 1 cluster_synced.go:1733  command failed
        Wraps: (2) exit status 1
        Error types: (1) *main.withCommandDetails (2) *exec.ExitError

    cluster.go:1654,context.go:135,cluster.go:1643,test_runner.go:824: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2328528-1601360230-09-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 2: 4721
        3: 4639
        4: dead
        1: 4783
        Error: UNCLASSIFIED_PROBLEM: 4: dead
        (1) UNCLASSIFIED_PROBLEM
        Wraps: (2) attached stack trace
          -- stack trace:
          | main.glob..func14
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
          | main.wrap.func1
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
          | github.com/spf13/cobra.(*Command).execute
          |     /home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
          | github.com/spf13/cobra.(*Command).ExecuteC
          |     /home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
          | github.com/spf13/cobra.(*Command).Execute
          |     /home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
          | main.main
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
          | runtime.main
          |     /usr/local/go/src/runtime/proc.go:203
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1357
        Wraps: (3) 4: dead
        Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError

Artifacts: [/decommission/mixed-versions](https://teamcity.cockroachdb.com/viewLog.html?buildId=2328528&tab=artifacts#/decommission/mixed-versions) [See this test on roachdash](https://roachdash.crdb.dev/?filter=status%3Aopen+t%3A.%2Adecommission%2Fmixed-versions.%2A&sort=title&restgroup=false&display=lastcommented+project) powered by [pkg/cmd/internal/issues](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/internal/issues)

nvanbenschoten commented 4 years ago
ERROR: cockroach server exited with error: store <no-attributes>=/mnt/data1/cockroach, last used with cockroach version v19.1-1, is too old for running version v20.1-21 (which requires data from v20.1 or later)
tbg commented 4 years ago

Can confirm what Nathan saw over in #54908. We start the cluster with 20.1.4 (no fixture, so this is a fresh cluster), wait for v20.1 to be "acked" by the cluster, then restart n4 into the 20.2 binary, and it complains since its store only has 19.1-1 persisted. Looks like a real problem; I'll try to make the reproduction more specific.

tbg commented 4 years ago

Also, interestingly, there's nothing specific to 20.2 here. This is 20.1 behavior; the 20.2 binary is merely the one pointing out that the cluster version that should have been persisted by the 20.1 binary isn't there.

tbg commented 4 years ago

Hmm, I am looking at v20.1.4 and it seems fine? The code is

https://github.com/cockroachdb/cockroach/blob/b9ed5dfa3772d36fb548d52d5093022ce10411a2/pkg/server/node.go#L492-L501

The code is also idempotent (i.e. a second call will never regress the persisted version). Also, we handle the SQL listener only later, in

https://github.com/cockroachdb/cockroach/blob/eb054ecf72575951ab899647d1ab4673d0ad0971/pkg/server/server.go#L1626

so the check the roachtest is running to determine that the update is complete should not pass until all stores have 20.1 persisted. Time to take a closer look.
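
For reference, the check in question boils down to polling `SHOW CLUSTER SETTING version` on every node until each one reports the new version. The sketch below is a rough, hypothetical Go rendition of that kind of check (the helper name, connection URLs, and retry policy are made up for illustration; this is not the actual roachtest waitForUpgradeStep code):

// A rough, hypothetical sketch of the roachtest-style version check.
package main

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func waitForClusterVersion(urls []string, want string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		allUpgraded := true
		for _, url := range urls {
			db, err := sql.Open("postgres", url)
			if err != nil {
				return err
			}
			var v string
			err = db.QueryRow(`SHOW CLUSTER SETTING version`).Scan(&v)
			db.Close()
			if err != nil || v != want {
				allUpgraded = false
				break
			}
		}
		if allUpgraded {
			return nil // every node reports the new version
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("cluster did not reach version %s within %s", want, timeout)
		}
		time.Sleep(time.Second)
	}
}

func main() {
	urls := []string{"postgresql://root@localhost:26257?sslmode=disable"}
	if err := waitForClusterVersion(urls, "20.1", time.Minute); err != nil {
		fmt.Println(err)
	}
}

Note that passing this kind of check only proves what each node reports in memory, not what has been durably written to its stores.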

tbg commented 4 years ago

Oh - I think I understand a little bit about what's going on. So 19.1-1 is the MinSupportedVersion of the 20.1 binary. The way this gets written is that the nodes first start up with empty engines and this call

https://github.com/cockroachdb/cockroach/blob/eb054ecf72575951ab899647d1ab4673d0ad0971/pkg/server/server.go#L1546-L1553

results in cv == v19.1-1. This then gets fed to the cluster setting at the very top of (*Node).start:

https://github.com/cockroachdb/cockroach/blob/b9ed5dfa3772d36fb548d52d5093022ce10411a2/pkg/server/node.go#L350-L352

At this point, the cluster version is not hooked up to the callback that persists stuff to the stores yet. We continue through (*Node).start and towards the end hit the familiar code

https://github.com/cockroachdb/cockroach/blob/b9ed5dfa3772d36fb548d52d5093022ce10411a2/pkg/server/node.go#L497-L501

in which the second part does persist the v19.1-1 to disk.

So all nodes should basically sit around with only the v19.1-1 cluster version on disk and nothing else (like Nathan is seeing on the other issue: https://github.com/cockroachdb/cockroach/issues/54908#issuecomment-701772330).
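
As a toy illustration of that flow (made-up types and names, not the real kv/server code): with fresh stores nothing is persisted, so the synthesized version falls back to the binary's minimum supported version, which is then what gets written back out.

// Toy sketch of synthesizing the cluster version from the stores.
package main

import "fmt"

// version stands in for a cluster version like 19.1-1 or 20.1.
type version struct{ major, minor, internal int }

func (v version) less(o version) bool {
	if v.major != o.major {
		return v.major < o.major
	}
	if v.minor != o.minor {
		return v.minor < o.minor
	}
	return v.internal < o.internal
}

// engine stands in for a store; persisted == nil means a fresh store.
type engine struct{ persisted *version }

// synthesize returns the maximum persisted version across stores, or the
// binary's minimum supported version if nothing has been persisted yet
// (the fresh-cluster case here, yielding 19.1-1 for a 20.1 binary).
func synthesize(engines []*engine, minSupported version) version {
	cv := minSupported
	found := false
	for _, e := range engines {
		if e.persisted == nil {
			continue
		}
		if !found || cv.less(*e.persisted) {
			cv = *e.persisted
		}
		found = true
	}
	return cv
}

func main() {
	fresh := []*engine{{}, {}}                        // empty stores: nothing persisted
	fmt.Println(synthesize(fresh, version{19, 1, 1})) // prints {19 1 1}
}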

But as of this cycle, roachprod will init the first node, which will go here with bootstrapVersion() == v20.1:

https://github.com/cockroachdb/cockroach/blob/eb054ecf72575951ab899647d1ab4673d0ad0971/pkg/server/server.go#L1533-L1538

The result is that n1 sets itself up fully at 20.1, and gossips that cluster version out.

This doesn't change the end result, though: it looks like one of the other nodes manages to report a 20.1 version over the SQL shell without having persisted the incoming gossip update. However, the gossip update should have been persisted by the second-to-last snippet above, which we should have passed. I don't know. Looks like I have to patch 20.1 and try to find which of the statements above isn't true.
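
For reference, the contract being assumed of SetBeforeChange is roughly the following. This is a hypothetical sketch with invented names, not the actual clusterversion package; only the ordering matters: persist the incoming version to the stores first, and only then let the in-memory setting advance.

// Hypothetical sketch of the persist-before-update contract.
package main

import (
	"fmt"
	"sync"
)

type version struct{ major, minor, internal int }

func (v version) less(o version) bool {
	if v.major != o.major {
		return v.major < o.major
	}
	if v.minor != o.minor {
		return v.minor < o.minor
	}
	return v.internal < o.internal
}

type versionSetting struct {
	mu           sync.Mutex
	active       version
	beforeChange func(version) error // e.g. persist to every store
}

func (s *versionSetting) setBeforeChange(fn func(version) error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.beforeChange = fn
}

// onGossipUpdate is what a gossip callback would call with a newer version.
func (s *versionSetting) onGossipUpdate(newV version) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.active.less(newV) {
		return nil // not an upgrade, nothing to do
	}
	if s.beforeChange != nil {
		// Persist first; refuse to advance the in-memory version on failure.
		if err := s.beforeChange(newV); err != nil {
			return err
		}
	}
	s.active = newV
	return nil
}

func main() {
	s := &versionSetting{active: version{19, 1, 1}}
	s.setBeforeChange(func(v version) error {
		fmt.Println("persisting", v, "to all stores before exposing it")
		return nil
	})
	_ = s.onGossipUpdate(version{20, 1, 0})
	fmt.Println("active:", s.active) // active: {20 1 0}
}

If that ordering holds, a node should never report a version over SQL that a subsequent restart can't read back from its stores, which is exactly the invariant that appears to be violated here.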

tbg commented 4 years ago

I added the following patch to v20.1.4:

diff --git a/pkg/server/server.go b/pkg/server/server.go
index 9e66bbf66b..7856b6f8b3 100644
--- a/pkg/server/server.go
+++ b/pkg/server/server.go
@@ -1552,6 +1552,10 @@ func (s *Server) Start(ctx context.Context) error {
                return errors.Wrap(err, "inspecting engines")
        }

+       if len(bootstrappedEngines) > 0 && cv.Version != s.cfg.Settings.Version.BinaryVersion() {
+               return errors.Errorf("have %v but version is %s, not %s", bootstrappedEngines, cv.Version, s.cfg.Settings.Version.BinaryVersion())
+       }
+
        // Record a walltime that is lower than the lowest hlc timestamp this current
        // instance of the node can use. We do not use startTime because it is lower
        // than the timestamp used to create the bootstrap schema.

and only these changed steps in the roachtest (from master):

    u := newVersionUpgradeTest(c,
        uploadVersion(allNodes, mainVersion),
        startVersion(allNodes, mainVersion),
        waitForUpgradeStep(allNodes),
        func(ctx context.Context, t *test, u *versionUpgradeTest) {
            u.c.Stop(ctx)
            u.c.Start(ctx, t)
        },

I believe this covers the sequence of events that leads to the crash. However, at least locally, I'm not readily able to repro this. Might be time to take a step back and repro the original thing, like Nathan did.

tbg commented 4 years ago

According to the CRDB logs, when 20.1.4 gets killed, it has indeed gone way past node.start, as it obviously needed to in order to serve SQL connections. Everything here is pointing to a bug either in how

https://github.com/cockroachdb/cockroach/blob/b9ed5dfa3772d36fb548d52d5093022ce10411a2/pkg/server/node.go#L497-L501

works, or in clusterversion.SetBeforeChange not actually working as advertised in all cases.

tbg commented 4 years ago

Yay! It does repro with my boiled down single binary repro:

E201002 14:04:51.684348 1 cli/error.go:373  ERROR: cockroach server exited with error: have [<no-attributes>=/mnt/data1/cockroach] but version is 19.1-1, not 20.1

Time to turn off my brain and to indulge in some printf debugging.

tbg commented 4 years ago

BTW, this took roughly 30 runs. The runs only take a few minutes in total, so it's doable.

tbg commented 4 years ago

I got another repro. Eerily often (five out of five), I see that n4 is pinned for the upgrade when it fails. I wonder if this is a simple one-off in the roachtest version check.

tbg commented 4 years ago

Heh, wishful thinking, squashed right away. My printfs tell a different story:

W201002 14:29:08.703467 43 clusterversion/setting.go:113  [n?] TBG version setting initialized to 19.1-1
TBG setting version via gossipI201002 14:29:08.703481 43 server/node.go:649  [n?] connecting to gossip network to verify cluster ID...
W201002 14:29:08.720968 43 clusterversion/setting.go:166  [n4] TBG setBeforeChange
W201002 14:29:08.721177 43 kv/kvserver/stores.go:311  [n4] TBG wrote 19.1-1 to <no-attributes>=/mnt/data1/cockroach
W201002 14:29:08.721314 43 kv/kvserver/stores.go:311  [n4] TBG wrote 19.1-1 to <no-attributes>=/mnt/data1/cockroach
W201002 14:29:08.723202 209 clusterversion/setting.go:267  TBG BeforeChange called with 20.1, cb=0x30ca110
W201002 14:29:08.723405 209 kv/kvserver/stores.go:311  TBG wrote 20.1 to <no-attributes>=/mnt/data1/cockroach
W201002 14:29:08.723554 209 kv/kvserver/stores.go:311  TBG wrote 20.1 to <no-attributes>=/mnt/data1/cockroach
W201002 14:29:08.733935 220 sql/show_cluster_setting.go:87  [n4,intExec=show-version] TBG show cluster setting = 20.1
W201002 14:29:10.204854 276 sql/show_cluster_setting.go:87  [n4,client=80.142.226.162:54864,hostnossl,user=root] TBG show cluster setting = 20.1

Either the storage engine drops the 20.1, or there is a race and we're in fact clobbering these writes for 19.1 and 20.1 (they're kind of close together, but still).

tbg commented 4 years ago

Hrm. This was going so well, and now without any change that I can pinpoint I'm 100+ iters in without a repro.

tbg commented 4 years ago

Finally got another reproduction. It seems to confirm what I said before. We bump the setting to 20.1, then restart (goroutine 66 emits the first log line after the restart), and then read 19.1-1 back from disk even though we previously persisted 20.1.

W201002 15:37:55.775407 16 clusterversion/setting.go:113  [n?] TBG version setting initialized to 19.1-1
TBG setting version via gossipI201002 15:37:55.775451 16 server/node.go:669  [n?] node connected via gossip and verified as part of cluster "3abf37d4-a236-4020-88f9-775731ff27a5"
W201002 15:37:55.789645 16 kv/kvserver/store.go:2633  TBG read 19.1-1 from <no-attributes>=/mnt/data1/cockroach
W201002 15:37:55.790053 16 clusterversion/setting.go:166  [n4] TBG setBeforeChange
W201002 15:37:55.790091 16 kv/kvserver/store.go:2633  [n4] TBG read 19.1-1 from <no-attributes>=/mnt/data1/cockroach
W201002 15:37:55.790227 16 kv/kvserver/stores.go:311  [n4] TBG wrote 19.1-1 to <no-attributes>=/mnt/data1/cockroach
W201002 15:37:55.790344 16 kv/kvserver/stores.go:311  [n4] TBG wrote 19.1-1 to <no-attributes>=/mnt/data1/cockroach
W201002 15:37:55.792531 189 clusterversion/setting.go:267  TBG BeforeChange called with 20.1, cb=0x30ca230
W201002 15:37:55.792603 189 kv/kvserver/store.go:2633  TBG read 19.1-1 from <no-attributes>=/mnt/data1/cockroach
W201002 15:37:55.792786 189 kv/kvserver/stores.go:311  TBG wrote 20.1 to <no-attributes>=/mnt/data1/cockroach
W201002 15:37:55.792917 189 kv/kvserver/stores.go:311  TBG wrote 20.1 to <no-attributes>=/mnt/data1/cockroach
W201002 15:37:55.803241 127 sql/show_cluster_setting.go:87  [n4,intExec=show-version] TBG show cluster setting = 20.1
W201002 15:37:57.277901 256 sql/show_cluster_setting.go:87  [n4,client=80.142.226.162:56580,hostnossl,user=root] TBG show cluster setting = 20.1
W201002 15:38:03.298722 66 kv/kvserver/store.go:2633  [n?] TBG read 19.1-1 from <no-attributes>=/mnt/data1/cockroach
W201002 15:38:03.298804 66 kv/kvserver/stores.go:311  [n?] TBG wrote 19.1-1 to <no-attributes>=/mnt/data1/cockroach
W201002 15:38:03.790508 66 kv/kvserver/store.go:2633  [n?] TBG read 19.1-1 from <no-attributes>=/mnt/data1/cockroach
W201002 15:38:03.790627 66 kv/kvserver/stores.go:311  [n?] TBG wrote 19.1-1 to <no-attributes>=/mnt/data1/cockroach

The logs here are a bit messy because the cluster setting methods often "write twice", to make sure that in the presence of multiple engines the version read from one is always written back to all of them (there's only one engine here, but still; see the sketch after the list below). The narrated story is basically:

  1. version setting gets inited to 19.1-1
  2. bootstrapStores writes 19.1-1 to disk (I missed the printf for this, so this is invisible in the log but explains the next step)
  3. synthesizeclusterversion reads the 19.1-1 back (in node.start)
  4. setBeforeChange is populated
  5. callback is invoked manually, read-writes 19.1-1 (i.e. a noop)
  6. gossip update for 20.1 comes in, callback invoked again: reads 19.1-1, writes 20.1
  7. show cluster setting returns 20.1
  8. n4 gets killed and restarted
  9. synthesizeclusterversion reads 19.1-1 from disk (!) and the assertion I added kills the process (after 19.1-1 is put back to disk, etc.)
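
The read-one/write-all pattern mentioned above looks roughly like this (again a toy sketch with invented names; it doesn't reproduce the exact doubled log lines, just the idea that every incoming version is read back and rewritten to every engine):

// Toy sketch of the read-one/write-all ("write twice") pattern.
package main

import "fmt"

type version struct{ major, minor, internal int }

func (v version) less(o version) bool {
	if v.major != o.major {
		return v.major < o.major
	}
	if v.minor != o.minor {
		return v.minor < o.minor
	}
	return v.internal < o.internal
}

type engine struct {
	name      string
	persisted version
}

func (e *engine) read() version { return e.persisted }

func (e *engine) write(v version) {
	e.persisted = v
	fmt.Println("wrote", v, "to", e.name)
}

// writeBackToAll takes the max of the incoming version and everything already
// on disk, and writes that back to every engine so they all agree.
func writeBackToAll(engines []*engine, incoming version) {
	newest := incoming
	for _, e := range engines {
		if v := e.read(); newest.less(v) {
			newest = v
		}
	}
	for _, e := range engines {
		e.write(newest)
	}
}

func main() {
	engines := []*engine{{name: "/mnt/data1/cockroach", persisted: version{19, 1, 1}}}
	writeBackToAll(engines, version{20, 1, 0}) // reads 19.1-1, writes 20.1
}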

@petermattis I have to ask - how unlikely is it that there's a bug here where RocksDB drops a write? We are killing the process very close to when this write occurs. The logs don't leave much room for other explanations. (This is 20.1.4, btw). Can I rule this out (or confirm it) via COCKROACH_STORAGE_ENGINE?

My 20.1 binary code is here: https://github.com/cockroachdb/cockroach/compare/v20.1.4...tbg:release-20.1-repro-replicate-wide and the roachtest is here: https://github.com/cockroachdb/cockroach/compare/master...tbg:replicate-wide?expand=1

I was invoking via


./bin/roachtest run decommission/mixed-versions --cockroach ~/go-20.1/src/github.com/cockroachdb/cockroach/cockroach-linux-2.6.32-gnu-amd64 --workload bin/workload --roachprod bin/roachprod --count 30

and most recently needed 4-5 of these to get a single failure.

petermattis commented 4 years ago

@petermattis I have to ask - how unlikely is it that there's a bug here where RocksDB drops a write? We are killing the process very close to when this write occurs. The logs don't leave much room for other explanations. (This is 20.1.4, btw). Can I rule this out (or confirm it) via COCKROACH_STORAGE_ENGINE?

I think it is very unlikely that RocksDB is dropping a synced write operation. But as I wrote this, I wondered whether we're actually marking the operation as synced. Looks like the answer is no!

DBStatus DBImpl::Put(DBKey key, DBSlice value) {
  rocksdb::WriteOptions options;
  return ToDBStatus(rep->Put(options, EncodeKey(key), ToSlice(value)));
}

The default for WriteOptions.sync is false. The same problem exists for Merge, Delete, SingleDelete, and DeleteRange. Note that this only affects operations performed directly on a storage.RocksDB; when using a batch, we explicitly pass a parameter indicating whether we want syncing or not.

I think this explains what is going on here. If you want to verify this theory, change DBImpl::Put to specify options.sync = true.

How often are we using a raw storage.RocksDB to perform write operations? It is certainly true of WriteClusterVersionToEngines.
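
To illustrate the synced-vs-unsynced distinction in Go terms, here is a small standalone sketch using Pebble purely for illustration (the quoted code above is the RocksDB C++ wrapper from the v20.1 c-deps, where the equivalent knob is WriteOptions.sync):

// A small illustration of synced vs. unsynced writes using Pebble.
package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("/tmp/pebble-sync-demo", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// NoSync: acknowledged before the WAL write is fsynced, so the value can
	// be lost if the process dies at the wrong moment.
	if err := db.Set([]byte("cluster-version"), []byte("19.1-1"), pebble.NoSync); err != nil {
		log.Fatal(err)
	}

	// Sync: returns only after the WAL write has been fsynced; a kill
	// immediately afterwards cannot drop it.
	if err := db.Set([]byte("cluster-version"), []byte("20.1"), pebble.Sync); err != nil {
		log.Fatal(err)
	}

	v, closer, err := db.Get([]byte("cluster-version"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("cluster-version =", string(v))
	closer.Close()
}

If DBImpl::Put defaults to sync=false, the cluster version write here behaves like the NoSync case above, which would match the lost 20.1 write seen in the repro.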

nvanbenschoten commented 4 years ago

A failure to sync the WAL during this bootstrapping logic would explain why I only saw two keys in the LSM over in https://github.com/cockroachdb/cockroach/issues/54908#issuecomment-701772330.

tbg commented 4 years ago

I'm out for the night, but I'm glad we're closing in here. I won't be back until Tuesday, so if anyone wants to seal the deal - all yours.

cockroach-teamcity commented 4 years ago

(roachtest).decommission/mixed-versions failed on master@602aa34aca935b5dc17e2cde6c71b672866ec770:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/mixed-versions/run_1
    mixed_version_decommission.go:248,versionupgrade.go:190,mixed_version_decommission.go:121,decommission.go:73,test_runner.go:755: expected to find 1 node with membership=decommissioning, found 0
        (1) attached stack trace
          -- stack trace:
          | main.checkOneMembership.func1.1
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/mixed_version_decommission.go:237
          | github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:197
          | main.checkOneMembership.func1
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/mixed_version_decommission.go:228
          | main.(*versionUpgradeTest).run
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/versionupgrade.go:190
          | main.runDecommissionMixedVersions
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/mixed_version_decommission.go:121
          | main.registerDecommission.func3
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/decommission.go:73
          | main.(*testRunner).runTest.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:755
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1357
        Wraps: (2) expected to find 1 node with membership=decommissioning, found 0
        Error types: (1) *withstack.withStack (2) *errutil.leafError

Artifacts: [/decommission/mixed-versions](https://teamcity.cockroachdb.com/viewLog.html?buildId=2341039&tab=artifacts#/decommission/mixed-versions) Related: - #54908 roachtest: decommission/mixed-versions failed [C-test-failure](https://api.github.com/repos/cockroachdb/cockroach/labels/C-test-failure) [O-roachtest](https://api.github.com/repos/cockroachdb/cockroach/labels/O-roachtest) [O-robot](https://api.github.com/repos/cockroachdb/cockroach/labels/O-robot) [branch-release-20.2](https://api.github.com/repos/cockroachdb/cockroach/labels/branch-release-20.2) [release-blocker](https://api.github.com/repos/cockroachdb/cockroach/labels/release-blocker) [See this test on roachdash](https://roachdash.crdb.dev/?filter=status%3Aopen+t%3A.%2Adecommission%2Fmixed-versions.%2A&sort=title&restgroup=false&display=lastcommented+project) powered by [pkg/cmd/internal/issues](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/internal/issues)