Closed cockroach-teamcity closed 4 years ago
ERROR: cockroach server exited with error: store <no-attributes>=/mnt/data1/cockroach, last used with cockroach version v19.1-1, is too old for running version v20.1-21 (which requires data from v20.1 or later)
Can confirm what Nathan saw over in #54908. We start the cluster with 20.1.4 (no fixture, so this is a fresh cluster)wait for v20.1 to be "acked" by the cluster, then restart n4 into the 20.2 binary and it complains since its store has only 19.1-1 persisted. Looks like a real problem and I'll try to make the reproduction more specific.
Also, interestingly there's nothing specific about 20.2 here. This is behavior in 20.1 - the 20.2 binary is merely the one pointing out that the cluster version that should have been persisted by the 20.1 binary isn't there.
Hmm, I am looking at v20.1.4 and it seems fine? The code is
And also the code is idempotent (i.e. the second call will never regress the persisted version). Also, we handle the SQL listener only later, in
so the check the roachtest is running to determine that the update is complete should not pass until all stores have 20.1 persisted. Time to take a closer look.
Oh - I think I understand a little bit about what's going on. So 19.1-1 is the MinSupportedVersion of the 20.1 binary. The way this gets written is that the nodes first start up with empty engines and this call
results in cv == v19.1-1
. This then gets fed to the cluster setting at the very top of (*Node).start
:
At this point, the cluster version is not hooked up to the callback that persists stuff to the stores yet. We continue through (*Node).start
and towards the end hit the familiar code
in which the second part does persist the v19.1-1
to disk.
So all nodes should basically sit around with only the v19.1-1 cluster version on disk and nothing else (like Nathan is seeing on the other issue: https://github.com/cockroachdb/cockroach/issues/54908#issuecomment-701772330).
But as of this cycle, roachprod
will init
the first node, which will go here with bootstrapVersion() == v20.1
:
The result is that n1 sets itself up fully at 20.1, and gossips that cluster version out.
This doesn't change the end result though: it looks like one of the other nodes manages to give a 20.1 version on the SQL shell without having persisted the incoming gossip update. However, the gossip update should have been persisted in the second to last snippet above, which we should have passed. I don't know. Looks like I have to patch 20.1 and try to find something I said above that isn't true.
I added the following patch to v20.1.4:
diff --git a/pkg/server/server.go b/pkg/server/server.go
index 9e66bbf66b..7856b6f8b3 100644
--- a/pkg/server/server.go
+++ b/pkg/server/server.go
@@ -1552,6 +1552,10 @@ func (s *Server) Start(ctx context.Context) error {
return errors.Wrap(err, "inspecting engines")
}
+ if len(bootstrappedEngines) > 0 && cv.Version != s.cfg.Settings.Version.BinaryVersion() {
+ return errors.Errorf("have %v but version is %s, not %s", bootstrappedEngines, cv.Version, s.cfg.Settings.Version.BinaryVersion())
+ }
+
// Record a walltime that is lower than the lowest hlc timestamp this current
// instance of the node can use. We do not use startTime because it is lower
// than the timestamp used to create the bootstrap schema.
and only these changed steps in the roachtest (from master):
u := newVersionUpgradeTest(c,
uploadVersion(allNodes, mainVersion),
startVersion(allNodes, mainVersion),
waitForUpgradeStep(allNodes),
func(ctx context.Context, t *test, u *versionUpgradeTest) {
u.c.Stop(ctx)
u.c.Start(ctx, t)
},
I believe this covers the sequence of events that leads to the crash. However - at least locally I'm not readily able to repro this. Might be time to take a step back and repro the original thing, like Nathan did.
According to the CRDB logs, when 20.1.4 gets killed, it has indeed gone way past node.start, as it obviously needed to in order to serve SQL connections. Everything here is pointing to a bug either in how
works, or clusterversion.SetBeforeChange
isn't actually working as advertised in all of the cases.
Yay! It does repro with my boiled down single binary repro:
E201002 14:04:51.684348 1 cli/error.go:373 ERROR: cockroach server exited with error: have [<no-attributes>=/mnt/data1/cockroach] but version is 19.1-1, not 20.1
Time to turn off my brain and to indulge in some printf debugging.
BTW this took approx ~30 runs. The runs are only a few minutes in total so it's doable
I got another repro. Eerily often (five out of five), I see that n4 is pinned for the upgrade when it fails. I wonder if this is a simple one-off in the roachtest version check.
Heh, wishful thinking, squashed right away. My printfs tell a different story:
W201002 14:29:08.703467 43 clusterversion/setting.go:113 [n?] TBG version setting initialized to 19.1-1
TBG setting version via gossipI201002 14:29:08.703481 43 server/node.go:649 [n?] connecting to gossip network to verify cluster ID...
W201002 14:29:08.720968 43 clusterversion/setting.go:166 [n4] TBG setBeforeChange
W201002 14:29:08.721177 43 kv/kvserver/stores.go:311 [n4] TBG wrote 19.1-1 to <no-attributes>=/mnt/data1/cockroach
W201002 14:29:08.721314 43 kv/kvserver/stores.go:311 [n4] TBG wrote 19.1-1 to <no-attributes>=/mnt/data1/cockroach
W201002 14:29:08.723202 209 clusterversion/setting.go:267 TBG BeforeChange called with 20.1, cb=0x30ca110
W201002 14:29:08.723405 209 kv/kvserver/stores.go:311 TBG wrote 20.1 to <no-attributes>=/mnt/data1/cockroach
W201002 14:29:08.723554 209 kv/kvserver/stores.go:311 TBG wrote 20.1 to <no-attributes>=/mnt/data1/cockroach
W201002 14:29:08.733935 220 sql/show_cluster_setting.go:87 [n4,intExec=show-version] TBG show cluster setting = 20.1
W201002 14:29:10.204854 276 sql/show_cluster_setting.go:87 [n4,client=80.142.226.162:54864,hostnossl,user=root] TBG show cluster setting = 20.1
Either the storage engine drops the 20.1 or there is a race and we're in fact clobbering these writes for 19.1 and 20.1 (they're kind of close together, but still.
Hrm. This was going so well, and now without any change that I can pinpoint I'm 100+ iters in without a repro.
Finally got another reproduction. It seems to confirm what I said before. We are bumping the setting to 20.1, then restarting (goroutine 66 is the first log line after restart), and then reading 19.1-1 from disk even though we've previously persisted 20.1.
W201002 15:37:55.775407 16 clusterversion/setting.go:113 [n?] TBG version setting initialized to 19.1-1
TBG setting version via gossipI201002 15:37:55.775451 16 server/node.go:669 [n?] node connected via gossip and verified as part of cluster "3abf37d4-a236-4020-88f9-775731ff27a5"
W201002 15:37:55.789645 16 kv/kvserver/store.go:2633 TBG read 19.1-1 from <no-attributes>=/mnt/data1/cockroach
W201002 15:37:55.790053 16 clusterversion/setting.go:166 [n4] TBG setBeforeChange
W201002 15:37:55.790091 16 kv/kvserver/store.go:2633 [n4] TBG read 19.1-1 from <no-attributes>=/mnt/data1/cockroach
W201002 15:37:55.790227 16 kv/kvserver/stores.go:311 [n4] TBG wrote 19.1-1 to <no-attributes>=/mnt/data1/cockroach
W201002 15:37:55.790344 16 kv/kvserver/stores.go:311 [n4] TBG wrote 19.1-1 to <no-attributes>=/mnt/data1/cockroach
W201002 15:37:55.792531 189 clusterversion/setting.go:267 TBG BeforeChange called with 20.1, cb=0x30ca230
W201002 15:37:55.792603 189 kv/kvserver/store.go:2633 TBG read 19.1-1 from <no-attributes>=/mnt/data1/cockroach
W201002 15:37:55.792786 189 kv/kvserver/stores.go:311 TBG wrote 20.1 to <no-attributes>=/mnt/data1/cockroach
W201002 15:37:55.792917 189 kv/kvserver/stores.go:311 TBG wrote 20.1 to <no-attributes>=/mnt/data1/cockroach
W201002 15:37:55.803241 127 sql/show_cluster_setting.go:87 [n4,intExec=show-version] TBG show cluster setting = 20.1
W201002 15:37:57.277901 256 sql/show_cluster_setting.go:87 [n4,client=80.142.226.162:56580,hostnossl,user=root] TBG show cluster setting = 20.1
W201002 15:38:03.298722 66 kv/kvserver/store.go:2633 [n?] TBG read 19.1-1 from <no-attributes>=/mnt/data1/cockroach
W201002 15:38:03.298804 66 kv/kvserver/stores.go:311 [n?] TBG wrote 19.1-1 to <no-attributes>=/mnt/data1/cockroach
W201002 15:38:03.790508 66 kv/kvserver/store.go:2633 [n?] TBG read 19.1-1 from <no-attributes>=/mnt/data1/cockroach
W201002 15:38:03.790627 66 kv/kvserver/stores.go:311 [n?] TBG wrote 19.1-1 to <no-attributes>=/mnt/data1/cockroach
The logs here are a bit messy because the cluster setting methods often "write twice" to make sure that in the presence of multiple engines, the version read from one is always written back to them all. (There's only one engine here, but still). The narrated story is basically:
@petermattis I have to ask - how unlikely is it that there's a bug here where RocksDB drops a write? We are killing the process very close to when this write occurs.
The logs don't leave much room for other explanations. (This is 20.1.4, btw). Can I rule this out (or confirm it) via COCKROACH_STORAGE_ENGINE
?
my 20.1 binary code is here; https://github.com/cockroachdb/cockroach/compare/v20.1.4...tbg:release-20.1-repro-replicate-wide roachtest is here: https://github.com/cockroachdb/cockroach/compare/master...tbg:replicate-wide?expand=1
I was invoking via
./bin/roachtest run decommission/mixed-versions --cockroach ~/go-20.1/src/github.com/cockroachdb/cockroach/cockroach-linux-2.6.32-gnu-amd64 --workload bin/workload --roachprod bin/roachprod --count 30
and most recently needed 4-5 of these to get a single failure.
@petermattis I have to ask - how unlikely is it that there's a bug here where RocksDB drops a write? We are killing the process very close to when this write occurs. The logs don't leave much room for other explanations. (This is 20.1.4, btw). Can I rule this out (or confirm it) via COCKROACH_STORAGE_ENGINE?
I think it is very unlikely that RocksDB is dropping a sync'd write operation. But as I wrote this I wondered if we're actually marking the operation as synced. Looks like the answer is no!
DBStatus DBImpl::Put(DBKey key, DBSlice value) {
rocksdb::WriteOptions options;
return ToDBStatus(rep->Put(options, EncodeKey(key), ToSlice(value)));
}
The default for WriteOptions.sync
is false
. Same problem exists for Merge
, Delete
, SingleDelete
, and DeleteRange
. Note that this only affects operations that are being performed directly on a storage.RocksDB
. When using a batch we are explicitly passing a parameter indicating whether we want syncing or not.
I think this explains what is going on here. If you want to verify this theory, change DBImpl::Put
to specify options.sync = true
.
How often are we using a raw storage.RocksDB
to perform write operations? It is certainly true of WriteClusterVersionToEngines
.
A failure to sync the WAL during this bootstrapping logic would explain why I only saw two keys in the LSM over in https://github.com/cockroachdb/cockroach/issues/54908#issuecomment-701772330.
I'm out for the night, but I'm glad we're closing in here. I won't be back until Tuesday, so if anyone wants to seal the deal - all yours.
(roachtest).decommission/mixed-versions failed on master@602aa34aca935b5dc17e2cde6c71b672866ec770:
The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/mixed-versions/run_1
mixed_version_decommission.go:248,versionupgrade.go:190,mixed_version_decommission.go:121,decommission.go:73,test_runner.go:755: expected to find 1 node with membership=decommissioning, found 0
(1) attached stack trace
-- stack trace:
| main.checkOneMembership.func1.1
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/mixed_version_decommission.go:237
| github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:197
| main.checkOneMembership.func1
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/mixed_version_decommission.go:228
| main.(*versionUpgradeTest).run
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/versionupgrade.go:190
| main.runDecommissionMixedVersions
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/mixed_version_decommission.go:121
| main.registerDecommission.func3
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/decommission.go:73
| main.(*testRunner).runTest.func2
| /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:755
| runtime.goexit
| /usr/local/go/src/runtime/asm_amd64.s:1357
Wraps: (2) expected to find 1 node with membership=decommissioning, found 0
Error types: (1) *withstack.withStack (2) *errutil.leafError
Artifacts: [/decommission/mixed-versions](https://teamcity.cockroachdb.com/viewLog.html?buildId=2341039&tab=artifacts#/decommission/mixed-versions) Related: - #54908 roachtest: decommission/mixed-versions failed [C-test-failure](https://api.github.com/repos/cockroachdb/cockroach/labels/C-test-failure) [O-roachtest](https://api.github.com/repos/cockroachdb/cockroach/labels/O-roachtest) [O-robot](https://api.github.com/repos/cockroachdb/cockroach/labels/O-robot) [branch-release-20.2](https://api.github.com/repos/cockroachdb/cockroach/labels/branch-release-20.2) [release-blocker](https://api.github.com/repos/cockroachdb/cockroach/labels/release-blocker) [See this test on roachdash](https://roachdash.crdb.dev/?filter=status%3Aopen+t%3A.%2Adecommission%2Fmixed-versions.%2A&sort=title&restgroup=false&display=lastcommented+project) powered by [pkg/cmd/internal/issues](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/internal/issues)
(roachtest).decommission/mixed-versions failed on master@0c12d8b101a853f441926ad758c4c2882ecd33ee:
More
Artifacts: [/decommission/mixed-versions](https://teamcity.cockroachdb.com/viewLog.html?buildId=2328528&tab=artifacts#/decommission/mixed-versions) [See this test on roachdash](https://roachdash.crdb.dev/?filter=status%3Aopen+t%3A.%2Adecommission%2Fmixed-versions.%2A&sort=title&restgroup=false&display=lastcommented+project) powered by [pkg/cmd/internal/issues](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/internal/issues)