As observed on `drt-large`, it can be confusing to users if they encounter unavailable ranges on a cluster that houses region-survivable databases (which default to RF=5) when fewer than 3 nodes have failed. In the case uncovered on `drt-large`, the tsdb was responsible for the unavailable ranges, as it defaults to RF=3 (and ~~doesn't have an accompanying zone config by default~~ will have an explicit zone config after https://github.com/cockroachdb/cockroach/pull/127034 is merged).
If users have made all of their databases region survivable (or have otherwise increased their replication factor from 3 to some larger number), it's likely that they'll also want their tsdb to be able to survive more than one failed node. This issue aims to investigate the cost of running the tsdb with RF=5, to determine whether we should make that the new default.
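For context, the tsdb's replication factor can already be raised manually through the built-in `timeseries` named zone; a minimal sketch of the SQL involved (with RF=5 being the value under investigation here, not yet a recommendation):

```sql
-- Inspect the current zone config for the timeseries range (defaults to RF=3).
SHOW ZONE CONFIGURATION FROM RANGE timeseries;

-- Raise the replication factor to 5 so the tsdb can survive two failed nodes,
-- matching the survivability of region-survivable databases.
ALTER RANGE timeseries CONFIGURE ZONE USING num_replicas = 5;
```

The investigation would quantify the write-amplification and storage cost this change imposes before deciding whether to make it the default.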
Jira issue: CRDB-39275
Epic: CRDB-39958