Currently, the allocator simulator assesses the allocator’s behavior only under
explicit, user-defined conditions. This limits how many complex scenarios the
tests can expose.
This issue tracks work that needs to be done to integrate randomness into the
simulator framework, eliminating manual test setup and explicit assertions.
[x] Extend the gen_cluster command to support randomized parameters for zone
configurations (including voter and non-voter constraints, localities) and a random number of nodes and stores per node.
[x] Add randomness to initial node placement across clusters
[x] Extend gen_ranges, gen_load to support randomized parameters
[ ] Scope out what assertions would be interesting based on the random cluster
setup
[x] Ensure consistency on assertion evaluation
[ ] Extend set_liveness to support randomized node status change
[x] Currently, it is hard to verify stats without using a plot or assertion.
Generating text-based statistics would make verification easier.
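To make the checklist above concrete, here is a minimal sketch of seeded cluster generation with a text summary. It is written in Python for brevity and is not the simulator's code: `Node`, `gen_random_cluster`, `describe`, the region names, and the default bounds are all hypothetical names chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical, simplified node description (not the simulator's type).
    node_id: int
    region: str
    zone: str
    stores: int


def gen_random_cluster(seed, max_nodes=9, max_stores_per_node=3,
                       regions=("us-east", "us-west", "eu-west"),
                       zones_per_region=3):
    """Generate a random cluster layout that is fully determined by `seed`."""
    rng = random.Random(seed)  # private RNG: no dependence on global state
    num_nodes = rng.randint(1, max_nodes)
    nodes = []
    for node_id in range(1, num_nodes + 1):
        region = rng.choice(regions)
        zone = f"{region}-{rng.randint(1, zones_per_region)}"
        stores = rng.randint(1, max_stores_per_node)
        nodes.append(Node(node_id, region, zone, stores))
    return nodes


def describe(nodes):
    """Render the cluster as plain text so it can be checked without a plot."""
    lines = [f"nodes={len(nodes)} stores={sum(n.stores for n in nodes)}"]
    for n in nodes:
        lines.append(
            f"  n{n.node_id}: region={n.region} zone={n.zone} stores={n.stores}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe(gen_random_cluster(seed=42)))
```

Because the generator owns its `random.Random(seed)` instance, the same seed always reproduces the same layout, and the text summary can be diffed or logged alongside a failing run.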
Future work:
Add new types of assertions for zone configuration, range distribution,
node liveness, load balance, and when multiple samples are run
Simulate network partition, high latency, load-related latency, and
randomized inter-region latency based on node proximity
Different gossip delay for different regions
Anticipated difficulties:
Test setup and flakiness: randomized testing could lead to complex and
unexpected test setups. It could be hard to validate the generated input.
Some ideas to help alleviate these issues:
Use a seed number for every randomization
Record test inputs for every failed test
Write unit tests on things to make them more deterministic
[ ] given a zone configuration and a cluster setup, return whether the zone configuration is satisfiable
this could be expensive if we run it on every generated config
[ ] start with a common zone configuration and tweak it a bit (some use cases: demote or promote a node; demote or promote a voter or a non-voter from one region to another, or from one zone to another)
[ ] add a case where we just cycle through a bunch of random events within a short period of time and check for final state conformance
[ ] random node liveness change (initially or during simulation)
[ ] add more nodes and localities -> satisfiable zone configurations should remain satisfiable
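The satisfiability check and the "adding nodes keeps configurations satisfiable" property above can be sketched as follows. This is a deliberately simplified model, not CockroachDB's constraint semantics: it assumes one replica per node, only positive per-region replica counts, and no distinction between voters and non-voters. `is_satisfiable` and `check_adding_nodes_preserves_satisfiability` are hypothetical names. The loop also illustrates the seed-per-run idea by returning the seeds of any failing cases so they can be replayed.

```python
import random
from collections import Counter


def is_satisfiable(node_regions, num_replicas, required):
    """Simplified check: can `num_replicas` replicas be placed, one per node,
    with at least `required[region]` replicas in each listed region?

    `node_regions` holds one region name per node (assumed model).
    """
    if sum(required.values()) > num_replicas:
        return False  # constraints ask for more replicas than exist
    if len(node_regions) < num_replicas:
        return False  # not enough nodes for one replica per node
    available = Counter(node_regions)  # missing regions count as 0
    return all(available[region] >= count for region, count in required.items())


def check_adding_nodes_preserves_satisfiability(seeds, regions=("a", "b", "c")):
    """Return the seeds for which growing the cluster broke satisfiability."""
    failing = []
    for seed in seeds:
        rng = random.Random(seed)  # one seed per case, so failures can be replayed
        cluster = [rng.choice(regions) for _ in range(rng.randint(1, 8))]
        num_replicas = rng.randint(1, 5)
        required = {region: rng.randint(0, 2) for region in regions}
        before = is_satisfiable(cluster, num_replicas, required)
        grown = cluster + [rng.choice(regions) for _ in range(rng.randint(1, 3))]
        after = is_satisfiable(grown, num_replicas, required)
        if before and not after:
            failing.append(seed)  # record the input that exposed the failure
    return failing


if __name__ == "__main__":
    print("failing seeds:", check_adding_nodes_preserves_satisfiability(range(1000)))
```

Under this model the check is a few counts per configuration, so running it on every generated config is cheap. The real check is likely more expensive because it has to handle store-level attributes, prohibitive constraints, and separate voter constraints.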
Idea that didn't make it:
explore all possible zone configurations given a cluster setup and run checks on all of them
this sounds nice, but it seems too aggressive, and some configurations may be too rare to occur in practice
generate both constraints that sum up to num_replicas and constraints that do not
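One possible way to generate both kinds of constraint sets is sketched below. It assumes a constraint set can be modeled as a per-region replica count and reads "do not sum up" as any total different from num_replicas (smaller or larger). `gen_constraint_counts` is a hypothetical helper, not part of the simulator.

```python
import random


def gen_constraint_counts(rng, regions, num_replicas, sums_to_num_replicas):
    """Return a dict of region -> required replica count.

    If `sums_to_num_replicas` is True, the counts sum to `num_replicas`.
    Otherwise the total is drawn from 0..2*num_replicas, excluding num_replicas.
    Assumes num_replicas >= 1 and a non-empty sequence of regions.
    """
    if sums_to_num_replicas:
        total = num_replicas
    else:
        candidates = [t for t in range(2 * num_replicas + 1) if t != num_replicas]
        total = rng.choice(candidates)
    counts = {region: 0 for region in regions}
    for _ in range(total):
        counts[rng.choice(regions)] += 1  # hand out one replica at a time
    return counts


if __name__ == "__main__":
    rng = random.Random(7)
    print(gen_constraint_counts(rng, ["a", "b", "c"], 5, True))
    print(gen_constraint_counts(rng, ["a", "b", "c"], 5, False))
```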
Issues: https://github.com/cockroachdb/cockroach/issues/106311
Potentially useful libraries:
Note that this issue just outlines potential project directions. Some ideas might be out of scope of this project.
Jira issue: CRDB-29441