Closed: apatrida closed this 8 years ago
We just discussed it in our weekly Fixit Friday meeting. While this would be more flexible, it also makes things more complicated, whereas the current way of configuring allocation is simple (at least in the simple cases) and enough for most users. So we'd rather keep the current way.
I also want to add that it is extremely difficult to digest issues like this. I can see what you tried to achieve, but we have to break things up into smaller and more digestible solutions. This entire thing seems implementable as a lot of small PRs and issues, deprecating things step by step. That way it's faster to get improvements to the user, get better code reviews, add better APIs, and make things simpler and clearer.
@s1monw First the concept (like this) needs to be accepted as an idea that we want to pursue. After that we can break it down and re-address it as elemental PRs/issues. But it has to be thought through as a whole to make something that actually works. @jpountz says the current system is simple, and it does have some simple parts you can use to make simple things happen, but it also completely fails at making truly scalable topologies for big deployments that need control and currently have absolutely no way to make it work. The current individual control mechanisms do not work together to describe cohesive topologies.
So what I was hoping for here was to get agreement on something along these lines, check out use cases and collect more info, then come back with the building blocks to make it happen. Instead of closing big ideas, there needs to be some way (from outside your internal meetings) to discuss the big things, then come back at them.
@jpountz I'm not sure how simple a combination of the current features really is: replica count, allocation/rack awareness, and auto-expand replicas (e.g. `3-all`), which has a permanent bug #2869 that interferes with the other two, doesn't always seem to respond or do what it says it will, and definitely doesn't when used with the above. With these three features you can end up in error states and with layouts not really close to what you intended. You can't control master distribution, you can't do things like ensuring a full copy of an index is on each node, etc.
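For reference, the combination being discussed looks roughly like this when written out with the current setting names (a sketch only; the values shown are illustrative, expressed as a plain mapping for brevity):

```python
# Illustrative only: the three current mechanisms being combined.
current_settings = {
    # per index: a fixed number of extra copies
    "index.number_of_replicas": 2,
    # per index: grow/shrink copies with the node count, e.g. "3-all"
    "index.auto_expand_replicas": "3-all",
    # cluster wide: spread copies across a node attribute such as zone or rack
    "cluster.routing.allocation.awareness.attributes": "zone",
    # forced awareness across these zone values
    "cluster.routing.allocation.awareness.force.zone.values": "us-east,us-west",
}
```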
With constraints you can state a few simple ones to replace the above "simple" features and have documentation that is one page. With 2 or 3 constraints in identical form you express the same thing, and from there you can add a few more to build wondrous topologies. So there is a simple, easy-to-document version of the above for those simple cases. Simplicity is how you present it and what you show to the simple users; it is a documentation problem. That shouldn't limit writing a feature that allows full control.
I can write a sample constraint solver if you want to see that it is not very complicated code to answer, for a given index and node, what should happen, if that helps.
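To make that offer concrete, here is a minimal sketch (not a proposed implementation, just Python used as illustration) of how a single MIN REPLICAS constraint could be checked against a shard layout. The `Constraint` shape mirrors the `[attributes] VALUE [values] DIRECTIVE operand` form used in the proposal further down; the `nodes`/`shards` structures are hypothetical.

```python
from collections import defaultdict
from fnmatch import fnmatch
from typing import Dict, List, NamedTuple

class Constraint(NamedTuple):
    attrs: List[str]        # e.g. ["zone", "_index"]
    values: List[str]       # e.g. ["*", "*"]; wildcards allowed
    directive: str          # e.g. "REQUIRE MIN REPLICAS"
    operand: object = None  # replica count, index pattern for APPROVE INDEX, etc.

def check_min_replicas(constraint: Constraint,
                       nodes: Dict[str, Dict[str, str]],
                       shards: List[Dict[str, str]]) -> List[str]:
    """List violations of one MIN REPLICAS constraint.

    `nodes` maps node name -> attributes; `shards` is a flat list of allocated
    copies, each like {"index": "Alpha", "shard": "0", "node": "main-node-1"}.
    """
    # Which nodes hold each (index, shard id).
    holders = defaultdict(set)
    for s in shards:
        holders[(s["index"], s["shard"])].add(s["node"])

    # Group nodes by the non-index attributes named in the constraint.
    plain_attrs = [a for a in constraint.attrs if a != "_index"]
    groups = defaultdict(set)
    for name, attrs in nodes.items():
        groups[tuple(attrs.get(a, "") for a in plain_attrs)].add(name)

    violations = []
    for (index, shard_id), on_nodes in holders.items():
        for group_key, group_nodes in groups.items():
            # Reassemble the value tuple in the constraint's attribute order.
            it = iter(group_key)
            full = [index if a == "_index" else next(it) for a in constraint.attrs]
            if not all(fnmatch(v, pat) for v, pat in zip(full, constraint.values)):
                continue
            want = len(group_nodes) if constraint.operand == "_all" else int(constraint.operand)
            count = len(on_nodes & group_nodes)
            if count < want:
                violations.append(f"{index}[{shard_id}] has {count} cop(ies) for "
                                  f"{dict(zip(constraint.attrs, full))}, wants {want}")
    return violations
```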
Sorry @apatrida - I read this through again and it just made my head hurt. While reading your examples I couldn't translate them into what the cluster would actually look like. It feels like reading Sendmail config. Clearly our existing settings are not perfect, but they are at least fairly easy to understand. I'd prefer working on incremental improvements to what we have instead.
@clintongormley maybe another approach is a PR to allow a plugin to decide shard placement and replica count preferences for the core engine, swapping out the current model via plugins. So an advanced topology plugin could take over this decision making.
Has anyone at Elastic actually hit use cases that cannot be done at all with the current settings? I'd be surprised if you have not, because we do regularly, and we hit walls with the current model that prevent improving performance or reliability.
@apatrida may I suggest that you come up with a simple example where you show a concrete set of nodes (node 1, 2, 3, 4...), a concrete set of index shards, and how you want them laid out on your nodes, in the form of "index 1 shard 1 primary, on node 1"? Also please explain why you want them that way. I see you gave quite detailed examples of possible configuration, but they involve many moving pieces and I fail to understand the why. That will help us understand what you are trying to achieve, see through a concrete example why it cannot be done with the current system (as you say there are many theoretical limitations), and also think about potential alternative solutions to let you do what you are trying to do.
Note that while I understand that you are trying to create a system that allows you to do anything, there is a lot of value in a simple system everyone can understand, albeit constrained. We err towards the latter.
The current replica count, auto-expand and rack awareness are really all competing for the same idea of shard safety, and not succeeding. They try to give shard safety but fail to be flexible, and cause either hardship in managing the system, false errors, or situations where things just don't appear to work.
The following would replace these concepts in their entirety:
Introducing "Elasticsearch Topology Constraints"
You can guide or force the distribution of replicas by setting the following constraints on combinations of node attributes. The use of "replica count" below means primary + replica shards as a whole; it is no longer "extra copies".
Node attributes are those defined by the configuration, plus the special attributes `_name`, `_host_ip`, `_publish_ip`, `_ip`, `_host`, `_index`, and `_alias`, where `_index` matches the index name and `_alias` matches any index that has the alias attached. The others are the same as used in the old allocation filtering.

Node attribute values are matched against constraint values using exact values, comma-separated lists, `*` wildcards, and the special value `_all`.
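As a small illustration of the value matching described above (exact values, comma-separated lists, `*` wildcards, and `_all` are my reading of the examples below, so treat this as an assumption):

```python
from fnmatch import fnmatch

def value_matches(attr_value: str, pattern: str) -> bool:
    """Hypothetical matching of one attribute value against one constraint value:
    '_all' and '*' match anything; a comma-separated list matches any entry;
    entries may contain '*' wildcards."""
    if pattern in ("_all", "*"):
        return True
    return any(fnmatch(attr_value, p.strip()) for p in pattern.split(","))

# value_matches("webapp", "webapp, batch")                       -> True
# value_matches("DailySomething-2016.06.18", "DailySomething-*") -> True
# value_matches("main", "webapp, batch")                         -> False
```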
Deprecated Features
These features are deprecated as they are encompassed in Topology Constraints: the per-index replica count, auto-expand replicas, and rack/allocation awareness.
Example
Example Cluster
For all examples below, here is a cluster definition giving each node and the node attributes set. The node attributes used are `zone`, `rack`, and `nodetype`.

- `main-node-1`: zone `us-east`, rack `1a`, nodetype `main`
- `main-node-2`: zone `us-east`, rack `1c`, nodetype `main`
- `main-node-3`: zone `us-east`, rack `1d`, nodetype `main`
- `main-node-4`: zone `us-west`, rack `1a`, nodetype `main`
- `main-node-5`: zone `us-west`, rack `1c`, nodetype `main`
- `main-node-6`: zone `us-west`, rack `1d`, nodetype `main`
- `webapp-node-1`: zone `us-east`, rack `1a`, nodetype `webapp`
- `webapp-node-2`: zone `us-east`, rack `1c`, nodetype `webapp`
- `webapp-node-3`: zone `us-west`, rack `1a`, nodetype `webapp`
- `webapp-node-4`: zone `us-west`, rack `1d`, nodetype `webapp`
- `batch-node-1`: zone `us-east`, rack `1d`, nodetype `batch`

Indexes:

- `Alpha` with no initial replica count setting
- `Beta` with no initial replica count setting
- `ReadHeavy` with no initial replica count setting
- `VeryImportant` with an initial REQUIRE MIN REPLICAS count of `3` defined in index settings

note: the above uses zones that may imply a WAN is present between data centers, and that may be a bad idea for a single cluster, so ignore that and focus on the example and not what those words mean in any given attribute; this is just to make an example that makes sense.
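To keep the use cases below concrete, here is the same example cluster expressed as plain data (the Python shape is only for illustration and matches the hypothetical sketch earlier in this thread):

```python
# The example cluster above: node name -> node attributes.
example_nodes = {
    "main-node-1":   {"zone": "us-east", "rack": "1a", "nodetype": "main"},
    "main-node-2":   {"zone": "us-east", "rack": "1c", "nodetype": "main"},
    "main-node-3":   {"zone": "us-east", "rack": "1d", "nodetype": "main"},
    "main-node-4":   {"zone": "us-west", "rack": "1a", "nodetype": "main"},
    "main-node-5":   {"zone": "us-west", "rack": "1c", "nodetype": "main"},
    "main-node-6":   {"zone": "us-west", "rack": "1d", "nodetype": "main"},
    "webapp-node-1": {"zone": "us-east", "rack": "1a", "nodetype": "webapp"},
    "webapp-node-2": {"zone": "us-east", "rack": "1c", "nodetype": "webapp"},
    "webapp-node-3": {"zone": "us-west", "rack": "1a", "nodetype": "webapp"},
    "webapp-node-4": {"zone": "us-west", "rack": "1d", "nodetype": "webapp"},
    "batch-node-1":  {"zone": "us-east", "rack": "1d", "nodetype": "batch"},
}
example_indexes = ["Alpha", "Beta", "ReadHeavy", "VeryImportant"]
```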
USE CASE 1:
I require each index to have a minimum of 2 replicas, and each zone to have at least 1 replica.

Topology constraints:

```
["_index"]          VALUE ["*"]       REQUIRE MIN REPLICAS 2
["zone", "_index"]  VALUE ["*", "*"]  REQUIRE MIN REPLICAS 1
```
In this case `Alpha`, `Beta`, and `ReadHeavy` would have 2 or more replicas (primary + copies), with one in each of `us-east` and `us-west`, while `VeryImportant` would have 3 replicas (primary + copies) spread across the zones. To avoid this shard split, you can add the constraint:

```
["zone", "_index"]  VALUE ["*", "*"]  REQUIRE COMPLETE SHARDS
```

Now the 3rd replica set will be in one of the two zones; if you want to lean it towards zone `us-east`, then have a higher desired count there:

```
["zone", "_index"]  VALUE ["us-east", "VeryImportant"]  DESIRE MIN REPLICAS 2
```

The planner would check all of the constraints and, given that all the REQUIRED can be satisfied while meeting the DESIRED as well, it would put 2 full copies of the `VeryImportant` index in zone `us-east` and one copy in `us-west` to satisfy the required count of `3` for this index.

Final constraints are therefore:
```
["_index"]          VALUE ["*"]                         REQUIRE MIN REPLICAS 2
["zone", "_index"]  VALUE ["*", "*"]                    REQUIRE MIN REPLICAS 1
["zone", "_index"]  VALUE ["*", "*"]                    REQUIRE COMPLETE SHARDS
["zone", "_index"]  VALUE ["us-east", "VeryImportant"]  DESIRE MIN REPLICAS 2
```
_note: the use of `_index` in each attribute list is for clarity; it could be allowed to be omitted when an index setting is used, therefore assuming `*`, or all indexes._
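As a sanity check of use case 1 (purely illustrative, reusing the hypothetical `Constraint`/`check_min_replicas` sketch and `example_nodes` data from earlier in this thread):

```python
# Use case 1's required constraints in the hypothetical Python form.
required = [
    Constraint(["_index"],         ["*"],      "REQUIRE MIN REPLICAS", 2),
    Constraint(["zone", "_index"], ["*", "*"], "REQUIRE MIN REPLICAS", 1),
]

# A candidate layout for one shard of Alpha: two copies, one per zone.
layout = [
    {"index": "Alpha", "shard": "0", "node": "main-node-1"},  # us-east
    {"index": "Alpha", "shard": "0", "node": "main-node-4"},  # us-west
]

for c in required:
    print(c.directive, check_min_replicas(c, example_nodes, layout))
# Both constraints report no violations; dropping the us-west copy would
# violate both the 2-copy minimum and the per-zone minimum for that shard.
```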
USE CASE 2:
Same as use case 1, but I also desire all primary shards to be in zone `us-east` unless that is not possible.

```
["_index"]          VALUE ["*"]                         REQUIRE MIN REPLICAS 2
["zone", "_index"]  VALUE ["*", "*"]                    REQUIRE MIN REPLICAS 1
["zone", "_index"]  VALUE ["*", "*"]                    REQUIRE COMPLETE SHARDS
["zone", "_index"]  VALUE ["us-east", "VeryImportant"]  DESIRE MIN REPLICAS 2
["zone", "_index"]  VALUE ["us-east", "*"]              DESIRE PRIMARY
```
USE CASE 3:
Same as use case 2, but I also desire that all nodes have replicas of the `VeryImportant` and `ReadHeavy` indexes.

```
["_index"]          VALUE ["*"]                         REQUIRE MIN REPLICAS 2
["zone", "_index"]  VALUE ["*", "*"]                    REQUIRE MIN REPLICAS 1
["zone", "_index"]  VALUE ["*", "*"]                    REQUIRE COMPLETE SHARDS
["zone", "_index"]  VALUE ["us-east", "VeryImportant"]  DESIRE MIN REPLICAS 2
["zone", "_index"]  VALUE ["us-east", "*"]              DESIRE PRIMARY
["_index"]          VALUE ["VeryImportant,ReadHeavy"]   DESIRE MIN REPLICAS _all
```
USE CASE 4:
Same as use case 3, but I also want each individual `rack` to have at least 1 replica of each index for safety.

```
["_index"]                  VALUE ["*"]                         REQUIRE MIN REPLICAS 2
["zone", "_index"]          VALUE ["*", "*"]                    REQUIRE MIN REPLICAS 1
["zone", "_index"]          VALUE ["*", "*"]                    REQUIRE COMPLETE SHARDS
["zone", "_index"]          VALUE ["us-east", "VeryImportant"]  DESIRE MIN REPLICAS 2
["zone", "_index"]          VALUE ["us-east", "*"]              DESIRE PRIMARY
["_index"]                  VALUE ["VeryImportant,ReadHeavy"]   DESIRE MIN REPLICAS _all
["zone", "rack", "_index"]  VALUE ["*", "*", "*"]               DESIRE MIN REPLICAS 1
```
USE CASE 5:
Same as use case 4, but I do not want any indexes on `webapp` or `batch` nodes that are not approved for those nodes. I only want the `ReadHeavy` index on those nodes, and without any primary shards.

```
["_index"]                  VALUE ["*"]                           REQUIRE MIN REPLICAS 2
["zone", "_index"]          VALUE ["*", "*"]                      REQUIRE MIN REPLICAS 1
["zone", "_index"]          VALUE ["*", "*"]                      REQUIRE COMPLETE SHARDS
["zone", "_index"]          VALUE ["us-east", "VeryImportant"]    DESIRE MIN REPLICAS 2
["zone", "_index"]          VALUE ["us-east", "*"]                DESIRE PRIMARY
["_index"]                  VALUE ["VeryImportant,ReadHeavy"]     DESIRE MIN REPLICAS _all
["zone", "rack", "_index"]  VALUE ["*", "*", "*"]                 DESIRE MIN REPLICAS 1
["nodetype", "_index"]      VALUE ["webapp, batch", "*"]          REQUIRE INDEX APPROVAL
["nodetype"]                VALUE ["webapp, batch"]               APPROVE INDEX ReadHeavy
["nodetype", "_index"]      VALUE ["webapp, batch", "ReadHeavy"]  DISALLOW PRIMARY
```

We now have the `ReadHeavy` index spreading some of its shards across `webapp` and `batch` nodes, excluding any primaries. But what we actually want is to try to have a full copy per `webapp` machine, and we surely must have a copy per `batch` machine. The `webapp` machines could query through to the main cluster if they have not yet replicated their copy, but the `batch` machines should not, to avoid pounding the main cluster nodes with heavy querying. So we add these constraints:

```
["nodetype", "_host", "_index"]  VALUE ["webapp", "*", "ReadHeavy"]  DESIRE MIN REPLICAS 1
["nodetype", "_host", "_index"]  VALUE ["batch", "*", "ReadHeavy"]   REQUIRE MIN REPLICAS 1
```
Now we have caused each host to have a full copy of the index for `webapp` and `batch` nodes, where it is a hard requirement for `batch` and soft for `webapp`.

Final constraints are therefore:

```
["_index"]                       VALUE ["*"]                           REQUIRE MIN REPLICAS 2
["zone", "_index"]               VALUE ["*", "*"]                      REQUIRE MIN REPLICAS 1
["zone", "_index"]               VALUE ["*", "*"]                      REQUIRE COMPLETE SHARDS
["zone", "_index"]               VALUE ["us-east", "VeryImportant"]    DESIRE MIN REPLICAS 2
["zone", "_index"]               VALUE ["us-east", "*"]                DESIRE PRIMARY
["_index"]                       VALUE ["VeryImportant,ReadHeavy"]     DESIRE MIN REPLICAS _all
["zone", "rack", "_index"]       VALUE ["*", "*", "*"]                 DESIRE MIN REPLICAS 1
["nodetype", "_index"]           VALUE ["webapp, batch", "*"]          REQUIRE INDEX APPROVAL
["nodetype"]                     VALUE ["webapp, batch"]               APPROVE INDEX ReadHeavy
["nodetype", "_index"]           VALUE ["webapp, batch", "ReadHeavy"]  DISALLOW PRIMARY
["nodetype", "_host", "_index"]  VALUE ["webapp", "*", "ReadHeavy"]    DESIRE MIN REPLICAS 1
["nodetype", "_host", "_index"]  VALUE ["batch", "*", "ReadHeavy"]     REQUIRE MIN REPLICAS 1
```
This is fairly complex, but incredibly powerful.
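A rough sketch of how the two new directives in this use case (INDEX APPROVAL and DISALLOW PRIMARY) could be decided for a single candidate allocation, continuing the hypothetical Python illustration from earlier in the thread (`Constraint` and `value_matches` as sketched there):

```python
def allocation_allowed(node_attrs: dict, index: str, is_primary: bool,
                       constraints: list) -> bool:
    """Return False if placing this shard copy on this node would break a
    REQUIRE INDEX APPROVAL or DISALLOW PRIMARY constraint. Illustrative only."""
    attrs = dict(node_attrs, _index=index)

    def matches(c):
        return all(value_matches(attrs.get(a, ""), v)
                   for a, v in zip(c.attrs, c.values))

    needs_approval = any(c.directive == "REQUIRE INDEX APPROVAL" and matches(c)
                         for c in constraints)
    approved = any(c.directive == "APPROVE INDEX" and matches(c)
                   and value_matches(index, str(c.operand))  # operand holds the index pattern
                   for c in constraints)
    if needs_approval and not approved:
        return False
    if is_primary and any(c.directive == "DISALLOW PRIMARY" and matches(c)
                          for c in constraints):
        return False
    return True

# With the final constraints above: a primary of ReadHeavy on webapp-node-1 is
# rejected, a replica of ReadHeavy is allowed, and any copy of Alpha on the
# webapp/batch nodes is rejected for lack of approval.
```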
USE CASE 6:
What if I have the reverse case of this `ReadHeavy` index and have an index `WriteHeavy` where I want to write mostly to special nodes and then replicate to the others? The same type of constraints could apply, by using REQUIRE PRIMARY on a few write-heavy nodes and setting DESIRE MIN REPLICAS to `1` for the same nodes so that they have a complete set of PRIMARY shards.

If I want to hold back replication to the other nodes, I could add a REQUIRE INDEX APPROVAL to the other nodes (note: we need a negation for values so you can express "everything but these attribute values"), do the full indexing, then remove that constraint, letting it replicate across once complete.
USE CASE 7:
I add a daily index that needs to have constraints, so the name suffix changes on each index; how do constraints apply? Use wildcards at the end of index names, for example:

```
["nodetype"]  VALUE ["webapp, batch"]  APPROVE INDEX DailySomething-*
```

or

```
["nodetype", "_host", "_index"]  VALUE ["webapp", "*", "DailySomething-*"]  DESIRE MIN REPLICAS 1
```

or, if the indexes are all added to an alias, constraints could be based on the alias. The only issue is that sometimes these types of indexes are created and only added to the alias after a while, so maybe the temporary index has constraints that match wildcards, and the final alias has constraints that match on it.

```
["zone", "_alias"]  VALUE ["us-east", "DailyAll"]  DESIRE MIN REPLICAS 3
```
Constraints on specific indexes:
Should there be constraints on indexes other than a starting minimum replica count? And should these override topology constraints? For example, restore a daily generated index with an initial REQUIRE MAX REPLICAS of `1`, and then once restored increase it by changing to an index-specific DESIRE MIN REPLICAS `_all`; or remove my index-specific MAX of 1 and let the topology constraints take over.

Per-index constraints are the beginner case, and the special case. Topology should be the production norm. But allowing index overrides (with a warning when in conflict with topology) for special cases allows the cases described above.
Constraint weighting
Should conflicts between constraints be resolved by some weight given to each constraint? If two conflicting constraints have the same weight it is a problem; if they have different weights, the higher weight wins. You would put your safety constraints at the highest weight.
Constraint precedence:
Should there be precedence rules for constraints?
For example, for replica count there is precedence here when multiple constraints overlap. MIN wins over DESIRED and MAX, MAX wins over DESIRED. A conflict with MAX results in a cluster warning that can be viewed from cluster state. Only failures on MIN to be satisfied result in index state of RED.
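A hedged sketch of how that precedence could resolve overlapping replica-count constraints for one scope into an effective range (assumptions: MIN beats MAX, MAX beats DESIRED, and `_all` means the total node count; continuing the earlier Python illustration):

```python
def effective_replica_range(constraints, total_nodes: int):
    """Resolve overlapping replica-count constraints into (min, max, desired)
    plus warnings for MIN/MAX conflicts. Illustrative only."""
    def amount(c):
        return total_nodes if c.operand == "_all" else int(c.operand)

    mins     = [amount(c) for c in constraints if c.directive == "REQUIRE MIN REPLICAS"]
    maxes    = [amount(c) for c in constraints if c.directive == "REQUIRE MAX REPLICAS"]
    desireds = [amount(c) for c in constraints if c.directive == "DESIRE MIN REPLICAS"]

    minimum = max(mins, default=0)
    maximum = min(maxes, default=total_nodes)
    warnings = []
    if minimum > maximum:  # MIN wins over MAX, but surface a cluster warning
        warnings.append(f"MIN {minimum} conflicts with MAX {maximum}; MIN wins")
        maximum = minimum
    # DESIRED never overrides a REQUIRE; clamp it into the required range.
    desired = min(max(max(desireds, default=minimum), minimum), maximum)
    return minimum, maximum, desired, warnings
```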
GREEN/YELLOW/RED Cluster and Index Status:
Cluster and index status should be queried a bit differently.

Take use case 5, where `batch` nodes MUST have a full copy of the index and `webapp` nodes SHOULD have a full copy. A `batch` process would wait for yellow state on the `ReadHeavy` index by querying status for:

```
["nodetype", "_host", "_index"]  VALUE ["batch", "_me", "ReadHeavy"]
```

So if the index health of `ReadHeavy` is generally ok (all primaries exist) and the constraints that affect nodetype `batch` on the `_me` (my) host for index `ReadHeavy` are all satisfied, then I receive a GREEN response, and my batch job would continue.

TODO: this needs heavy spec'ing to figure out how status vs. constraints are met. But I think the base index health is only based on index-specific constraints and not the global ones. Other health checks should be from the perspective of the user of the index, to ensure the constraints that matter to them are satisfied. You could always ask "for index `ReadHeavy`, are ALL constraints satisfied EVERYWHERE" and do a higher-level index check. But more fine-grained makes sense; for example, one data center being YELLOW but mine being GREEN should allow me to do what I want to do in my app.
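Purely as a sketch of the idea (not an API proposal), a constraint-scoped status check could resolve `_me` to the local node's attribute value and report health from only the constraints that match the queried scope; `value_matches` is the hypothetical helper from earlier, and `is_satisfied` stands in for whatever evaluation the allocator already performs:

```python
def scoped_status(scope_attrs, scope_values, local_node_attrs,
                  constraints, is_satisfied) -> str:
    """RED if a matching REQUIRE constraint is unmet, YELLOW if only DESIRE
    constraints are unmet, otherwise GREEN. Illustrative only."""
    # Resolve the special "_me" value to this node's own attribute value.
    resolved = [local_node_attrs.get(a, "") if v == "_me" else v
                for a, v in zip(scope_attrs, scope_values)]

    def in_scope(c):
        return (c.attrs == scope_attrs and
                all(value_matches(r, cv) for r, cv in zip(resolved, c.values)))

    status = "GREEN"
    for c in filter(in_scope, constraints):
        if not is_satisfied(c):
            if c.directive.startswith("REQUIRE"):
                return "RED"
            status = "YELLOW"
    return status
```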
For Later, DISTANCE Constraints
Adding distance constraints later will allow the system to do smart things by knowing what is SAME rack, NEAR in the data center, or FAR such as across a WAN, or using a numeric value, so that index replication across WANs can be handled differently, for example a topology constraint for FAR sets of nodes using async replication instead of sync. But they are not needed for the above use cases.
Related:
See #18723, which isn't needed if this issue is done. Maybe #18723 is a stopgap for 5.x while this issue is for 6.x of Elasticsearch.