Closed: apatrida closed this 8 years ago
We just discussed it in our weekly Fixit Friday meeting. While this would be more flexible, it also makes things more complicated, whereas the current way of configuring allocation is simple (at least in the simple cases) and enough for most users. So we'd rather keep the current way.
I also want to add that it is extremely difficult to digest issues like this. I can see what you tried to achieve, but we have to break things up into smaller and more digestible solutions. This entire thing seems implementable as a lot of small PRs and issues, deprecating things step by step. That way it's faster to get improvements to the user, get better code reviews, add better APIs, and make things simpler and clearer.
@s1monw First the concept (like this) needs to be accepted as an idea that we want to pursue. After that we can break it down and re-address it as elemental PRs/issues. But it has to be thought through as a whole to make something that actually works. @jpountz says the current system is simple, and it does have some simple parts you can use to make simple things happen, but it also completely fails at making truly scalable topologies for big deployments that need control and currently have absolutely no way to make it work. The current individual control mechanisms do not work together to describe cohesive topologies.
So what I was hoping for here was to get agreement on something along these lines, check out use cases and collect more info, then come back with the building blocks to make it happen. Instead of closing big ideas, there needs to be some way (from outside your internal meetings) to discuss the big things, then come back at them.
@jpountz I'm not sure how simple a combination of the current features really is: replica count, allocation/rack awareness, and auto-expand replicas (e.g. `3-all`), which has a permanent bug #2869 that interferes with the other two, doesn't always seem to respond or do what it says it will, and definitely doesn't when used with the above. With these three features you can end up in error states and with layouts not really close to what you intended. You can't control master distribution, you can't do things like ensuring a full copy of an index is on each node, etc.
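For reference, the combination being discussed looks roughly like this when written out with the current setting names (a sketch only; the values shown are illustrative, expressed as a plain mapping for brevity):

```python
# Illustrative only: the three current mechanisms being combined.
current_settings = {
    # per index: a fixed number of extra copies
    "index.number_of_replicas": 2,
    # per index: grow/shrink copies with the node count, e.g. "3-all"
    "index.auto_expand_replicas": "3-all",
    # cluster wide: spread copies across a node attribute such as zone or rack
    "cluster.routing.allocation.awareness.attributes": "zone",
    # forced awareness across these zone values
    "cluster.routing.allocation.awareness.force.zone.values": "us-east,us-west",
}
```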
With constraints you can state a few simple ones to replace the above "simple" features and have documentation that is one page. With 2 or 3 constraints in identical form you express the same thing, and from there you can add a few more to build wondrous topologies. So there is a simple, easy-to-document version of the above for those simple cases. Simplicity is how you present it and what you show to the simple users; it is a documentation problem. That shouldn't limit writing a feature that allows full control.
I can write a sample constraint solver if you want to see that it is not very complicated code to answer, for a given index and node, what should happen, if that helps.
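To make that offer concrete, here is a minimal sketch (not a proposed implementation, just Python used as illustration) of how a single MIN REPLICAS constraint could be checked against a shard layout. The `Constraint` shape mirrors the `[attributes] VALUE [values] DIRECTIVE operand` form used in the proposal further down; the `nodes`/`shards` structures are hypothetical.

```python
from collections import defaultdict
from fnmatch import fnmatch
from typing import Dict, List, NamedTuple

class Constraint(NamedTuple):
    attrs: List[str]        # e.g. ["zone", "_index"]
    values: List[str]       # e.g. ["*", "*"]; wildcards allowed
    directive: str          # e.g. "REQUIRE MIN REPLICAS"
    operand: object = None  # replica count, index pattern for APPROVE INDEX, etc.

def check_min_replicas(constraint: Constraint,
                       nodes: Dict[str, Dict[str, str]],
                       shards: List[Dict[str, str]]) -> List[str]:
    """List violations of one MIN REPLICAS constraint.

    `nodes` maps node name -> attributes; `shards` is a flat list of allocated
    copies, each like {"index": "Alpha", "shard": "0", "node": "main-node-1"}.
    """
    # Which nodes hold each (index, shard id).
    holders = defaultdict(set)
    for s in shards:
        holders[(s["index"], s["shard"])].add(s["node"])

    # Group nodes by the non-index attributes named in the constraint.
    plain_attrs = [a for a in constraint.attrs if a != "_index"]
    groups = defaultdict(set)
    for name, attrs in nodes.items():
        groups[tuple(attrs.get(a, "") for a in plain_attrs)].add(name)

    violations = []
    for (index, shard_id), on_nodes in holders.items():
        for group_key, group_nodes in groups.items():
            # Reassemble the value tuple in the constraint's attribute order.
            it = iter(group_key)
            full = [index if a == "_index" else next(it) for a in constraint.attrs]
            if not all(fnmatch(v, pat) for v, pat in zip(full, constraint.values)):
                continue
            want = len(group_nodes) if constraint.operand == "_all" else int(constraint.operand)
            count = len(on_nodes & group_nodes)
            if count < want:
                violations.append(f"{index}[{shard_id}] has {count} cop(ies) for "
                                  f"{dict(zip(constraint.attrs, full))}, wants {want}")
    return violations
```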
Sorry @apatrida - I read this through again and it just made my head hurt. While reading your examples I couldn't translate them into what the cluster would actually look like. It feels like reading Sendmail config. Clearly our existing settings are not perfect, but they are at least fairly easy to understand. I'd prefer working on incremental improvements to what we have instead.
@clintongormley maybe another approach is a PR to allow a plugin to decide shard placement and replica count preferences for the core engine, swapping out the current model via plugins. So an advanced topology plugin could take over this decision making.
Has anyone at Elastic actually hit use cases that cannot be done at all with the current settings? I'd be surprised if you have not, because we do regularly, and we hit walls with the current model that prevent improving performance or reliability.
@apatrida may I suggest that you come up with a simple example where you show a concrete set of nodes (node 1, 2, 3, 4...), a concrete set of index shards, and how you want them laid out on your nodes, in the form of "index 1 shard 1 primary, on node 1"? Also please explain why you want them that way. I see you gave quite detailed examples of possible configuration, but they involve many moving pieces and I fail to understand the why. That will help us understand what you are trying to achieve, see through a concrete example why it cannot be done with the current system (as you say there are many theoretical limitations), and also think about potential alternative solutions to let you do what you are trying to do.
Note that while I understand that you are trying to create a system that allows you to do anything, there is a lot of value in a simple system everyone can understand, albeit constrained. We err towards the latter.
The current replica count, auto-expand and rack awareness are really all competing for the same idea of shard safety, and not succeeding. They try to give shard safety but fail to be flexible, and cause either hardship in managing the system, false errors, or situations where things just don't appear to work.
The following would replace these concepts in their entirety:
Introducing "Elasticsearch Topology Constraints"
You can guide or force the distribution of replicas by setting the following constraints on combinations of node attributes. The use of "replica count" below means primary + replica shards as a whole; it is no longer "extra copies".
Node attributes are those defined by the configuration, plus the special attributes `_name`, `_host_ip`, `_publish_ip`, `_ip`, `_host`, `_index`, and `_alias`, where `_index` matches the index name and `_alias` matches any index that has the alias attached. The others are the same as used in the old allocation filtering.

Node attribute values are matched against constraint values using exact values, comma-separated lists, `*` wildcards, and the special value `_all`.
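As a small illustration of the value matching described above (exact values, comma-separated lists, `*` wildcards, and `_all` are my reading of the examples below, so treat this as an assumption):

```python
from fnmatch import fnmatch

def value_matches(attr_value: str, pattern: str) -> bool:
    """Hypothetical matching of one attribute value against one constraint value:
    '_all' and '*' match anything; a comma-separated list matches any entry;
    entries may contain '*' wildcards."""
    if pattern in ("_all", "*"):
        return True
    return any(fnmatch(attr_value, p.strip()) for p in pattern.split(","))

# value_matches("webapp", "webapp, batch")                       -> True
# value_matches("DailySomething-2016.06.18", "DailySomething-*") -> True
# value_matches("main", "webapp, batch")                         -> False
```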
Deprecated Features
These features are deprecated as they are encompassed in Topology Constraints: the per-index replica count, auto-expand replicas, and rack/allocation awareness.
Example
Example Cluster
For all examples below, here is a cluster definition giving each node and the node attributes set. The node attributes used are `zone`, `rack`, and `nodetype`.

- `main-node-1`: zone `us-east`, rack `1a`, nodetype `main`
- `main-node-2`: zone `us-east`, rack `1c`, nodetype `main`
- `main-node-3`: zone `us-east`, rack `1d`, nodetype `main`
- `main-node-4`: zone `us-west`, rack `1a`, nodetype `main`
- `main-node-5`: zone `us-west`, rack `1c`, nodetype `main`
- `main-node-6`: zone `us-west`, rack `1d`, nodetype `main`
- `webapp-node-1`: zone `us-east`, rack `1a`, nodetype `webapp`
- `webapp-node-2`: zone `us-east`, rack `1c`, nodetype `webapp`
- `webapp-node-3`: zone `us-west`, rack `1a`, nodetype `webapp`
- `webapp-node-4`: zone `us-west`, rack `1d`, nodetype `webapp`
- `batch-node-1`: zone `us-east`, rack `1d`, nodetype `batch`

Indexes:

- `Alpha` with no initial replica count setting
- `Beta` with no initial replica count setting
- `ReadHeavy` with no initial replica count setting
- `VeryImportant` with an initial REQUIRE MIN REPLICAS count of `3` defined in index settings

note: the above uses zones that may imply a WAN is present between data centers, and that may be a bad idea for a single cluster, so ignore that and focus on the example and not what those words mean in any given attribute; this is just to make an example that makes sense.
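To keep the use cases below concrete, here is the same example cluster expressed as plain data (the Python shape is only for illustration and matches the hypothetical sketch earlier in this thread):

```python
# The example cluster above: node name -> node attributes.
example_nodes = {
    "main-node-1":   {"zone": "us-east", "rack": "1a", "nodetype": "main"},
    "main-node-2":   {"zone": "us-east", "rack": "1c", "nodetype": "main"},
    "main-node-3":   {"zone": "us-east", "rack": "1d", "nodetype": "main"},
    "main-node-4":   {"zone": "us-west", "rack": "1a", "nodetype": "main"},
    "main-node-5":   {"zone": "us-west", "rack": "1c", "nodetype": "main"},
    "main-node-6":   {"zone": "us-west", "rack": "1d", "nodetype": "main"},
    "webapp-node-1": {"zone": "us-east", "rack": "1a", "nodetype": "webapp"},
    "webapp-node-2": {"zone": "us-east", "rack": "1c", "nodetype": "webapp"},
    "webapp-node-3": {"zone": "us-west", "rack": "1a", "nodetype": "webapp"},
    "webapp-node-4": {"zone": "us-west", "rack": "1d", "nodetype": "webapp"},
    "batch-node-1":  {"zone": "us-east", "rack": "1d", "nodetype": "batch"},
}
example_indexes = ["Alpha", "Beta", "ReadHeavy", "VeryImportant"]
```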
USE CASE 1:
I require each index to have a minimum of 2 replicas, and each zone to have at least 1 replica.

Topology constraints:

```
["_index"]          VALUE ["*"]       REQUIRE MIN REPLICAS 2
["zone", "_index"]  VALUE ["*", "*"]  REQUIRE MIN REPLICAS 1
```
In this case `Alpha`, `Beta`, and `ReadHeavy` would have 2 or more replicas (primary + copies), with one in each of `us-east` and `us-west`, while `VeryImportant` would have 3 replicas (primary + copies) spread across the zones. To avoid this shard split, you can add the constraint:

```
["zone", "_index"]  VALUE ["*", "*"]  REQUIRE COMPLETE SHARDS
```

Now the 3rd replica set will be in one of the two zones; if you want to lean it towards zone `us-east`, then have a higher desired count there:

```
["zone", "_index"]  VALUE ["us-east", "VeryImportant"]  DESIRE MIN REPLICAS 2
```

The planner would check all of the constraints and, given that all the REQUIRED can be satisfied while meeting the DESIRED as well, it would put 2 full copies of the `VeryImportant` index in zone `us-east` and one copy in `us-west` to satisfy the required count of `3` for this index.

Final constraints are therefore:
```
["_index"]          VALUE ["*"]                         REQUIRE MIN REPLICAS 2
["zone", "_index"]  VALUE ["*", "*"]                    REQUIRE MIN REPLICAS 1
["zone", "_index"]  VALUE ["*", "*"]                    REQUIRE COMPLETE SHARDS
["zone", "_index"]  VALUE ["us-east", "VeryImportant"]  DESIRE MIN REPLICAS 2
```
_note: the use of `_index` in each attribute list is for clarity; it could be allowed to be omitted when an index setting is used, therefore assuming `*`, or all indexes._
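As a sanity check of use case 1 (purely illustrative, reusing the hypothetical `Constraint`/`check_min_replicas` sketch and `example_nodes` data from earlier in this thread):

```python
# Use case 1's required constraints in the hypothetical Python form.
required = [
    Constraint(["_index"],         ["*"],      "REQUIRE MIN REPLICAS", 2),
    Constraint(["zone", "_index"], ["*", "*"], "REQUIRE MIN REPLICAS", 1),
]

# A candidate layout for one shard of Alpha: two copies, one per zone.
layout = [
    {"index": "Alpha", "shard": "0", "node": "main-node-1"},  # us-east
    {"index": "Alpha", "shard": "0", "node": "main-node-4"},  # us-west
]

for c in required:
    print(c.directive, check_min_replicas(c, example_nodes, layout))
# Both constraints report no violations; dropping the us-west copy would
# violate both the 2-copy minimum and the per-zone minimum for that shard.
```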
USE CASE 2:
Same as use case 1, but I also desire all primary shards to be in zone `us-east` unless that is not possible.

```
["_index"]          VALUE ["*"]                         REQUIRE MIN REPLICAS 2
["zone", "_index"]  VALUE ["*", "*"]                    REQUIRE MIN REPLICAS 1
["zone", "_index"]  VALUE ["*", "*"]                    REQUIRE COMPLETE SHARDS
["zone", "_index"]  VALUE ["us-east", "VeryImportant"]  DESIRE MIN REPLICAS 2
["zone", "_index"]  VALUE ["us-east", "*"]              DESIRE PRIMARY
```
USE CASE 3:
Same as use case 2, but I also desire that all nodes have replicas of the `VeryImportant` and `ReadHeavy` indexes.

```
["_index"]          VALUE ["*"]                         REQUIRE MIN REPLICAS 2
["zone", "_index"]  VALUE ["*", "*"]                    REQUIRE MIN REPLICAS 1
["zone", "_index"]  VALUE ["*", "*"]                    REQUIRE COMPLETE SHARDS
["zone", "_index"]  VALUE ["us-east", "VeryImportant"]  DESIRE MIN REPLICAS 2
["zone", "_index"]  VALUE ["us-east", "*"]              DESIRE PRIMARY
["_index"]          VALUE ["VeryImportant,ReadHeavy"]   DESIRE MIN REPLICAS _all
```
USE CASE 4:
Same as use case 3, but I also want each individual `rack` to have at least 1 replica of each index for safety.

```
["_index"]                  VALUE ["*"]                         REQUIRE MIN REPLICAS 2
["zone", "_index"]          VALUE ["*", "*"]                    REQUIRE MIN REPLICAS 1
["zone", "_index"]          VALUE ["*", "*"]                    REQUIRE COMPLETE SHARDS
["zone", "_index"]          VALUE ["us-east", "VeryImportant"]  DESIRE MIN REPLICAS 2
["zone", "_index"]          VALUE ["us-east", "*"]              DESIRE PRIMARY
["_index"]                  VALUE ["VeryImportant,ReadHeavy"]   DESIRE MIN REPLICAS _all
["zone", "rack", "_index"]  VALUE ["*", "*", "*"]               DESIRE MIN REPLICAS 1
```
USE CASE 5:
Same as use case 4, but I do not want any indexes on `webapp` or `batch` nodes that are not approved for those nodes. I only want the `ReadHeavy` index on those nodes, and without any primary shards.

```
["_index"]                  VALUE ["*"]                           REQUIRE MIN REPLICAS 2
["zone", "_index"]          VALUE ["*", "*"]                      REQUIRE MIN REPLICAS 1
["zone", "_index"]          VALUE ["*", "*"]                      REQUIRE COMPLETE SHARDS
["zone", "_index"]          VALUE ["us-east", "VeryImportant"]    DESIRE MIN REPLICAS 2
["zone", "_index"]          VALUE ["us-east", "*"]                DESIRE PRIMARY
["_index"]                  VALUE ["VeryImportant,ReadHeavy"]     DESIRE MIN REPLICAS _all
["zone", "rack", "_index"]  VALUE ["*", "*", "*"]                 DESIRE MIN REPLICAS 1
["nodetype", "_index"]      VALUE ["webapp, batch", "*"]          REQUIRE INDEX APPROVAL
["nodetype"]                VALUE ["webapp, batch"]               APPROVE INDEX ReadHeavy
["nodetype", "_index"]      VALUE ["webapp, batch", "ReadHeavy"]  DISALLOW PRIMARY
```

We now have the `ReadHeavy` index spreading some of its shards across `webapp` and `batch` nodes, excluding any primaries. But what we actually want is to try to have a full copy per `webapp` machine, and we surely must have a copy per `batch` machine. The `webapp` machines could query through to the main cluster if they have not yet replicated their copy, but the `batch` machines should not, to avoid pounding the main cluster nodes with heavy querying. So we add these constraints:

```
["nodetype", "_host", "_index"]  VALUE ["webapp", "*", "ReadHeavy"]  DESIRE MIN REPLICAS 1
["nodetype", "_host", "_index"]  VALUE ["batch", "*", "ReadHeavy"]   REQUIRE MIN REPLICAS 1
```
Now we have caused each host to have a full copy of the index for `webapp` and `batch` nodes, where it is a hard requirement for `batch` and soft for `webapp`.

Final constraints are therefore:

```
["_index"]                       VALUE ["*"]                           REQUIRE MIN REPLICAS 2
["zone", "_index"]               VALUE ["*", "*"]                      REQUIRE MIN REPLICAS 1
["zone", "_index"]               VALUE ["*", "*"]                      REQUIRE COMPLETE SHARDS
["zone", "_index"]               VALUE ["us-east", "VeryImportant"]    DESIRE MIN REPLICAS 2
["zone", "_index"]               VALUE ["us-east", "*"]                DESIRE PRIMARY
["_index"]                       VALUE ["VeryImportant,ReadHeavy"]     DESIRE MIN REPLICAS _all
["zone", "rack", "_index"]       VALUE ["*", "*", "*"]                 DESIRE MIN REPLICAS 1
["nodetype", "_index"]           VALUE ["webapp, batch", "*"]          REQUIRE INDEX APPROVAL
["nodetype"]                     VALUE ["webapp, batch"]               APPROVE INDEX ReadHeavy
["nodetype", "_index"]           VALUE ["webapp, batch", "ReadHeavy"]  DISALLOW PRIMARY
["nodetype", "_host", "_index"]  VALUE ["webapp", "*", "ReadHeavy"]    DESIRE MIN REPLICAS 1
["nodetype", "_host", "_index"]  VALUE ["batch", "*", "ReadHeavy"]     REQUIRE MIN REPLICAS 1
```
This is fairly complex, but incredibly powerful.
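A rough sketch of how the two new directives in this use case (INDEX APPROVAL and DISALLOW PRIMARY) could be decided for a single candidate allocation, continuing the hypothetical Python illustration from earlier in the thread (`Constraint` and `value_matches` as sketched there):

```python
def allocation_allowed(node_attrs: dict, index: str, is_primary: bool,
                       constraints: list) -> bool:
    """Return False if placing this shard copy on this node would break a
    REQUIRE INDEX APPROVAL or DISALLOW PRIMARY constraint. Illustrative only."""
    attrs = dict(node_attrs, _index=index)

    def matches(c):
        return all(value_matches(attrs.get(a, ""), v)
                   for a, v in zip(c.attrs, c.values))

    needs_approval = any(c.directive == "REQUIRE INDEX APPROVAL" and matches(c)
                         for c in constraints)
    approved = any(c.directive == "APPROVE INDEX" and matches(c)
                   and value_matches(index, str(c.operand))  # operand holds the index pattern
                   for c in constraints)
    if needs_approval and not approved:
        return False
    if is_primary and any(c.directive == "DISALLOW PRIMARY" and matches(c)
                          for c in constraints):
        return False
    return True

# With the final constraints above: a primary of ReadHeavy on webapp-node-1 is
# rejected, a replica of ReadHeavy is allowed, and any copy of Alpha on the
# webapp/batch nodes is rejected for lack of approval.
```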
USE CASE 6:
What if I have the reverse case of this `ReadHeavy` index and have an index `WriteHeavy` where I want to write mostly to special nodes and then replicate to the others? The same type of constraints could apply, by using REQUIRE PRIMARY on a few write-heavy nodes and setting DESIRE MIN REPLICAS to `1` for the same nodes so that they have a complete set of PRIMARY shards.

If I want to hold back replication to the other nodes, I could add a REQUIRE INDEX APPROVAL to the other nodes (note: we need a negation for values so you can express "everything but these attribute values"), do the full indexing, then remove that constraint, letting it replicate across once complete.
USE CASE 7:
I add a daily index that needs to have constraints, so the name suffix changes on each index; how do constraints apply? Use wildcards at the end of index names, for example:

```
["nodetype"]  VALUE ["webapp, batch"]  APPROVE INDEX DailySomething-*
```

or

```
["nodetype", "_host", "_index"]  VALUE ["webapp", "*", "DailySomething-*"]  DESIRE MIN REPLICAS 1
```

or, if the indexes are all added to an alias, constraints could be based on the alias. The only issue is that sometimes these types of indexes are created and only added to the alias after a while, so maybe the temporary index has constraints that match wildcards, and the final alias has constraints that match on it.

```
["zone", "_alias"]  VALUE ["us-east", "DailyAll"]  DESIRE MIN REPLICAS 3
```
Constraints on specific indexes:
Should there be constraints on indexes other than a starting minimum replica count? And should these override topology constraints? For example, restore a daily generated index with an initial REQUIRE MAX REPLICAS of `1`, and then once restored increase it by changing to an index-specific DESIRE MIN REPLICAS `_all`; or remove my index-specific MAX of 1 and let the topology constraints take over.

Per-index constraints are the beginner case, and the special case. Topology should be the production norm. But allowing index overrides (with a warning when in conflict with topology) for special cases allows the cases described above.
Constraint weighting
Should conflicts between constraints be resolved by some weight given to each constraint? If two conflicting constraints have the same weight it is a problem; if they have different weights, the higher weight wins. You would put your safety constraints at the highest weight.
Constraint precedence:
Should there be precedence rules for constraints?
For example, for replica count there is precedence here when multiple constraints overlap. MIN wins over DESIRED and MAX, MAX wins over DESIRED. A conflict with MAX results in a cluster warning that can be viewed from cluster state. Only failures on MIN to be satisfied result in index state of RED.
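A hedged sketch of how that precedence could resolve overlapping replica-count constraints for one scope into an effective range (assumptions: MIN beats MAX, MAX beats DESIRED, and `_all` means the total node count; continuing the earlier Python illustration):

```python
def effective_replica_range(constraints, total_nodes: int):
    """Resolve overlapping replica-count constraints into (min, max, desired)
    plus warnings for MIN/MAX conflicts. Illustrative only."""
    def amount(c):
        return total_nodes if c.operand == "_all" else int(c.operand)

    mins     = [amount(c) for c in constraints if c.directive == "REQUIRE MIN REPLICAS"]
    maxes    = [amount(c) for c in constraints if c.directive == "REQUIRE MAX REPLICAS"]
    desireds = [amount(c) for c in constraints if c.directive == "DESIRE MIN REPLICAS"]

    minimum = max(mins, default=0)
    maximum = min(maxes, default=total_nodes)
    warnings = []
    if minimum > maximum:  # MIN wins over MAX, but surface a cluster warning
        warnings.append(f"MIN {minimum} conflicts with MAX {maximum}; MIN wins")
        maximum = minimum
    # DESIRED never overrides a REQUIRE; clamp it into the required range.
    desired = min(max(max(desireds, default=minimum), minimum), maximum)
    return minimum, maximum, desired, warnings
```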
GREEN/YELLOW/RED Cluster and Index Status:
Cluster and index status should be queried a bit differently.

Take use case 5, where `batch` nodes MUST have a full copy of the index and `webapp` nodes SHOULD have a full copy. A `batch` process would wait for yellow state on the `ReadHeavy` index by querying status for:

```
["nodetype", "_host", "_index"]  VALUE ["batch", "_me", "ReadHeavy"]
```

So if the index health of `ReadHeavy` is generally ok (all primaries exist) and the constraints that affect nodetype `batch` on the `_me` (my) host for index `ReadHeavy` are all satisfied, then I receive a GREEN response, and my batch job would continue.

TODO: this needs heavy spec'ing to figure out how status vs. constraints are met. But I think the base index health is only based on index-specific constraints and not the global ones. Other health checks should be from the perspective of the user of the index, to ensure the constraints that matter to them are satisfied. You could always ask "for index `ReadHeavy`, are ALL constraints satisfied EVERYWHERE" and do a higher-level index check. But more fine-grained makes sense; for example, one data center being YELLOW but mine being GREEN should allow me to do what I want to do in my app.
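Purely as a sketch of the idea (not an API proposal), a constraint-scoped status check could resolve `_me` to the local node's attribute value and report health from only the constraints that match the queried scope; `value_matches` is the hypothetical helper from earlier, and `is_satisfied` stands in for whatever evaluation the allocator already performs:

```python
def scoped_status(scope_attrs, scope_values, local_node_attrs,
                  constraints, is_satisfied) -> str:
    """RED if a matching REQUIRE constraint is unmet, YELLOW if only DESIRE
    constraints are unmet, otherwise GREEN. Illustrative only."""
    # Resolve the special "_me" value to this node's own attribute value.
    resolved = [local_node_attrs.get(a, "") if v == "_me" else v
                for a, v in zip(scope_attrs, scope_values)]

    def in_scope(c):
        return (c.attrs == scope_attrs and
                all(value_matches(r, cv) for r, cv in zip(resolved, c.values)))

    status = "GREEN"
    for c in filter(in_scope, constraints):
        if not is_satisfied(c):
            if c.directive.startswith("REQUIRE"):
                return "RED"
            status = "YELLOW"
    return status
```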
For Later, DISTANCE Constraints
Adding distance constraints later will allow the system to do smart things by knowing what is SAME rack, NEAR in the data center, or FAR such as across a WAN, or using a numeric value, so that index replication across WANs can be handled differently, for example a topology constraint for FAR sets of nodes using async replication instead of sync. But they are not needed for the above use cases.
Related:
See #18723, which isn't needed if this issue is done. Maybe #18723 is a stopgap for 5.x while this issue is for 6.x of Elasticsearch.