elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

Handle mutations that affect the data layer #1755

Open · barkbay opened this issue 5 years ago

barkbay commented 5 years ago

As a follow-up of #1693, we should handle the cases where the "data" type of a set of nodes is changed.
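For illustration, a minimal sketch of such a mutation, assuming a hypothetical cluster named `quickstart` with a single nodeSet using the legacy `node.master`/`node.data` settings: a master+data nodeSet reconfigured as master-only.

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart           # hypothetical cluster name
spec:
  version: 7.9.3
  nodeSets:
  - name: default
    count: 3
    config:
      node.master: true
      node.data: false       # previously "true": the nodes may already hold shards
```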

sebgl commented 5 years ago

I tried locally to restart a data node (holding shards) as a dedicated master node. Elasticsearch fails at startup with the following error:

["org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: Node is started with node.data=false, but has shard data: [/usr/share/elasticsearch/data/nodes/0/indices/2Su-Fc0uSTin6WHuuCfrcQ/0]. Use 'elasticsearch-node repurpose' tool to clean up"

It looks like we just cannot convert an existing data node to a non-data node if it already hosts some shards.

We could detect that change in the ES resource and either:

pebrc commented 5 years ago

Do we need to do anything here? Is it not the same principle of allowing users full access to make any configuration change they want and as a consequence shoot themselves in the foot, just as they would on bare metal?

Once a user (mis-)configures a data node into a master-only node, they will have to fix the configuration and return the node to its previous state?

barkbay commented 5 years ago

Do we need to do anything here?

It would mean that we would only handle master type changes automatically (D → MD and MD → D). But maybe you're right, we could consider it a bad configuration change. Initially I wanted to move all the shards off the nodes if the user wants to turn them into master-only, but I'm not sure it is the right way to do that.

leosunmo commented 3 years ago

I just ran into this when converting Data/Master nodes to Master-only. We had a very small cluster in an early dev state, so to save money we allowed the Master nodes to store data as well. Now that we want to grow the cluster, we decided to remove data from the Master nodes (as is best practice).

The problem I see with this is that it's not quite the same as shooting myself in the foot on "bare metal" (ES straight on EC2 in my case, prior to moving to ECK), because I cannot exec/SSH into the node and run elasticsearch-node repurpose. The Master Pod ends up in a crash loop.

I don't quite know what the best practice here would be. Create a new Master node group and remove the old one once they are all up? I think we need a migration path here for Master/Data -> Master-only node groups. Integrating elasticsearch-node repurpose somehow would make sense, or at least give me the opportunity to kubectl exec into the Pod and run it myself.
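A sketch of that migration path, assuming hypothetical nodeSet names (`master-data` for the existing nodes, `master` for the new dedicated masters): the new nodeSet is added first, and the old one is only removed or repurposed once the new masters have joined the cluster.

```yaml
spec:
  nodeSets:
  - name: master-data        # existing nodes, left untouched for now
    count: 3
    config:
      node.master: true
      node.data: true
  - name: master             # new dedicated master-only nodes
    count: 3
    config:
      node.master: true
      node.data: false
```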

leosunmo commented 3 years ago

I just tried to drain the shards from my masters before making them master-only nodes, but ECK "rebalances" the shards before shutting down the first master, so that doesn't work.

barkbay commented 3 years ago

elasticsearch-node operations such as repurpose are considered unsafe and are only available when a node is shut down, that is, when the Elasticsearch process is not running. I'm not sure there is an easy way to do that, especially knowing that the Pods are managed by the StatefulSet controller.

Create a new Master node group and remove the old one once they are all up?

I think it should work. Also, instead of removing the old nodeSet, I think you can just remove its master role if you want to keep the data there.
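Continuing the hypothetical nodeSet names from the sketch above, the old nodeSet could keep its shards by dropping only the master role once the dedicated masters are up:

```yaml
spec:
  nodeSets:
  - name: master-data        # keeps its shards, now data-only
    count: 3
    config:
      node.master: false     # previously "true"
      node.data: true
  - name: master             # dedicated masters added in the previous step
    count: 3
    config:
      node.master: true
      node.data: false
```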

anyasabo commented 3 years ago

Catching this with webhook validation seems kind of a pain on our end (since we need to call ES to find out if it has shard data), but I wonder if it's worth clarifying in our docs, since the ES error message isn't particularly helpful to people (I think to run the repurpose tool you would probably need to override the command for the nodeSet so it doesn't try to run ES?).
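One possible sketch of that idea, assuming the nodeSet's podTemplate is used to replace the container entrypoint so the Pod stays up without Elasticsearch running (names and the exact command are illustrative):

```yaml
spec:
  nodeSets:
  - name: master
    count: 3
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          # Keep the container alive without starting Elasticsearch so that
          # the repurpose tool can be run against the stopped node's data path.
          command: ["bash", "-c", "sleep infinity"]
```

With Elasticsearch stopped but the Pod still running, something like `kubectl exec -it <pod> -- bin/elasticsearch-node repurpose` could then be run manually before reverting the command override.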