Rework ILM to not Require Inspecting all Indices on every Cluster State Update

original-brownbear commented 2 years ago

At the moment ILM scales somewhat poorly as we move to very large numbers of indices. The reason for this is that org.elasticsearch.xpack.ilm.IndexLifecycleService#clusterChanged does a full inspection of all indices in the cluster state to see if there is work to be done by ILM.

Inspecting every index is itself fairly expensive because it requires (repeatedly) parsing per-index metadata into LifecycleExecutionState and, more importantly, because it calls the expensive org.elasticsearch.xpack.ilm.IndexLifecycleRunner#getCurrentStep(org.elasticsearch.xpack.ilm.PolicyStepsRegistry, java.lang.String, org.elasticsearch.cluster.metadata.IndexMetadata, org.elasticsearch.xpack.core.ilm.LifecycleExecutionState) in a hot loop.
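To make the cost concrete, here is a minimal sketch of the scan-everything pattern. It is not the real IndexLifecycleService code: `FullScanSketch`, `parseCurrentStep`, and the map-based "metadata" are hypothetical stand-ins, assumed only for illustration. It just counts how often per-index metadata gets parsed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the scan-everything model; not the real ILM classes. */
class FullScanSketch {
    /** Number of per-index metadata parses performed so far. */
    static int parseCount = 0;

    /** Stand-in for parsing per-index metadata and resolving the current step. */
    static String parseCurrentStep(Map<String, String> indexMetadata) {
        parseCount++;
        return indexMetadata.get("step");
    }

    /** Runs once per cluster state update and inspects every index, changed or not. */
    static List<String> clusterChanged(Map<String, Map<String, String>> indices) {
        List<String> needsWork = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : indices.entrySet()) {
            String step = parseCurrentStep(entry.getValue());
            if (!"complete".equals(step)) {
                needsWork.add(entry.getKey());
            }
        }
        return needsWork;
    }

    /** Returns the total number of parses for the given index count and update count. */
    static int simulate(int indexCount, int updates) {
        parseCount = 0;
        Map<String, Map<String, String>> indices = new HashMap<>();
        for (int i = 0; i < indexCount; i++) {
            Map<String, String> metadata = new HashMap<>();
            metadata.put("step", "complete"); // nothing left to do for this index
            indices.put("index-" + i, metadata);
        }
        for (int u = 0; u < updates; u++) {
            clusterChanged(indices);
        }
        return parseCount;
    }

    public static void main(String[] args) {
        System.out.println("parses: " + simulate(1000, 50));
    }
}
```

Even when no index has any work pending, the number of parses in this sketch grows as (number of indices) x (number of cluster state updates).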

Ideally, ILM should be refactored into something more similar to the SnapshotService, which only does a full inspection of all snapshots+shards on a master failover but otherwise keeps track of its internal state directly on the master node. Concretely, this would mean that when an index moves from one state to another, the required actions would just be chained logically through a series of callbacks (rollover-step -> do rollover -> next-step) instead of the current model, where the step transitions are triggered by the changes in the cluster state that the previous step caused.

This would make the per-update cost of ILM pretty much O(1) in the number of indices outside of the master-failover scenario.
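The callback chaining described above could look roughly like the following sketch. All names here (`CallbackChainSketch`, `Step`, `runFrom`, `runPolicy`) are hypothetical and assumed for illustration; steps run synchronously to keep the example short, whereas a real implementation would complete them asynchronously and would still need the full scan on master failover.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of chaining lifecycle steps through callbacks. */
class CallbackChainSketch {
    /** A step does its work for one index, then signals completion via onDone. */
    interface Step {
        void execute(String index, Runnable onDone);
    }

    /** Builds a step that only records that it ran, then completes immediately. */
    static Step namedStep(String name, List<String> log) {
        return (index, onDone) -> {
            log.add(index + ":" + name);
            onDone.run();
        };
    }

    /** Runs the step at the given position; its completion callback starts the next one. */
    static void runFrom(String index, List<Step> steps, int position) {
        if (position >= steps.size()) {
            return; // policy finished for this index
        }
        steps.get(position).execute(index, () -> runFrom(index, steps, position + 1));
    }

    /** Runs the three example steps for a single index and returns the execution log. */
    static List<String> runPolicy(String index) {
        List<String> log = new ArrayList<>();
        List<Step> steps = Arrays.asList(
            namedStep("rollover-step", log),
            namedStep("do rollover", log),
            namedStep("next-step", log));
        runFrom(index, steps, 0);
        return log;
    }

    public static void main(String[] args) {
        System.out.println(runPolicy("logs-000001"));
    }
}
```

The point of the design is that advancing one index only touches that index: no other index is inspected when a step completes.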

Relates #77466

elasticmachine commented 2 years ago

Pinging @elastic/es-data-management (Team:Data Management)

PDTCCLF commented 2 years ago

I'd like to send a pull request for this issue.

elastic / elasticsearch

Rework ILM to not Require Inspecting all Indices on every Cluster State Update #80407