At the moment ILM scales somewhat poorly as we move to very large numbers of indices. The reason for this is that `org.elasticsearch.xpack.ilm.IndexLifecycleService#clusterChanged` does a full inspection of all indices in the cluster state to see if there is work to be done by ILM.
This inspection is itself fairly expensive because it repeatedly parses per-index metadata into `LifecycleExecutionState` and, more importantly, calls the expensive `org.elasticsearch.xpack.ilm.IndexLifecycleRunner#getCurrentStep(org.elasticsearch.xpack.ilm.PolicyStepsRegistry, java.lang.String, org.elasticsearch.cluster.metadata.IndexMetadata, org.elasticsearch.xpack.core.ilm.LifecycleExecutionState)` in a hot loop.
Ideally, ILM should be refactored into something more similar to the `SnapshotService`, which only does a full inspection of all snapshots+shards on a master failover and otherwise keeps track of its internal state directly on the master node.
Concretely, this would mean that when an index moves from one state to another, the required actions would just be chained logically through a series of callbacks (`rollover-step -> do rollover -> next-step`) instead of the current model, where step transitions are triggered by the cluster state changes that the previous step caused.
This would make the work ILM does per cluster state update pretty much O(1) in the number of indices, outside of the master-failover scenario.
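
To make the callback-chaining idea concrete, here is a minimal sketch in plain Java. It is not Elasticsearch code: `ChainedIlmSketch`, `Step`, `ChainedRunner`, and the index name are hypothetical names invented for illustration, and the steps only record that they ran. The point is only that each step hands control to the next one through a completion callback, so advancing a single index never requires scanning all indices.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Elasticsearch API: all names here are assumptions.
final class ChainedIlmSketch {

    /** A step does its work for one index, then invokes onDone to trigger the next step. */
    interface Step {
        void execute(String index, Runnable onDone);
    }

    /** Runs a fixed list of steps for one index, chaining them through callbacks. */
    static final class ChainedRunner {
        private final List<Step> steps;

        ChainedRunner(List<Step> steps) {
            this.steps = List.copyOf(steps);
        }

        void start(String index) {
            runFrom(index, 0);
        }

        private void runFrom(String index, int position) {
            if (position >= steps.size()) {
                return; // no steps left for this index
            }
            // The completion callback of step N directly starts step N+1;
            // no other index is inspected along the way.
            steps.get(position).execute(index, () -> runFrom(index, position + 1));
        }
    }

    /** Builds a step that only logs its name, standing in for real work. */
    private static Step loggingStep(String name, List<String> log) {
        return (index, onDone) -> {
            log.add(name + ":" + index);
            onDone.run();
        };
    }

    /** Runs the chain from the description above and returns the order of execution. */
    static List<String> run(String index) {
        List<String> log = new ArrayList<>();
        List<Step> steps = List.of(
            loggingStep("rollover-step", log),
            loggingStep("do rollover", log),
            loggingStep("next-step", log));
        new ChainedRunner(steps).start(index);
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run("logs-000001"));
    }
}
```

In a real implementation the callbacks would presumably be asynchronous listeners, and the in-memory state would need to be rebuilt with a full scan after a master failover, as described above.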
Relates #77466