elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

ILM Phase Execution on Index Count, Aggregate Size, or FIFO #47764

Open woodchalk opened 4 years ago

woodchalk commented 4 years ago

ILM phases outside of hot rely exclusively on min_age for execution. There is currently no way to execute phases on any other criteria, which leaves Elasticsearch susceptible to out-of-space emergencies when indexes grow slowly over time. Age-based execution may be advantageous for policy reasons (keep abc logs for xyz months), but it is not useful for resource maximization (I want to use 90% of disk space).

Executing phases based on the count of indexes or aggregate sizes promotes better resource usage. I’m more interested in keeping as many indexes as my infrastructure will allow. I see a few ways to achieve that.

Execute phases based on index count. This model would allow you to define fixed index counts within each policy. The advantage being that this is easy. For example: I’d like to rollover hot after 10GB, and keep 9 indexes in warm. This policy would never grow past 100GB.

Execute phases based on aggregate size. This model would allow you to define cumulative index sizes within a phase. The advantage being that this is also easy, but covers more corner cases than a simple count. For example: I’d like to rollover hot after 10GB or 2 days, and keep 90GB of indexes in warm. This policy would keep as much data as possible within the aggregate bounds defined. Perhaps the daily indexes grow to 10GB, but the weekend indexes grow to only 4GB, this would ensure you keep as much data in the policy as possible.
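For illustration only, the count and aggregate-size models might look like hypothetical phase criteria alongside the real min_age. Note that `max_index_count` and `max_total_size` are invented names here; neither exists in ILM, and the rest of the policy uses real syntax only as scaffolding:

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "10gb", "max_age": "2d" }
        }
      },
      "delete": {
        "max_index_count": 9,
        "max_total_size": "90gb",
        "actions": { "delete": {} }
      }
    }
  }
}
```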

Execute phases based on FIFO. At a high level, remove the oldest indexes within the cluster on a first-in-first-out basis. You define an operating threshold within a cluster and enforce a delete phase when you reach it. The advantage being that this is truly disaster-proof (i.e. no more read_only_allow_delete!!). For example: my 1TB cluster should remove the oldest index when my indexes use more than 90% disk space, or 900GB.
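The core of the FIFO model is a pure "which indices exceed the budget" computation, which is simple to sketch. This is purely illustrative (nothing like it exists in ILM today), and in practice the per-index sizes and creation dates would come from the cluster's stats/cat APIs:

```python
from typing import List, Tuple

def fifo_deletions(indices: List[Tuple[str, int, int]],
                   capacity_bytes: int,
                   high_watermark: float = 0.9) -> List[str]:
    """Return the names of the oldest indices to delete so that the total
    index size drops back under high_watermark * capacity_bytes.
    `indices` is a list of (name, size_in_bytes, creation_epoch_ms)."""
    budget = int(capacity_bytes * high_watermark)
    total = sum(size for _, size, _ in indices)
    to_delete = []
    # Oldest first: first in, first out.
    for name, size, _ in sorted(indices, key=lambda i: i[2]):
        if total <= budget:
            break
        to_delete.append(name)
        total -= size
    return to_delete
```

For the 1TB / 90% example above: a cluster holding 950GB of indices would delete oldest-first until the total falls back under 900GB.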

elasticmachine commented 4 years ago

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

hamishforbes commented 4 years ago

Just came here to post this exact issue.

Index count based phases would be incredibly useful. I'm in the process of moving various services over from Cloudwatch logging to logging into ELK, which means my log volume is wildly variable.

Index rollover happens at x GB, which is great: my indices are always the same size. However, I constantly have out-of-disk-space issues because my indices are deleted after n days, but on Monday I might create 5 indices of x GB and on Tuesday only 1.

Setting retention based on days is causing all kinds of problems for me. I either set my retention really low and have loads of (paid-for) wasted disk space, or set it correctly for the current workload and have it blow up at 2am because the workload has changed, losing a bunch of logs until I can fix it.

hamishforbes commented 4 years ago

Hi, has there been any progress on this issue? It's come up for us again because we've seen ~30-40% increase in traffic this week (COVID-19 related) which has caused a corresponding increase in log data and our elastic cluster blew up at 4am due to running out of disk, again.

I'm curious why @jakelandis added the 'high hanging fruit' label, could someone elaborate on why this is difficult? It seems like adding a count condition as well as max_age would be very simple, but I'm not familiar with the codebase.

It looks like there's been a couple of other issues posted relating to this as well #49392 #52308

jasontedor commented 4 years ago

@hamishforbes It is high-hanging fruit because of the architecture of ILM, which is oriented around managing a single index at a time, but the request here is to manage a group of indices (e.g., whose name share a common prefix). Since that's a fundamental rearchitecture/requires an investment in new infrastructure in the codebase, there isn't a quick win here.

hamishforbes commented 4 years ago

Ah I see, because an ILM policy can apply to multiple groups of indices. That makes sense, thanks for the insight!

0xtf commented 4 years ago

Just wanted to add a +1 to this.

Defining an ILM policy solely on size is absolutely critical for inconsistent workloads.

There are many examples/scenarios, but one I personally experience is how hard it is to size intake for network-related data. Even if I can make a rough estimate for one site, additional sites will rarely follow the same pattern (number of people, type of traffic, site function (datacenter vs office), and many other factors).

The possibility of FIFO would ease things even further.

At this point I know of many deployments that still haven't found a balance between the amount of data to keep and availability, so they waste resources yet still get rid of data too soon, for fear of a sudden spike in intake causing downtime.

hamishforbes commented 4 years ago

FWIW I have since disabled the delete phase in my ILM policy and switched back to elasticsearch-curator using an index prefix and count for retention
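For anyone wanting the same workaround: a curator action file along roughly these lines does count-based retention. The `logs-` prefix and the count of 9 are placeholders for your own pattern and limit:

```yaml
actions:
  1:
    action: delete_indices
    description: Keep only the 9 most recent logs- indices
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: logs-
      - filtertype: count
        count: 9
```

The count filter keeps the newest `count` matching indices and marks the rest for deletion, which is exactly the "index prefix and count" retention described here.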

Here's a graph of % free space across my logging cluster; I don't think I need to point out which day I made the switch on :)

[Image: Screenshot 2020-04-24 at 08:47:32]
0xtf commented 4 years ago

That’s interesting and definitely seems like a viable alternative! :)

Unfortunately, as a customer of Elastic Cloud, Curator is not an option (unless I have it running somewhere else, which kind of defeats the purpose of EC in the first place).


cataclysdom commented 4 years ago

+1

ILM needs to handle the overall index lifecycle instead of a single index at a time. Curator is an option, but just another workaround for functionality that should exist as a core component.

fbaligand commented 4 years ago

At the very least, having a condition based on index count would be great, and I don't think it would be overly complicated. ILM is based on an alias, so it can know how many indices are linked to the alias (BTW, Kibana ILM policy management shows this info). Then, when the index count is over the limit, choose the oldest index linked to the alias and perform the phase action on it.
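The selection logic described here is small enough to sketch. Illustrative only: in practice the alias-to-indices mapping would come from the real get-alias API and the creation dates from index settings:

```python
def over_count_limit(alias_indices: dict, limit: int) -> list:
    """alias_indices maps index name -> creation time (epoch ms) for all
    indices behind one alias. Returns the indices beyond `limit`, oldest
    first, i.e. the candidates for the next phase action."""
    by_age = sorted(alias_indices, key=alias_indices.get)  # oldest first
    excess = len(by_age) - limit
    return by_age[:excess] if excess > 0 else []
```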

georgettica commented 4 years ago

hey! what is the state of this issue? curious to know and help if I can

krejkrejkrej commented 3 years ago

Apart from a "me too" on this request, let me also add my use case for this functionality. Similar, but slightly different.

We have one collection of time-based log index series that is important: there must always be space available in the cluster to ingest new logs for these series. There are also other, less-important log index series in the cluster. I want to set a hard limit on the size of the less-important indices, to make sure that a badly behaving less-important service cannot fill up disk space and cause the important indices to go read-only. The number of indices in a series, or the cumulative size (in bytes or documents) of all indices in a series — either would do. When this limit is reached, let ILM execute an action like "delete the oldest index" or "reject writes to the indices".

berglh commented 1 year ago

I'm getting started on ILM with our ECE deployment, and I was surprised to find that I am unable to trigger a phase change based on the number of indices sitting behind a data stream in the hot tier. We have around 40 data streams in our legacy cluster, which uses date-based suffixes on yearly, monthly, weekly and daily rotation strategies. I managed to classify these into 6 different ILM policies based on the size of the index for rollover and the number of indices to keep in the hot and frozen tiers. However, if I am limited to age, I will need to create 40 different ILM policies to get a similar effect. I am fine not basing retention on age, as the limiting factor of the cluster is storage; using age to define retention seems short-sighted when there are practical limits on RAM-to-storage ratios for licensed capacity. By using size, we can predictably create safe limits for data streams that will stay within the storage constraints of our architecture.

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-data-management (Team:Data Management)

nicpenning commented 11 months ago

There are some interesting use cases here for sure!

I can see why it's not always about retention in days, because that does not typically answer how much data, from a storage perspective, is being retained.

As a security engineer managing data sources in the stack, I may have new log sources, or ones that are unpredictable in their data consumption. Because of this, I would like to set a maximum storage consumption on a data stream that allows the oldest index or indices to be removed once the respective threshold in GB is reached.