elastic / curator

Curator: Tending your Elasticsearch indices

Manual ILM advance based on disk usage (per tier) #1692

Open untergeek opened 1 year ago

untergeek commented 1 year ago

The following is from the Elastic Discuss forum

This is a feature request for Curator.

I know that Elastic's official stance is that clusters should be sized for retention time requirements, and I know the answer to exhausting disk space is to enable automatic scaling in Elastic Cloud. Because of this, Elastic appears resistant to adding the ability to trigger an ILM phase transition based on disk usage. That's why I believe there's room for this feature in Curator.

Consider that Curator currently has the ability to delete the oldest indices when disk usage hits a specified level. That is the equivalent of ILM's Delete phase, except that ILM cannot trigger the transition by disk usage, only by index age. By using disk usage, one would be able to extend retention times opportunistically by fully utilizing the available disk space, rather than leaving space on the table when ingestion is lower than expected. It also acts as a fallback to avoid running out of space and blocking ingestion completely when ingestion is higher than expected and auto-scaling is not enabled. In most cases, I believe continuing to ingest current data takes precedence over maintaining the oldest indices.

In a multi-tier environment, the warm nodes, for example, allow speedier queries, but it's often not critical to maintain exactly 7 days of warm, and dropping to 6 when space is exhausted is fine. Similarly, if warm nodes are slightly oversized, retention could be extended to 8 or 9 days rather than forcing a move to cold at exactly 7 days, as long as it doesn't risk disk exhaustion. Plus, if cold or frozen is much larger, a day or two of high ingestion rate may not require more storage at that tier, but could exhaust warm node space if nothing is done. Such a spike in ingestion would require either a temporary growth in warm nodes (auto-scale) or an earlier transition from warm to cold for a few days (this feature).

If the ability to transition phases based on disk usage will never be added to ILM, then Curator could fill this role by monitoring a tier's total disk usage and triggering an early transition to the next tier when a threshold is passed.

I could see Curator following this process:

  1. Check disk usage on all nodes holding shards of matching indices at a matching ILM phase; do nothing if none exceed a specific threshold.
  2. Sort matching indices by age. If any shards of the oldest index are already relocating, do nothing.
  3. Use the "POST _ilm/move/" API to move the oldest index from its current ILM phase/action/name to the next phase as specified.
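
The three steps above could be sketched roughly as follows. This is only an illustration of the decision logic, not existing Curator code: the `Node` and `Index` classes, `pick_index_to_advance`, and `build_move_request` are hypothetical names, and the disk usage, phase, age, and relocation values are assumed to have been gathered from the cluster beforehand. The request body follows my understanding of the ILM move-to-step API; some Elasticsearch versions may also require `action` and `name` in `next_step`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    name: str
    disk_used_percent: float  # assumed to be collected from the cluster already


@dataclass
class Index:
    name: str
    phase: str          # current ILM phase, e.g. "warm"
    age_days: float
    nodes: List[str]    # names of the nodes holding this index's shards
    relocating: bool    # True if any shard is currently relocating


def pick_index_to_advance(nodes: List[Node], indices: List[Index],
                          phase: str, threshold_percent: float) -> Optional[str]:
    """Return the name of the index to move early, or None to do nothing."""
    in_phase = [i for i in indices if i.phase == phase]
    if not in_phase:
        return None

    # Step 1: only act if a node holding shards of a matching index
    # exceeds the disk usage threshold.
    used = {n.name: n.disk_used_percent for n in nodes}
    holding = {node_name for i in in_phase for node_name in i.nodes}
    if not any(used.get(n, 0.0) > threshold_percent for n in holding):
        return None

    # Step 2: take the oldest matching index; back off if it is already relocating.
    oldest = max(in_phase, key=lambda i: i.age_days)
    if oldest.relocating:
        return None

    # Step 3 is left to the caller: issue the ILM move request for this index.
    return oldest.name


def build_move_request(index_name: str, current_step: Dict[str, str],
                       next_phase: str) -> Tuple[str, dict]:
    """Build the path and body for a POST to the ILM move API (assumed shape)."""
    body = {"current_step": current_step, "next_step": {"phase": next_phase}}
    return f"_ilm/move/{index_name}", body
```

Running this on every Curator invocation would move at most one index per run, which keeps a spike from emptying a whole tier at once.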

Safeguards:

One could use this ability in a couple of ways:

Because Elastic Cloud's granularity is powers of 2, if, for example, three 8G warm nodes hold 6 days of data but I want 7, my only options are to decrease my expectations and live with 6 days, or double node size to 16G and have 12 days of space available, whether or not it's used. Plus, if I do go with 6 (or increase to 12), I'm now living on the edge where any ingest spike could cause me to hit watermarks or exhaust disk space.
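
To make the arithmetic explicit, here is a small sketch. It assumes retention scales linearly with total node size and that daily ingestion is constant; the `retention_days` helper is hypothetical and the sizes are in the same "G" units used above.

```python
def retention_days(node_count: int, node_size: float, usage_per_day: float) -> float:
    """Days of data that fit, assuming capacity scales linearly with node size."""
    return node_count * node_size / usage_per_day


# Three 8G nodes hold 6 days, so one day consumes 3 * 8 / 6 = 4 units.
usage_per_day = 3 * 8 / 6

current = retention_days(3, 8, usage_per_day)    # 6.0 days
doubled = retention_days(3, 16, usage_per_day)   # 12.0 days
needed_for_7_days = 7 * usage_per_day            # 28.0 units, versus 24 available

print(current, doubled, needed_for_7_days)
```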

This feature would allow me to fully utilize the nodes I have without enabling the fiscally-scary option of auto-scaling. Were I to experience an ingestion spike, Curator could start transitions a little sooner and certain queries might take a few milliseconds longer for a few days, but no auto-scaling would be needed and ingestion would not stop due to disk exhaustion.

What do you think? Do you see the utility in such a feature? Or is there an effort underway to add this ability to ILM soon?