elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

ILM Policy for monitoring-8 indices is keeping data for way more time than expected #86918

Closed lucabelluccini closed 4 months ago

lucabelluccini commented 2 years ago

Elasticsearch Version

8.x

Installed Plugins

No response

Java Version

bundled

OS Version

n/a

Problem Description

The default ILM policy for monitoring-8 data streams is incorrect.

https://github.com/elastic/elasticsearch/blob/79a59f470bc5641999ddc7a4bf7e5396958c9844/x-pack/plugin/core/src/main/resources/monitoring-mb-ilm-policy.json

The main problem is that we set a rollover with a max_age of 3d (hardcoded) and then define a delete phase derived from the existing setting xpack.stack.monitoring.history.duration.

If a user used to have a xpack.stack.monitoring.history.duration of 3 days, they would end up keeping data for 6 days instead of the expected 3 (unless max_primary_shard_size is reached first, in which case it would be less).
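The retention arithmetic above can be sketched as follows (a minimal illustration; the variable names are mine, not actual Elasticsearch settings objects):

```python
from datetime import timedelta

# Values as described above: the rollover max_age is hardcoded to 3d in
# monitoring-mb-ilm-policy.json, and the delete phase is derived from
# xpack.stack.monitoring.history.duration (here assumed to be 3 days).
rollover_max_age = timedelta(days=3)
history_duration = timedelta(days=3)

# The delete phase's clock starts at rollover, so data written right after
# a backing index is created can sit in that index for up to rollover_max_age
# before the delete countdown even begins.
worst_case_retention = history_duration + rollover_max_age
print(worst_case_retention.days)  # 6
```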

I would propose switching to a max_age of 1d for the rollover action. It would produce more indices, but it would lead to behavior similar to the pre-data-stream monitoring indices.

Also, I would push for updating the documentation on https://github.com/elastic/elasticsearch/issues/85873 and adding a banner mentioning that the data will be kept for N days + 1 day (the currently written index), so users should expect up to an extra day's worth of monitoring data.
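Under a 1d rollover, the same arithmetic gives the "N days + 1 day" figure (again a sketch with illustrative names, not Elasticsearch code):

```python
from datetime import timedelta

# With a daily rollover, the currently written (not yet rolled over) index
# can hold at most one day of data on top of the configured N days.
proposed_rollover_max_age = timedelta(days=1)
history_duration = timedelta(days=3)  # the configured "N days"

worst_case_retention = history_duration + proposed_rollover_max_age
print(worst_case_retention.days)  # 4
```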

Also, be aware the monitoring index template for monitoring-8 does not set auto_expand_replicas: 0-1, so the indices can become stuck, unable to move to the warm phase, if a user is on a single data node (as the replica cannot be allocated). This is tracked as a separate issue (https://github.com/elastic/kibana/issues/130885).

Steps to Reproduce

Logs (if relevant)

No response

elasticmachine commented 2 years ago

Pinging @elastic/es-data-management (Team:Data Management)

jbaiera commented 2 years ago

Adding context for the discussion that led to these settings: https://github.com/elastic/elasticsearch/issues/81839#issuecomment-1010030569

jbaiera commented 2 years ago

> The main problem is we set a rollover with max_age to 3d (hardcoded) and we then define a delete phase derived from the existing setting xpack.stack.monitoring.history.duration.
>
> If a user used to have a xpack.stack.monitoring.history.duration of 3 days, they would end up keeping data for 6 days instead of the expected 3 (except if max_primary_shard_size is reached, then it would be less).

xpack.stack.monitoring.history.duration is not a settable configuration option in the Elasticsearch settings. It is just a pattern variable that we expand when loading the policy data in the registry at startup. The value can be overridden with the similarly named xpack.monitoring.history.duration property, but that setting is deprecated, and the retention value is meant to set a lower bound for backwards-compatibility purposes rather than to be a hard retention duration.
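The expansion step can be illustrated roughly like this (a Python sketch; the real registry code is Java, and the substitution mechanics below are illustrative only, not the actual implementation):

```python
# Hedged sketch: the policy JSON ships with a pattern variable that the
# template registry expands at load time. The variable name comes from the
# thread above; the JSON shape and substitution loop are illustrative.
policy_json = '{"delete": {"min_age": "${xpack.stack.monitoring.history.duration}"}}'

variables = {"xpack.stack.monitoring.history.duration": "3d"}

expanded = policy_json
for name, value in variables.items():
    expanded = expanded.replace("${" + name + "}", value)

print(expanded)  # {"delete": {"min_age": "3d"}}
```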

I agree, though, that the fact that the final retention is a little longer than the configured value should be documented further. I can't say for sure whether a 1 day rollover is preferable for all deployments. The initial change discussion focused on a 50gb rollover, and the 3 day max age was added to smooth out the retention rate and keep small clusters from slowly accumulating data up to the rollover size.

lucabelluccini commented 2 years ago

> The main problem is we set a rollover with max_age to 3d (hardcoded) and we then define a delete phase derived from the existing setting xpack.stack.monitoring.history.duration. If a user used to have a xpack.stack.monitoring.history.duration of 3 days, they would end up keeping data for 6 days instead of the expected 3 (except if max_primary_shard_size is reached, then it would be less).
>
> xpack.stack.monitoring.history.duration is not a settable configuration option in the Elasticsearch settings. This is just a pattern variable that we expand when loading the policy data in the registry at start up. The value is overridable with the similarly named xpack.monitoring.history.duration property, but this setting is deprecated and the retention selection is to set a lower bound for backwards compatibility purposes more than to be a hard retention duration.

Looking at the code, this setting is actually derived from the "old" history setting: https://github.com/elastic/elasticsearch/blob/98dc0eb1e67479835326daee3cf81fb80ba46881/x-pack/plugin/monitoring/src/main/java/org/elasticsearch/xpack/monitoring/MonitoringTemplateRegistry.java#L245, which points to https://github.com/elastic/elasticsearch/blob/255bf5056bdbae9cd594f7c3e965b96d33087a39/x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/monitoring/MonitoringField.java#L31 (xpack.monitoring.history.duration).

In fact, this issue exactly targets users who migrated from the "old" stack monitoring indices to the new ones.

> I agree though, the fact that the final retention value is a little bit longer than the configured value should be documented further. I can't say for sure whether a 1 day rollover is preferable for all deployments though. The initial change discussion focused on a 50gb rollover and the 3 day max age was added to smooth out the retention rate in order to keep small clusters from slowly accumulating to the rollover size.

I also agree the chosen defaults generate fewer shards.

For very small deployments, this can be critical, as it can lead to up to 3 extra days of retention by default. I fully agree users shouldn't be on the "edge" in terms of storage, but this gives a false sense of functional equivalence with the past.

dakrone commented 4 months ago

Internal monitoring has been deprecated for quite a while, and we're no longer doing any active development on it. I'm going to close this issue.