Closed lucabelluccini closed 4 months ago
Pinging @elastic/es-data-management (Team:Data Management)
Adding context for the discussion that led to these settings: https://github.com/elastic/elasticsearch/issues/81839#issuecomment-1010030569
The main problem is we set a rollover with max_age to 3d (hardcoded) and we then define a delete phase derived from the existing setting xpack.stack.monitoring.history.duration.
If a user used to have a xpack.stack.monitoring.history.duration of 3 days, they would end up keeping data for 6 days instead of the expected 3 (except if max_primary_shard_size is reached, then it would be less).
xpack.stack.monitoring.history.duration is not a settable configuration option in the Elasticsearch settings. This is just a pattern variable that we expand when loading the policy data in the registry at start up. The value is overridable with the similarly named xpack.monitoring.history.duration property, but this setting is deprecated and the retention selection is to set a lower bound for backwards compatibility purposes more than to be a hard retention duration.
I agree though, the fact that the final retention value is a little bit longer than the configured value should be documented further. I can't say for sure whether a 1 day rollover is preferable for all deployments though. The initial change discussion focused on a 50gb rollover and the 3 day max age was added to smooth out the retention rate in order to keep small clusters from slowly accumulating to the rollover size.
The main problem is we set a rollover with max_age to 3d (hardcoded) and we then define a delete phase derived from the existing setting xpack.stack.monitoring.history.duration. If a user used to have a xpack.stack.monitoring.history.duration of 3 days, they would end up keeping data for 6 days instead of the expected 3 (except if max_primary_shard_size is reached, then it would be less).
xpack.stack.monitoring.history.duration is not a settable configuration option in the Elasticsearch settings. This is just a pattern variable that we expand when loading the policy data in the registry at start up. The value is overridable with the similarly named xpack.monitoring.history.duration property, but this setting is deprecated and the retention selection is to set a lower bound for backwards compatibility purposes more than to be a hard retention duration.
Looking at the code, this setting is actually derived from the "old" history setting: https://github.com/elastic/elasticsearch/blob/98dc0eb1e67479835326daee3cf81fb80ba46881/x-pack/plugin/monitoring/src/main/java/org/elasticsearch/xpack/monitoring/MonitoringTemplateRegistry.java#L245
Which is at https://github.com/elastic/elasticsearch/blob/255bf5056bdbae9cd594f7c3e965b96d33087a39/x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/monitoring/MonitoringField.java#L31 (xpack.monitoring.history.duration).
In fact, this issue specifically targets users who migrated from the "old" to the new stack monitoring indices.
I agree though, the fact that the final retention value is a little bit longer than the configured value should be documented further. I can't say for sure whether a 1 day rollover is preferable for all deployments though. The initial change discussion focused on a 50gb rollover and the 3 day max age was added to smooth out the retention rate in order to keep small clusters from slowly accumulating to the rollover size.
I also agree the defaults chosen generate fewer shards.
For very small deployments, this can be critical as it can actually lead to +3 days of retention by default. I fully agree users shouldn't be on the "edge" in terms of storage, but this gives a false sense of "isofunctionality" with the past.
Internal monitoring has been deprecated for quite a while, and we're no longer doing any active development on it. I'm going to close this issue.
Elasticsearch Version
8.x
Installed Plugins
No response
Java Version
bundled
OS Version
n/a
Problem Description
The default ILM policy for monitoring-8 data streams is incorrect.
https://github.com/elastic/elasticsearch/blob/79a59f470bc5641999ddc7a4bf7e5396958c9844/x-pack/plugin/core/src/main/resources/monitoring-mb-ilm-policy.json
The main problem is we set a rollover with max_age to 3d (hardcoded) and we then define a delete phase derived from the existing setting xpack.stack.monitoring.history.duration.
If a user used to have a xpack.stack.monitoring.history.duration of 3 days, they would end up keeping data for 6 days instead of the expected 3 (except if max_primary_shard_size is reached, then it would be less).
I would propose to switch to a max_age of 1d for the rollover action. It will produce more indices, but it would lead to a similar behavior of "before datastreams".
Also, I would push for updating the documentation on https://github.com/elastic/elasticsearch/issues/85873 and adding a banner mentioning that the data will be kept for N days + 1 day (the currently written index). So users have to expect an extra day worth of monitoring.
Also, be aware the monitoring index template for monitoring-8 does not have auto_expand 0-1, so the indices can become stuck, unable to move to the warm phase, if a user is on a single data node (as it is unable to allocate the replica). It is a separate issue (https://github.com/elastic/kibana/issues/130885).
Steps to Reproduce
Logs (if relevant)
No response