elastic / uptime

This project includes resources and general issue tracking for the Elastic Uptime solution

Separate Index lifecycle policies for each dataset #462

Closed paulb-elastic closed 2 years ago

paulb-elastic commented 2 years ago

Following on from the spike uptime#453, this is the implementation issue to get this feature added (understanding that the platform does not currently allow separation based on namespace, but may do in the future).

As a user of Synthetic Monitoring (lightweight and browser), I want to be able to configure different life cycles for the Elasticsearch result data from my monitors, so that I can balance the cost of storage against the value of that data over longer periods.

ACs:

Each data set should have a delete phase with the following specifications

HTTP lightweight checks (synthetics-http): 365 days

ICMP lightweight checks (synthetics-icmp): 365 days

TCP lightweight checks (synthetics-tcp): 365 days

Browser checks - core data (synthetics-browser): 365 days

Browser checks - network data (synthetics-network): 14 days

Browser checks - screenshots: 14 days

Reference to the docs task to clarify this in the documentation

joshdover commented 2 years ago

The platform automatically configures separate ILM policies for each type of data collected from Synthetic monitoring (lightweight and browser)

From my understanding this should be implemented by adding new ILM policies to the synthetics package, and there should not be any custom Kibana code needed to support this.

One thing to note is the current issues we have around privileges for ILM policies, which will likely require some privilege changes to Elasticsearch's kibana_system role definition. Please see https://github.com/elastic/package-spec/issues/293 for an explanation of the overall problem and https://github.com/elastic/elasticsearch/pull/85085 as an example of how to do this.

paulb-elastic commented 2 years ago

Thank you @joshdover for the additional insight

dominiqueclarke commented 2 years ago

@drewpost @paulb-elastic

We need default storage tier and retention period definitions to begin this ticket, and to evaluate what permissions, if any, the kibana_system user will need to support this feature.

drewpost commented 2 years ago

Sorry @dominiqueclarke - This has been documented but it appears it didn't get into this ticket. Here are the defaults:

HTTP lightweight checks (synthetics-http): 365 days

ICMP lightweight checks (synthetics-icmp): 365 days

TCP lightweight checks (synthetics-tcp): 365 days

Browser checks - core data (synthetics-browser): 365 days

Browser checks - network data (synthetics-network): 14 days

Browser checks - screenshots: 14 days

In terms of the storage tiers, do we have any understanding of the real-world speed tradeoffs here?

dominiqueclarke commented 2 years ago

@paulb-elastic

While Drew's requirements are clear, there are many ways we can implement a 14 day retention period for network data and screenshots.

When configuring phases, a max age or max size of the write index is specified. If either the max age or the max size is reached for the main write index, a rollover occurs, a new write index is created, and the old write index becomes a read index. It's from this point of rollover that the delete phase countdown begins.

For example, let's take the default hot phase rollover requirements (the index is 30 days old or any primary shard reaches 50 gigabytes) and assume a delete phase timeline of 14 days as specified by Drew. This means that data will be deleted after a maximum of 44 days (up to 30 days until rollover, plus 14 days until deletion), but it could be sooner if a primary shard reaches 50 gigabytes before the 30 days are up.
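
The arithmetic in that example can be sketched in a few lines of Python. This is an illustration only: `worst_case_age_days` is a hypothetical helper, not part of Elasticsearch or the synthetics package, and it models only the age-based rollover (a size-based rollover would shorten the figure, and ILM's periodic polling can lengthen it slightly).

```python
def worst_case_age_days(hot_max_age_days: int, delete_min_age_days: int) -> int:
    """Upper bound on how old a document can be when its index is deleted.

    A document written right after the write index was created waits up to
    hot_max_age_days for the rollover, then delete_min_age_days more, because
    the delete phase countdown starts at rollover.
    """
    return hot_max_age_days + delete_min_age_days


# Default hot phase (roll over at 30 days) combined with a 14-day delete phase.
print(worst_case_age_days(30, 14))  # 44
```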

The question becomes: what combination of hot phase and delete phase configuration should we employ to get close to our target of deleting data after 14 days?

We could, for example, use any number of combinations:

Option  Hot: max age  Hot: max size  Delete: after
1       1 day         50 gigabytes   14 days
2       1 day         50 gigabytes   13 days
3       3 days        50 gigabytes   11 days

etc etc.
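
One way to compare such combinations is to compute the range of ages a document can reach under each of them. The following is a rough sketch using the three example combinations above; the `Combo` type and its field names are made up for illustration, and only the age-based rollover is modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Combo:
    hot_max_age_days: int     # rollover happens at this index age at the latest
    delete_min_age_days: int  # counted from the rollover, not from index creation

    def worst_case_age_days(self) -> int:
        # Oldest document: written just after the index was created.
        return self.hot_max_age_days + self.delete_min_age_days

    def best_case_age_days(self) -> int:
        # Newest document: written just before the rollover.
        return self.delete_min_age_days


TARGET_DAYS = 14  # retention target for network data and screenshots

combos = [Combo(1, 14), Combo(1, 13), Combo(3, 11)]
for combo in combos:
    print(
        combo,
        "keeps documents for",
        combo.best_case_age_days(),
        "to",
        combo.worst_case_age_days(),
        "days; target is",
        TARGET_DAYS,
    )
```

Because a whole index is deleted at once, every combination gives a range rather than a single retention figure: the longer the hot phase, the wider the gap between the newest and oldest documents in the deleted index.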

@paulb-elastic thoughts?

paulb-elastic commented 2 years ago

From a time perspective, it seems having a hot phase of 14 days and then a delete phase of 0 days would meet @drewpost’s default requirement (I haven’t tried it in action, but I can certainly build a policy with this configuration). That covers the time-based side.

You raise an interesting point about the size element too. Does this have to be set? In the UI, I seem to be able to leave it unset (the same goes for the max number of documents, for example). This would probably be the best default. I don’t know if we have a good handle on how large the size limit would need to be so that it is never reached and only the age limit applies. Ideally we’d not set it (these are just the defaults too; users can still customise them for their preferred configuration).
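
For reference, the time-only configuration described in this comment might look roughly like the request body built below. This is a sketch, not the policy that shipped: the layout follows the Elasticsearch ILM policy format, the values are simply the ones discussed here, and `build_policy` is a hypothetical helper.

```python
import json
from typing import Optional


def build_policy(hot_max_age: str, delete_min_age: str,
                 max_primary_shard_size: Optional[str] = None) -> dict:
    """Build an ILM policy body with a hot-phase rollover and a delete phase."""
    rollover = {"max_age": hot_max_age}
    if max_primary_shard_size is not None:
        # Only include the size condition when one is explicitly requested.
        rollover["max_primary_shard_size"] = max_primary_shard_size
    return {
        "policy": {
            "phases": {
                "hot": {"actions": {"rollover": rollover}},
                "delete": {"min_age": delete_min_age, "actions": {"delete": {}}},
            }
        }
    }


# Roll over after 14 days, delete as soon as the index has rolled over,
# and leave the size condition unset.
policy = build_policy("14d", "0d")
print(json.dumps(policy, indent=2))
```

A body like this would be sent with `PUT _ilm/policy/<policy name>`. One consequence of this shape is that documents written shortly before the rollover are deleted together with the rest of the index, so individual documents are kept for anywhere up to 14 days rather than exactly 14.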

awahab07 commented 2 years ago

Post FF Testing

Policy                             Hot phase: max age (days)  Hot phase: max index size (GB)  Delete phase: max age (days)
browser-default_policy             30                         50                              365
browser_network-default_policy     1                          default                         14
browser_screenshot-default_policy  1                          default                         14
http-default_policy                30                         50                              365
tcp-default_policy                 30                         50                              365
browser-default_policy             30                         50                              365