elastic / kibana


Stack Capacity Monitoring requirements #142160

Open mikecali opened 1 year ago

mikecali commented 1 year ago

Describe the feature: The ability to monitor and report the storage capacity and usage of Elasticsearch data nodes, and the ability to monitor and report ingest rates.

Describe a specific use case for the feature: As a user of the Elastic Stack, I want to monitor my storage capacity and ingest rates so that I can act on them proactively instead of being surprised by issues that cause downtime later. This is especially true for on-prem deployments, where adding capacity is not as elastic as the public cloud can provide.

The ask: Ingest rate monitoring and reporting are needed, especially when there is a sudden flood of data caused by application issues and/or developers enabling application tracing. If this is not monitored and reported, it can cause downtime of the stack, leading to data loss and frustration.

Storage monitoring and reporting are needed so that the team can handle potential growth proactively and have a dashboard to show management. Without this, it is usually a manual task that an operator performs as needed.
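
Until such a feature lands, the manual check described above can be scripted against existing Elasticsearch APIs. The sketch below is one way to do it, assuming Python 3 with the `requests` library and placeholder connection details; it polls the `_cat/allocation` API for per-node disk usage and flags nodes above a threshold. It illustrates the current workaround rather than the requested Kibana feature.

```python
# Sketch of the manual storage check described above. Assumptions: Python 3,
# the `requests` library, and placeholder connection details (ES_URL, AUTH).
# The _cat/allocation API reports per-node disk usage for data nodes.
import requests

ES_URL = "https://localhost:9200"      # placeholder: your cluster endpoint
AUTH = ("elastic", "changeme")         # placeholder credentials

def report_disk_usage(threshold_pct: float = 80.0) -> None:
    """Print disk usage per data node and flag nodes above a threshold."""
    resp = requests.get(
        f"{ES_URL}/_cat/allocation",
        params={"format": "json", "bytes": "gb"},
        auth=AUTH,
        verify=False,  # placeholder: supply a CA bundle in real use
    )
    resp.raise_for_status()
    for node in resp.json():
        # Unassigned shards show up as a row without a real node name; skip it.
        if node.get("node") in (None, "UNASSIGNED"):
            continue
        used = node.get("disk.percent")
        print(f"{node['node']}: {node['disk.used']} GB used / "
              f"{node['disk.total']} GB total ({used}%)")
        if used is not None and float(used) >= threshold_pct:
            print(f"  WARNING: {node['node']} is above {threshold_pct}% disk usage")

if __name__ == "__main__":
    report_disk_usage()
```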

elasticmachine commented 1 year ago

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

matschaffer commented 1 year ago

If it's helpful in the short term, you can build ingest rate and indexing time visualizations from the existing index stats data.

The fields are *.indexing.* on https://www.elastic.co/guide/en/beats/metricbeat/current/exported-fields-elasticsearch.html#_index_3

https://github.com/elastic/beats/blob/main/metricbeat/module/elasticsearch/_meta/README.md#index has links to both the metricbeat fields and ES API information.

For an (internal) example check out the "Indexing time per index" visualization on https://monitor.eu-west-1.aws.qa.cld.elstc.co/app/dashboards#/view/Logstash-Overview
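
As a concrete illustration of this workaround, the following sketch derives a per-minute ingest rate for a single index using a `date_histogram` plus `derivative` aggregation. It assumes Python 3 with `requests`, a `metricbeat-*` index pattern, and that the counter lives at `elasticsearch.index.total.indexing.index_total`; the exact field path may differ by Metricbeat version, so check the exported-fields page linked above.

```python
# Minimal sketch: derive docs-indexed-per-minute for one index from the
# Metricbeat `elasticsearch.index` metricset. Assumptions: Python 3 with
# `requests`, a `metricbeat-*` index pattern, and the counter field
# `elasticsearch.index.total.indexing.index_total` (verify the exact path
# against the exported-fields page for your Metricbeat version).
import requests

ES_URL = "https://localhost:9200"   # placeholder endpoint
AUTH = ("elastic", "changeme")      # placeholder credentials

def ingest_rate_per_minute(index_name: str) -> list[dict]:
    query = {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"elasticsearch.index.name": index_name}},
                    {"range": {"@timestamp": {"gte": "now-1h"}}},
                ]
            }
        },
        "aggs": {
            "per_minute": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": "1m"},
                "aggs": {
                    # index_total is a monotonically increasing counter, so
                    # take the max per bucket and then its derivative.
                    "docs_indexed": {
                        "max": {"field": "elasticsearch.index.total.indexing.index_total"}
                    },
                    "ingest_rate": {"derivative": {"buckets_path": "docs_indexed"}},
                },
            }
        },
    }
    resp = requests.post(f"{ES_URL}/metricbeat-*/_search", json=query,
                         auth=AUTH, verify=False)
    resp.raise_for_status()
    buckets = resp.json()["aggregations"]["per_minute"]["buckets"]
    return [
        {"time": b["key_as_string"], "docs_per_minute": b["ingest_rate"]["value"]}
        for b in buckets if "ingest_rate" in b  # first bucket has no derivative
    ]

if __name__ == "__main__":
    for point in ingest_rate_per_minute("my-index"):  # hypothetical index name
        print(point)
```

Conceptually this mirrors what a Lens or TSVB visualization built on the same fields would do: bucket a monotonically increasing counter over time and take the per-bucket difference.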

nicpenning commented 3 weeks ago

We often get asked what our data ingest is in GB per day, and that question is difficult to answer. For capacity planning around cloud migration or future on-prem expansion, this would be a great metric to have in the cluster overview section or on the indices page. Bonus points for seeing it per data stream on the Index Management page. It would also help to be able to aggregate the data over any time interval, such as minutes, hours, days, or weeks.

A 95th percentile of data retention (in days) would also be great, in case some data streams contain a few months of test data before ramping up to full production use.

Would love to see this!
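
One rough way to approximate the GB-per-day figure asked about above, until it is surfaced in the UI, is to track the growth of an index's on-disk size in daily buckets from the same Metricbeat index stats. The sketch below makes the same assumptions as the previous snippet, plus the field `elasticsearch.index.total.store.size_in_bytes` (again, version-dependent); note that store-size growth only approximates ingest, since merges, replicas, deletes and rollovers all move the number.

```python
# Rough sketch: approximate GB added per day for an index by tracking the
# growth of its on-disk size in daily buckets. Same assumptions as the
# previous snippet; `elasticsearch.index.total.store.size_in_bytes` is taken
# from the Metricbeat index metricset and may differ by version. Store-size
# growth only approximates ingest (merges, replicas, deletes, rollovers).
import requests

ES_URL = "https://localhost:9200"   # placeholder endpoint
AUTH = ("elastic", "changeme")      # placeholder credentials

def gb_per_day(index_name: str, days: int = 7) -> None:
    query = {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"elasticsearch.index.name": index_name}},
                    {"range": {"@timestamp": {"gte": f"now-{days}d/d"}}},
                ]
            }
        },
        "aggs": {
            "per_day": {
                "date_histogram": {"field": "@timestamp", "calendar_interval": "1d"},
                "aggs": {
                    "store_bytes": {
                        "max": {"field": "elasticsearch.index.total.store.size_in_bytes"}
                    },
                    "daily_growth": {"derivative": {"buckets_path": "store_bytes"}},
                },
            }
        },
    }
    resp = requests.post(f"{ES_URL}/metricbeat-*/_search", json=query,
                         auth=AUTH, verify=False)
    resp.raise_for_status()
    for b in resp.json()["aggregations"]["per_day"]["buckets"]:
        growth = b.get("daily_growth", {}).get("value")
        if growth is not None:
            print(f"{b['key_as_string']}: {growth / 1024 ** 3:.2f} GB added")

if __name__ == "__main__":
    gb_per_day("my-index")   # hypothetical index name
```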

miltonhultgren commented 3 weeks ago

cc @consulthys

consulthys commented 3 weeks ago

Thanks for the ping @miltonhultgren. Storage monitoring/reporting/alerting is already supported in AutoOps, and ingest rate monitoring/reporting/alerting is also being added (in the context of Elastic Cloud Serverless monitoring for now). Stay tuned...

nicpenning commented 3 weeks ago

My guess is Elastic needs it in the serverless cloud since that is how the pricing/licensing model works. 🤔

That was part of the ask here: what would our stack look like on Serverless? It's not easy to answer.