[Stack Monitoring] Ingest pipeline monitoring

krisATelastic commented 5 years ago

@joshdover has hijacked this issue and re-written most of the content below as of Oct 2022

User Story and requirements

Users need end to end monitoring of ingest pipelines in order to understand which ingest tier has performance bottlenecks and root cause expensive pipelines or processors. Specifically they're looking for the following features:

Overview stats of pipelines running
- How many pipelines are configured
- How many events per second (EPS) are being processed by a pipeline
- How much CPU time is being consumed by each pipeline in the given time window (similar to APM's "impact" rating for transactions)
- Failure rate
Detailed view of a pipeline
- Similar stats as above, but on each processor in the pipeline: EPS, CPU time, failure rate

Nice to haves:

Memory usage of a pipeline and it's processors
Which node(s) a pipeline is running inside the cluster, and the inverse: which pipelines are consuming resources on a given node.

Existing metrics

Elasticsearch already exposes ingest pipeline metrics on the Nodes Stats API, including docs processed and CPU time on a pipeline and processor basis. These metrics are not currently ingested by our existing Metricbeat modules.

Proposed path forward

Stack Monitoring is in the middle of a transitional period which complicates how we make first steps here.

First, Stack Monitoring is not yet supported via Agent, only Metricbeat. This will be solved relatively soon by the Infra Monitoring UI team in https://github.com/elastic/kibana/issues/120415.

Second, there's been recent alignment and acceptance of an RFC to move towards an OTel-based solution for exposing metrics from stack components, away from Metricbeat modules. This is a great step forward that will allow teams to expose new metrics and have them visible in the UI with much less coordination across various teams and little to no UI development work required. This is still quite fresh and Elasticsearch does not yet support pushing OTel metrics to an OLTP endpoint, though it has recently added support for OTel traces.

[Done ✅] Phase 1: Add a new metricset to Metricbeat and add a dashboard to the upcoming Elasticsearch monitoring package for Elastic Agent

PRs:

We can likely get a basic solution out to help our customers quickly by building on top of the package being developed in https://github.com/elastic/kibana/issues/120415

This would involve:

Add a new metricset to Metricbeat to collect ingest_pipeline stats
Add a new ingest pipeline data stream to the package that can process these incoming docs and convert them into ECS. These would likely be similar to the document structure proposed in elastic/kibana#130078
Add two new dashboards to the package, one for ingest pipelines overall and a detailed view of a specific pipeline with more details about it's processors. These dashboard would be linked by drilldowns.
Add a UI link from Stack Monitoring to the dashboard, if the user does not have the ES package installed, prompt them to install first.

The existing metricbeat module for es monitoring already polls these stats, but doesn't ingest any of the data. Enhancing Metricbeat directly with these stats has the benefit of being able to collect this data on Cloud deployments with the native "Logs and Metics" feature. This implementation would also be directly usable by the agent package.

Phase 2: Instrument Elasticsearch ingest pipelines w/ OTel traces and push to APM Server

Elasticsearch recently added support for shipping OTel traces to internal clusters for monitoring. This is not yet ready for end-user features, but would likely be a good fit for pipeline monitoring. A similar idea is discussed in https://github.com/elastic/kibana/issues/137141 for Logstash pipelines.

This phase would be in line with the long-term solution for Platform Observability, to use the OTel SDKs to push metrics to an OTLP endpoint, like APM Server.

One helpful piece of this is that instrumentating with traces that are pushed by Elasticsearch to an APM Server does not require Cloud to upgrade to Elastic Agent at the same time.

This phase would involve:

Adding trace instrumentation to Elasticsearch ingest pipeline code, incl metrics like CPU time and bytes/docs processed.
Enabling customers to send Elasticsearch traces to their own cluster on Cloud
(optional) building some tailored dashboards for visualizing these traces

Phase 3: Add OTel metrics to Elasticsearch

In this phase we would add high-level ingest pipeline metrics that are exported via OTel, rather than the Node Stats API. These would replace the stop-gap solution introduced in phase 1 by allowing us to add more, richer metrics without requiring changes to the Stack Monitoring UI and Metricbeat modules.

This phase would involve:

Adding OTel metrics SDK to ES
Adding OTel instrumentation ingest pipelines with metrics
Allowing APM Server to route metrics to the same metric data streams used by the regular Elasticsearch monitoring package (from https://github.com/elastic/kibana/issues/120415)
- Similar to https://github.com/elastic/package-spec/issues/315
- Allows the dashboards in phase 1 to be populated by the data produced in phase 3.

Worklist - outdated

Research
- [x] https://github.com/elastic/kibana/issues/129351
Design
- [x] elastic/kibana#130078
Milestone 1
- TBD

References

https://github.com/elastic/kibana/issues/120415
(internal) https://github.com/elastic/observability-dev/issues/2055
(internal) https://github.com/elastic/observability-dev/issues/2085
Rally implementation from @b-deam (https://github.com/elastic/rally/pull/1416)
Implementation from @andrewkroh https://github.com/andrewkroh/go-ingest-node-metrics
https://github.com/elastic/beats/pull/28479 - not ingest pipelines, but at least indexing pressure stats

Original issue description

**Describe the feature:** Elasticsearch Ingest node monitoring **Describe a specific use case for the feature:** As more native tools like Beats are being pushed to Ingest nodes as ingest pipelines are tied to some of the developed modules, it would be good to expand Monitoring within Kibana to include the ingest pipelines statistics from node stats, i.e. `GET _nodes/stats?filter_path=nodes.*.ingest.*`. This responds with ``` { "nodes" : { "JokQCpQpTAmZLQD9XjwbAA" : { "ingest" : { "total" : { "count" : 0, "time_in_millis" : 0, "current" : 0, "failed" : 0 }, "pipelines" : { "xpack_monitoring_6" : { "count" : 0, "time_in_millis" : 0, "current" : 0, "failed" : 0, "processors" : [ { "script" : { "count" : 0, "time_in_millis" : 0, "current" : 0, "failed" : 0 } }, { "gsub" : { "count" : 0, "time_in_millis" : 0, "current" : 0, "failed" : 0 } } ] }, "xpack_monitoring_7" : { "count" : 0, "time_in_millis" : 0, "current" : 0, "failed" : 0, "processors" : [ ] } } } } } } ``` Logstash has pipeline monitoring, it would be good to also have the same feature for Ingest.

elasticmachine commented 5 years ago

Pinging @elastic/es-ui

elasticmachine commented 5 years ago

Pinging @elastic/stack-monitoring

cachedout commented 5 years ago

Hi @krisATelastic. Thanks for this suggestion! We've had this requested a number of times and we're definitely looking into the feasibility.

krisATelastic commented 5 years ago

That's great it's already on the radar! For housekeeping, quite happy to +1 the original issue and close this one out, if that is easiest for the team.

djmcgreal-cc commented 4 years ago

Hi. Firstly thanks for all the good work so far! It seems like a gap in 360 monitoring to not be able to get something similar from ingest nodes. Is there a workaround I'm missing? Thanks, Dan.

tadgh commented 3 years ago

Just piling on here to say I'd love to see this added. The stats API is fine for reading it, but would be nice to see it in kibana monitoring

cjcenizal commented 3 years ago

elasticmachine commented 2 years ago

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

joshdover commented 2 years ago

I've updated this issue based on the most recent information I have gathered in the past week after discussing with several folks.

cc @mukeshelastic @amitkanfer @ruflin @skearns64 @tylerperk @qhoxie who were involved in recent discussions.

joshdover commented 2 years ago

Another appealing option as highlighted to me by @miltonhultgren would be to start by tracing ingest pipelines using Elasticsearch's new tracing support which traces ships to APM Server and are viewable in the APM UI. The same solution was suggested for Logstash pipelines in https://github.com/elastic/kibana/issues/137141.

amitkanfer commented 2 years ago

@joshdover do we want to have metrics around ingest volume? (in bytes)

andrewkroh commented 2 years ago

This phase would involve:

Adding trace instrumentation to Elasticsearch ingest pipeline code, incl metrics like CPU time and bytes/docs processed.

This might be out of scope for this issue. With respect to tracing, one of the questions we would likely by trying to answer is "why is my _bulk request taking so long?" If the tracing could start at the bulk request it would help us to answer that. Potentially this would give us visibility into decompression time, how long the request waited in the write threadpool queue, json deserialization time, etc.

joshdover commented 1 year ago

A beta for phase 1 is shipping in 8.7, enabled by default for Cloud customers. Further explorations of other phases or implementations should be tracked in separate issues.

elastic / kibana