Summary

We need to monitor collector performance to ensure that the telemetry footprint is low. We can surface these metrics in the usage data, CI/tests, and for devs during development.

Impact and Concerns

Labeling as Impact: High since this ensures future scalability of our telemetry and puts a system in place to enable performance optimizations for our collection methods. This also helps reduce the number of people opting out of telemetry in cases where telemetry is causing significant spikes in resources.

Acceptance criteria

Metrics around the time it takes for each or all collectors to complete fetching the data. Metrics around the number of requests per day against the stats endpoint

Potential solutions

Monitoring collector fetch performance and report results in telemetry

log warning on a threshold in dev
list all cases where we do PIT search inside collectors
collaborate with teams to check why they need to do this search instead of an aggregation
provide tools to solve these issues
set a timeline to disable PIT search in collectors
add telemetry to track the time it is making the request

Notes

rudolf commented 2 years ago

Although slow collectors could be a sign that collection is expensive it doesn't give us the whole picture. A slow collector probably shows that this operation is expensive for Elasticsearch (i.e. an aggregation over a large amount of data), but doesn't tell us how it impacts Kibana.

In addition, we should track the Elasticsearch response length as serializing JSON has the biggest impact on Kibana performance and inefficient code that loops over an ES response is likely to also be slower the larger the response.

That can help us diagnose the following:

Is collection slow because ES is slow or Kibana is slow?
Is Kibana slow because of telemetry collection or because of another problem?
Is collector X the cause of the performance problem or is it slow because another collector caused a performance problem?

Having telemetry on telemetry can give us helpful summary details, but we loose a lot of resolution by looking at snapshots. If we instrument proxy logs we can do temporal analysis e.g. does the event loop spike after refreshing the usage collection cache?

afharo commented 2 years ago

IMO, we should aim to create this collector as simple as possible: This will help us detect the common offenders when requesting telemetry: i.e https://github.com/elastic/kibana/issues/123154

However, I agree that additional inputs will be critical. Like https://github.com/elastic/kibana/issues/122516

How does it sound if we set the scope of this issue to collect the time it takes for each collector to complete, and use other items like https://github.com/elastic/kibana/issues/122516 to understand the underlying requests and any possible event-loop delays derived from it?

afharo commented 2 years ago

122516 was done!

I'm wondering though if we should implement this as telemetry or APM transaction/spans? Ideally, we should catch and fix this before changes are released. What approach would help us best to track these metrics, and potentially fix them before we release the offending version?

What do you think?

Bamieh commented 2 years ago

@afharo yea it is reasonable to have a way to catch any issues before the release however grabbing telemetry performance from the real world is invaluable and will give us deep insights and allows us to be proactive catching niche issues before they reach the levels of affecting the average clusters

elastic / kibana