elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Observability dependencies view broken for >= 90 days of historical data #178491

Closed (cachedout closed this issue 4 months ago)

cachedout commented 6 months ago

Kibana version: Serverless build 03/12/24
Elasticsearch version: Serverless build 03/12/24
Server OS version: Serverless build 03/12/24
Browser version: N/A
Browser OS version: N/A
Original install method (e.g. download page, yum, from source, etc.): Serverless build 03/12/24

Describe the bug: When using the Observability test cluster for Serverless QA and selecting 90 days of historical data, an error about too many buckets is displayed.

Steps to reproduce:

  1. Using QA o11y test cluster
  2. Go to Applications -> Dependencies
  3. Select 90 days of historical data

Expected behavior: No error

Screenshots (if relevant):

Screenshot 2024-03-12 at 12 59 33

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

Any additional context:

elasticmachine commented 6 months ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

kpatticha commented 6 months ago

Related ticket: https://github.com/elastic/kibana/issues/161239

neptunian commented 5 months ago

In https://github.com/elastic/kibana/issues/161239, we changed the composite size to 1500 with no pagination. However, over a wide enough time range with 1500 unique top-level buckets (service name, dependency name), it is still easy to exceed the default Elasticsearch limit of 65,536 buckets. In the query below the date histogram interval is daily (86400s) for roughly a 3-month time range, so 1500 (service/dependency pairs) * 90 (days) = 135,000 buckets, not counting the extra buckets from the event.outcome field (up to 3 more per day per pair).
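As a rough back-of-the-envelope check (a sketch only; `estimateBuckets` and its parameter names are made up for illustration, not Kibana code), the bucket count grows multiplicatively with the composite size, the number of histogram intervals in the range, and the event.outcome terms:

```
// Hypothetical estimate of the buckets produced by the query below.
const DEFAULT_MAX_BUCKETS = 65_536; // Elasticsearch search.max_buckets default

function estimateBuckets(opts: {
  compositePairs: number;   // unique (service.name, span.destination.service.resource) pairs
  rangeMs: number;          // selected time range in milliseconds
  fixedIntervalMs: number;  // date_histogram fixed_interval
  outcomeTerms?: number;    // extra buckets per interval from the event.outcome terms agg
}): number {
  const intervals = Math.ceil(opts.rangeMs / opts.fixedIntervalMs);
  const perPair = intervals * (1 + (opts.outcomeTerms ?? 0));
  return opts.compositePairs * perPair;
}

// 90 days at a daily interval with 1500 pairs:
const ninetyDays = 90 * 24 * 60 * 60 * 1000;
const estimate = estimateBuckets({
  compositePairs: 1500,
  rangeMs: ninetyDays,
  fixedIntervalMs: 24 * 60 * 60 * 1000,
});
console.log(estimate, estimate > DEFAULT_MAX_BUCKETS);
// => 135000 true, already over the limit before counting event.outcome buckets
```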

The date histogram creates buckets per day over the 3-month time range:

```
"timeseries": {
  "date_histogram": {
    "field": "@timestamp",
    "fixed_interval": "86400s",
    "extended_bounds": {
      "min": 1706629793149,
      "max": 1714488593149
    }
  },
```
Full query:

```
{
  "track_total_hits": true,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "terms": { "processor.event": ["metric"] } },
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "filter": [{ "term": { "metricset.name": "service_destination" } }],
                  "must_not": { "terms": { "metricset.interval": ["10m", "60m"] } }
                }
              }
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "filter": [{ "term": { "metricset.name": "service_destination" } }],
                  "must_not": { "terms": { "metricset.interval": ["10m", "60m"] } }
                }
              },
              {
                "range": {
                  "@timestamp": { "gte": 1706629793149, "lte": 1714488593149, "format": "epoch_millis" }
                }
              },
              {
                "bool": {
                  "must_not": [
                    { "terms": { "agent.name": ["js-base", "rum-js", "opentelemetry/webjs"] } }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "connections": {
      "composite": {
        "size": 10000,
        "sources": [
          { "serviceName": { "terms": { "field": "service.name" } } },
          { "dependencyName": { "terms": { "field": "span.destination.service.resource" } } }
        ]
      },
      "aggs": {
        "sample": {
          "top_metrics": {
            "size": 1,
            "metrics": [
              { "field": "service.environment" },
              { "field": "agent.name" },
              { "field": "span.type" },
              { "field": "span.subtype" }
            ],
            "sort": { "@timestamp": "desc" }
          }
        },
        "total_latency_sum": { "sum": { "field": "span.destination.service.response_time.sum.us" } },
        "total_latency_count": { "sum": { "field": "span.destination.service.response_time.count" } },
        "timeseries": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "86400s",
            "extended_bounds": { "min": 1706629793149, "max": 1714488593149 }
          },
          "aggs": {
            "latency_sum": { "sum": { "field": "span.destination.service.response_time.sum.us" } },
            "count": { "sum": { "field": "span.destination.service.response_time.count" } },
            "event.outcome": {
              "terms": { "field": "event.outcome" },
              "aggs": {
                "count": { "sum": { "field": "span.destination.service.response_time.count" } }
              }
            }
          }
        }
      }
    }
  }
}
```

Here are some options:

1. Use larger time intervals for wider time ranges, so the date histogram creates fewer buckets (see the sketch after this list).
2. Separate the histogram timeseries buckets from the service/dependency query, so timeseries data is only fetched for the dependencies currently visible in the table.
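A minimal sketch of what the first option could look like (hypothetical; `pickFixedInterval` and the tier values are assumptions for illustration, not the actual APM bucket-size helper):

```
// Hypothetical interval picker: walk coarser fixed_interval candidates until the
// estimated bucket count fits under a budget. The tier values are illustrative only.
const CANDIDATE_INTERVALS_SECONDS = [
  60 * 60,            // 1h
  24 * 60 * 60,       // 1d
  7 * 24 * 60 * 60,   // 7d
  30 * 24 * 60 * 60,  // 30d
  90 * 24 * 60 * 60,  // ~3 months, for multi-year ranges
];

function pickFixedInterval(rangeMs: number, pairs: number, bucketBudget = 65_536): string {
  for (const seconds of CANDIDATE_INTERVALS_SECONDS) {
    const intervals = Math.ceil(rangeMs / (seconds * 1000));
    if (pairs * intervals <= bucketBudget) {
      return `${seconds}s`; // value for date_histogram.fixed_interval
    }
  }
  // Fall back to the coarsest interval; the query may still trip the limit.
  return `${CANDIDATE_INTERVALS_SECONDS[CANDIDATE_INTERVALS_SECONDS.length - 1]}s`;
}

// 90 days with 1500 pairs lands on the 7-day interval (1500 * 13 = 19,500 buckets).
pickFixedInterval(90 * 24 * 60 * 60 * 1000, 1500);
```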

neptunian commented 5 months ago

Talked with @smith and we're going to go with the first option, using larger time intervals, which means fewer buckets.

neptunian commented 4 months ago

@chrisdistasio @paulb-elastic

There's a PR open here: https://github.com/elastic/kibana/pull/182884. This fix does not cover very large time ranges, like 4+ years with the maximum number of dependencies (1500). My thought is that there should be a balance between how many buckets we try to stay under for any time range and letting the user choose to increase their bucket limit; we can advise the user to increase their default max buckets in this case. If we feel we should always aim to stay under the max bucket limit, even in a scenario spanning several years, I can do that. Currently the largest time interval is 30 days, which can still produce too many buckets for something like 4 years, so we would need to switch to something like 3 months. If we want to do this I'd prefer to do it in a separate PR, as it will require changes to a function used all over the APM UI and more in-depth testing. The better alternative would be to implement the second option, "Separate histogram timeseries buckets from service".
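For reference, a rough sketch of what that second option could look like (an assumption, not the actual PR or APM code; the field names follow the query above, but the helper name and shape are invented): first fetch the current page of service/dependency rows without the date histogram, then request timeseries data only for those pairs.

```
// Hypothetical second-phase request: timeseries only for the dependencies visible
// on the current page (e.g. 25 rows), instead of all 1500 composite buckets.
function timeseriesForPage(
  pairs: Array<{ serviceName: string; dependencyName: string }>,
  start: number,
  end: number,
  fixedInterval: string
) {
  return {
    size: 0,
    query: {
      bool: {
        filter: [
          { range: { '@timestamp': { gte: start, lte: end, format: 'epoch_millis' } } },
          // One clause per visible row keeps the top-level bucket count at page size.
          {
            bool: {
              should: pairs.map((p) => ({
                bool: {
                  filter: [
                    { term: { 'service.name': p.serviceName } },
                    { term: { 'span.destination.service.resource': p.dependencyName } },
                  ],
                },
              })),
              minimum_should_match: 1,
            },
          },
        ],
      },
    },
    aggs: {
      connections: {
        composite: {
          size: pairs.length,
          sources: [
            { serviceName: { terms: { field: 'service.name' } } },
            { dependencyName: { terms: { field: 'span.destination.service.resource' } } },
          ],
        },
        aggs: {
          timeseries: {
            date_histogram: {
              field: '@timestamp',
              fixed_interval: fixedInterval,
              extended_bounds: { min: start, max: end },
            },
            aggs: {
              latency_sum: { sum: { field: 'span.destination.service.response_time.sum.us' } },
              count: { sum: { field: 'span.destination.service.response_time.count' } },
            },
          },
        },
      },
    },
  };
}
```

With 25 rows per page and a daily interval over 90 days, that is roughly 25 * 90 = 2,250 histogram buckets, well under the default limit.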

paulb-elastic commented 4 months ago

Thanks @neptunian, that seems a good and reasonable approach (@chrisdistasio, do you see a need for such long time periods?).

@neptunian, if the user does select a 4+ year range, what's the user experience? Do they still end up with the too_many_buckets_exception? If we wanted to detect that exception and show something like "please select a shorter time period", does that also fall into the bigger piece of work you mentioned?

neptunian commented 4 months ago

> if the user does select a 4+ year range, what's the user experience? Do they still end up with the too_many_buckets_exception? If we wanted to detect that exception and show something like "please select a shorter time period", does that also fall into the bigger piece of work you mentioned?

Yes, they will still get the error, with a "failed to fetch" in the table. With the "Separate histogram timeseries buckets from service" option I mentioned, they would be unlikely to get the error, because we'd only fetch timeseries data for the dependencies they are currently looking at (the table defaults to 25 items per page, and we could make that lower). A significant part of the problem is that we fetch timeseries data for ALL of their services, even though they can't look at all of them at once anyway.

I think the current error that tells them to adjust their settings so they can get more buckets is helpful and we should keep it, but I understand they don't know exactly why it happened or what they can do to remedy it other than raising their bucket limit, so adding that kind of messaging could help: "There is too much data being returned. Adjust your cluster bucket size (same as the current messaging about adjusting the bucket limit) or try narrowing your time range." This message comes from Elasticsearch, so we'd have to parse it and append extra messaging suggesting a narrower time range. It would show up for all the ES queries in APM that hit this exception, and may not be helpful in contexts where the time range is not a significant contributor to the bucket count.
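A sketch of what detecting the exception and appending the extra hint could look like (hypothetical; the real error plumbing in Kibana differs and the parsed error shape here is an assumption, though `too_many_buckets_exception` and `search.max_buckets` are the actual Elasticsearch names):

```
// Hypothetical: inspect a failed ES response and append a time-range hint when the
// failure is a too_many_buckets_exception. The error-body shape is an assumption.
interface EsErrorCause {
  type?: string;
  reason?: string;
  caused_by?: EsErrorCause;
}

function isTooManyBuckets(cause?: EsErrorCause): boolean {
  if (!cause) return false;
  return cause.type === 'too_many_buckets_exception' || isTooManyBuckets(cause.caused_by);
}

function userFacingMessage(cause: EsErrorCause): string {
  const base = cause.reason ?? 'The query returned too much data.';
  return isTooManyBuckets(cause)
    ? `${base} Try narrowing the time range, or raise the cluster's search.max_buckets setting.`
    : base;
}
```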