elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Observability dependencies view broken for >= 90 days of historical data #178491

Closed (cachedout closed this issue 4 months ago)

cachedout commented 6 months ago

Kibana version: Serverless build 03/12/24
Elasticsearch version: Serverless build 03/12/24
Server OS version: Serverless build 03/12/24
Browser version: N/A
Browser OS version: N/A
Original install method (e.g. download page, yum, from source, etc.): Serverless build 03/12/24

Describe the bug: When using the Observability test cluster for Serverless QA and selecting 90 days of historical data, an error about too many buckets is displayed.

Steps to reproduce:

  1. Using QA o11y test cluster
  2. Go to Applications -> Dependencies
  3. Select 90 days of historical data

Expected behavior: No error

Screenshots (if relevant):

Screenshot 2024-03-12 at 12 59 33

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

Any additional context:

elasticmachine commented 6 months ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

kpatticha commented 6 months ago

Related ticket: https://github.com/elastic/kibana/issues/161239

neptunian commented 5 months ago

In https://github.com/elastic/kibana/issues/161239, we changed the composite size to 1500 with no pagination. However, over a wide enough time range with 1500 unique top-level buckets (service name, dependency name), it is still easy to exceed the default Elasticsearch limit of 65,536 buckets. In the query below the date histogram interval is daily (86400s) for roughly a 3-month time range, so 1500 (service/dependency pairs) * 90 (days) = 135,000 buckets, not counting the extra buckets from the event.outcome field (up to 3 more per day per pair).
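As a rough back-of-the-envelope check (a sketch only; `estimateBuckets` and its parameter names are made up for illustration, not Kibana code), the bucket count grows multiplicatively with the composite size, the number of histogram intervals in the range, and the event.outcome terms:

```
// Hypothetical estimate of the buckets produced by the query below.
const DEFAULT_MAX_BUCKETS = 65_536; // Elasticsearch search.max_buckets default

function estimateBuckets(opts: {
  compositePairs: number;   // unique (service.name, span.destination.service.resource) pairs
  rangeMs: number;          // selected time range in milliseconds
  fixedIntervalMs: number;  // date_histogram fixed_interval
  outcomeTerms?: number;    // extra buckets per interval from the event.outcome terms agg
}): number {
  const intervals = Math.ceil(opts.rangeMs / opts.fixedIntervalMs);
  const perPair = intervals * (1 + (opts.outcomeTerms ?? 0));
  return opts.compositePairs * perPair;
}

// 90 days at a daily interval with 1500 pairs:
const ninetyDays = 90 * 24 * 60 * 60 * 1000;
const estimate = estimateBuckets({
  compositePairs: 1500,
  rangeMs: ninetyDays,
  fixedIntervalMs: 24 * 60 * 60 * 1000,
});
console.log(estimate, estimate > DEFAULT_MAX_BUCKETS);
// => 135000 true, already over the limit before counting event.outcome buckets
```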

The date histogram creates buckets per day over the 3-month time range:

```
"timeseries": {
  "date_histogram": {
    "field": "@timestamp",
    "fixed_interval": "86400s",
    "extended_bounds": {
      "min": 1706629793149,
      "max": 1714488593149
    }
  },
```
Full query:

```
{
  "track_total_hits": true,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "terms": { "processor.event": ["metric"] } },
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "filter": [{ "term": { "metricset.name": "service_destination" } }],
                  "must_not": { "terms": { "metricset.interval": ["10m", "60m"] } }
                }
              }
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "filter": [{ "term": { "metricset.name": "service_destination" } }],
                  "must_not": { "terms": { "metricset.interval": ["10m", "60m"] } }
                }
              },
              {
                "range": {
                  "@timestamp": { "gte": 1706629793149, "lte": 1714488593149, "format": "epoch_millis" }
                }
              },
              {
                "bool": {
                  "must_not": [
                    { "terms": { "agent.name": ["js-base", "rum-js", "opentelemetry/webjs"] } }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "connections": {
      "composite": {
        "size": 10000,
        "sources": [
          { "serviceName": { "terms": { "field": "service.name" } } },
          { "dependencyName": { "terms": { "field": "span.destination.service.resource" } } }
        ]
      },
      "aggs": {
        "sample": {
          "top_metrics": {
            "size": 1,
            "metrics": [
              { "field": "service.environment" },
              { "field": "agent.name" },
              { "field": "span.type" },
              { "field": "span.subtype" }
            ],
            "sort": { "@timestamp": "desc" }
          }
        },
        "total_latency_sum": { "sum": { "field": "span.destination.service.response_time.sum.us" } },
        "total_latency_count": { "sum": { "field": "span.destination.service.response_time.count" } },
        "timeseries": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "86400s",
            "extended_bounds": { "min": 1706629793149, "max": 1714488593149 }
          },
          "aggs": {
            "latency_sum": { "sum": { "field": "span.destination.service.response_time.sum.us" } },
            "count": { "sum": { "field": "span.destination.service.response_time.count" } },
            "event.outcome": {
              "terms": { "field": "event.outcome" },
              "aggs": {
                "count": { "sum": { "field": "span.destination.service.response_time.count" } }
              }
            }
          }
        }
      }
    }
  }
}
```

Here are some options:

1. Use larger time intervals for wider time ranges, so the date histogram creates fewer buckets (see the sketch after this list).
2. Separate the histogram timeseries buckets from the service/dependency query, so timeseries data is only fetched for the dependencies currently visible in the table.
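A minimal sketch of what the first option could look like (hypothetical; `pickFixedInterval` and the tier values are assumptions for illustration, not the actual APM bucket-size helper):

```
// Hypothetical interval picker: walk coarser fixed_interval candidates until the
// estimated bucket count fits under a budget. The tier values are illustrative only.
const CANDIDATE_INTERVALS_SECONDS = [
  60 * 60,            // 1h
  24 * 60 * 60,       // 1d
  7 * 24 * 60 * 60,   // 7d
  30 * 24 * 60 * 60,  // 30d
  90 * 24 * 60 * 60,  // ~3 months, for multi-year ranges
];

function pickFixedInterval(rangeMs: number, pairs: number, bucketBudget = 65_536): string {
  for (const seconds of CANDIDATE_INTERVALS_SECONDS) {
    const intervals = Math.ceil(rangeMs / (seconds * 1000));
    if (pairs * intervals <= bucketBudget) {
      return `${seconds}s`; // value for date_histogram.fixed_interval
    }
  }
  // Fall back to the coarsest interval; the query may still trip the limit.
  return `${CANDIDATE_INTERVALS_SECONDS[CANDIDATE_INTERVALS_SECONDS.length - 1]}s`;
}

// 90 days with 1500 pairs lands on the 7-day interval (1500 * 13 = 19,500 buckets).
pickFixedInterval(90 * 24 * 60 * 60 * 1000, 1500);
```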

neptunian commented 5 months ago

Talked with @smith and we're going to go with the first option, using larger time intervals, which means fewer buckets.

neptunian commented 4 months ago

@chrisdistasio @paulb-elastic

There's a PR open here: https://github.com/elastic/kibana/pull/182884. This fix does not cover very large time ranges, like 4+ years with the maximum number of dependencies (1500). My thought is that there should be a balance between how many buckets we try to stay under for any time range and letting the user choose to increase their bucket limit; we can advise the user to increase their default max buckets in this case. If we feel we should always aim to stay under the max bucket limit, even in a scenario spanning several years, I can do that. Currently the largest time interval is 30 days, which can still produce too many buckets for something like 4 years, so we would need to switch to something like 3 months. If we want to do this I'd prefer to do it in a separate PR, as it will require changes to a function used all over the APM UI and more in-depth testing. The better alternative would be to implement the second option, "Separate histogram timeseries buckets from service".
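For reference, a rough sketch of what that second option could look like (an assumption, not the actual PR or APM code; the field names follow the query above, but the helper name and shape are invented): first fetch the current page of service/dependency rows without the date histogram, then request timeseries data only for those pairs.

```
// Hypothetical second-phase request: timeseries only for the dependencies visible
// on the current page (e.g. 25 rows), instead of all 1500 composite buckets.
function timeseriesForPage(
  pairs: Array<{ serviceName: string; dependencyName: string }>,
  start: number,
  end: number,
  fixedInterval: string
) {
  return {
    size: 0,
    query: {
      bool: {
        filter: [
          { range: { '@timestamp': { gte: start, lte: end, format: 'epoch_millis' } } },
          // One clause per visible row keeps the top-level bucket count at page size.
          {
            bool: {
              should: pairs.map((p) => ({
                bool: {
                  filter: [
                    { term: { 'service.name': p.serviceName } },
                    { term: { 'span.destination.service.resource': p.dependencyName } },
                  ],
                },
              })),
              minimum_should_match: 1,
            },
          },
        ],
      },
    },
    aggs: {
      connections: {
        composite: {
          size: pairs.length,
          sources: [
            { serviceName: { terms: { field: 'service.name' } } },
            { dependencyName: { terms: { field: 'span.destination.service.resource' } } },
          ],
        },
        aggs: {
          timeseries: {
            date_histogram: {
              field: '@timestamp',
              fixed_interval: fixedInterval,
              extended_bounds: { min: start, max: end },
            },
            aggs: {
              latency_sum: { sum: { field: 'span.destination.service.response_time.sum.us' } },
              count: { sum: { field: 'span.destination.service.response_time.count' } },
            },
          },
        },
      },
    },
  };
}
```

With 25 rows per page and a daily interval over 90 days, that is roughly 25 * 90 = 2,250 histogram buckets, well under the default limit.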

paulb-elastic commented 4 months ago

Thanks @neptunian, that seems a good and reasonable approach (@chrisdistasio, do you see a need for such long time periods?).

@neptunian, if the user does select a 4+ year range, what's the user experience? Do they still end up with the too_many_buckets_exception? If we wanted to detect that exception and show something like "please select a shorter time period", does that also fall into the bigger piece of work you mentioned?

neptunian commented 4 months ago

> if the user does select a 4+ year range, what's the user experience? Do they still end up with the too_many_buckets_exception? If we wanted to detect that exception and show something like "please select a shorter time period", does that also fall into the bigger piece of work you mentioned?

Yes, they will still get the error, with a "failed to fetch" in the table. With the "Separate histogram timeseries buckets from service" option I mentioned, they would be unlikely to get the error, because we'd only fetch timeseries data for the dependencies they are currently looking at (the table defaults to 25 items per page, and we could make that lower). A significant part of the problem is that we fetch timeseries data for ALL of their services, even though they can't look at all of them at once anyway.

I think the current error that tells them to adjust their settings so they can get more buckets is helpful and we should keep it, but I understand they don't know exactly why it happened or what they can do to remedy it other than raising their bucket limit, so adding that kind of messaging could help: "There is too much data being returned. Adjust your cluster bucket size (same as the current messaging about adjusting the bucket limit) or try narrowing your time range." This message comes from Elasticsearch, so we'd have to parse it and append extra messaging suggesting a narrower time range. It would show up for all the ES queries in APM that hit this exception, and may not be helpful in contexts where the time range is not a significant contributor to the bucket count.
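A sketch of what detecting the exception and appending the extra hint could look like (hypothetical; the real error plumbing in Kibana differs and the parsed error shape here is an assumption, though `too_many_buckets_exception` and `search.max_buckets` are the actual Elasticsearch names):

```
// Hypothetical: inspect a failed ES response and append a time-range hint when the
// failure is a too_many_buckets_exception. The error-body shape is an assumption.
interface EsErrorCause {
  type?: string;
  reason?: string;
  caused_by?: EsErrorCause;
}

function isTooManyBuckets(cause?: EsErrorCause): boolean {
  if (!cause) return false;
  return cause.type === 'too_many_buckets_exception' || isTooManyBuckets(cause.caused_by);
}

function userFacingMessage(cause: EsErrorCause): string {
  const base = cause.reason ?? 'The query returned too much data.';
  return isTooManyBuckets(cause)
    ? `${base} Try narrowing the time range, or raise the cluster's search.max_buckets setting.`
    : base;
}
```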