elastic / kibana

Allow resolution of Data View without resolving all fields #139340

Closed miltonhultgren closed 2 months ago

miltonhultgren commented 2 years ago

When calling dataViewsService.get(dataViewId), the fields inside that data view are resolved at the same time, which adds a decent chunk to the time to resolution and blocks rendering until it is done. There are cases in the Logs and Metrics UI where we would prefer to defer the field resolution to a later stage yet still integrate with the Data Views service (for example, use the index pattern and timestamp field but not offer autocompletion until later).

Would it be possible to make field resolution optional until requested?
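
For context, a minimal sketch of how the blocking call looks from a consumer today; the import path, type, and method names approximate Kibana's data views plugin, and the option in the trailing comment is hypothetical:

```ts
// Sketch only: import path, type, and method names approximate the data views plugin.
import type { DataViewsContract } from '@kbn/data-views-plugin/public';

async function loadLogsViewMetadata(dataViews: DataViewsContract, dataViewId: string) {
  // Today this resolves the full field list via _field_caps before returning,
  // even though the initial render only needs the two values below.
  const dataView = await dataViews.get(dataViewId);
  return {
    indexPattern: dataView.getIndexPattern(),
    timeFieldName: dataView.timeFieldName,
  };
}

// Hypothetical opt-out (not an existing parameter):
// const dataView = await dataViews.get(dataViewId, { resolveFields: false });
```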

elasticmachine commented 2 years ago

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

elasticmachine commented 2 years ago

Pinging @elastic/kibana-app-services (Team:AppServicesSv)

mattkime commented 2 years ago

Would it be possible to make field resolution optional until requested?

It is, but I'd like to have a thorough understanding before this is implemented. Generally speaking, we expect field list loading to be fast so I'm curious about the cases where this isn't true.

What is the priority on this? Is it tied to any high priority items?

weltenwort commented 2 years ago

The _field_caps call that is performed to load the field list can take a while when using CCS. It is a common deployment topology for observability to have region-specific or team-specific monitoring/logging clusters and then combine several of those via CCS in cross-region/team clusters.

In those situations the get() becomes a bottleneck for the UI since it has to wait for tens of seconds until the _field_caps returns and the data view instance is returned.
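
For reference, a rough way to observe that latency in isolation with the Elasticsearch JS client against a CCS pattern; the endpoint, cluster aliases, and index pattern below are placeholders:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'https://localhost:9200' }); // placeholder endpoint

// A CCS index pattern fans out to every remote cluster, so the overall
// response time is roughly that of the slowest remote.
async function timeFieldCaps() {
  const start = Date.now();
  await client.fieldCaps({
    index: 'logs-*,remote_eu:logs-*,remote_us:logs-*', // hypothetical remotes
    fields: '*',
  });
  console.log(`_field_caps took ${Date.now() - start}ms`);
}

timeFieldCaps().catch(console.error);
```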

mattkime commented 2 years ago

@weltenwort Are the speed concerns still an issue with the current state of _field_caps? https://github.com/elastic/elasticsearch/issues/84504

As far as I know, there's no way it should be taking tens of seconds.

Taking a step back, I'm happy to provide data views with async field loading, but I want to make sure I understand our current limitations.

weltenwort commented 2 years ago

The _field_caps request with CCS probably needs to wait for the slowest cluster. This is an example trace with just one lightly loaded remote cluster:

[screenshot: trace of the data view request]

You can see that the call made while loading the data view takes 4 s with 3.3 s of that being taken up by the _field_caps call to ES.

mattkime commented 2 years ago

@weltenwort Which version of the stack is that? I'd be interested to hear other relevant details too - what does 'lightly loaded' mean? How many fields?

I'm being stubborn about this because _field_caps being relatively fast is a core assumption. Overturning that would involve a fair amount of work and therefore diligence. Maybe these are the first steps.

Ideally we'd be seeing sub-second responses.

mattkime commented 2 years ago

@dnhatn I noticed your work on benchmarks for the field caps api. Do we have a better idea of what we can expect performance wise?

matschaffer commented 2 years ago

Wondering if @pugnascotia 's ES tracing work might help confirm what we're waiting on for those 3.3s 🤔

pugnascotia commented 2 years ago

It would at least give you an idea what tasks are being executed.

weltenwort commented 2 years ago

The clusters are managed by the observability dev productivity team's tooling. These are the details I could find, where "production" is the cluster that my Kibana instance runs on and "remote" is the cluster that is accessed via CCS:

[production cluster details]

[remote cluster details]

Is there a way we can enable tracing on those clusters in a non-destructive way?

dnhatn commented 2 years ago

I think 3 seconds is possible if the cluster has 1000+ indices. We have another optimization in https://github.com/elastic/elasticsearch/pull/86323. However, it's still unmerged. I will try to get it in this week. This optimization should reduce the latency to sub-seconds.

weltenwort commented 2 years ago

I think 3 seconds is possible if the cluster has 1000+ indices.

Right, this is not about a few seconds being too slow when the request hits that number of indices. It's about not being able to avoid it when loading a data view even when the component doesn't need the field list right away.

This optimization should reduce the latency to sub-seconds.

That sounds amazing, thank you.

mattkime commented 2 years ago

I'm glad we had this discussion to help emphasize the importance of @dnhatn 's optimization work.

javanna commented 2 years ago

Thanks for making this connection @mattkime. Please ping us whenever you hit this kind of problem around calling field_caps (or any other API, really), otherwise we don't even get to know that there are issues you folks are looking to work around :)

miltonhultgren commented 2 years ago

So is the conclusion that we aim to improve the performance of field resolution to be so fast that it's not an issue to resolve them even if they're not needed at all times? And we expect that even for CCS use cases this will still be fast enough to not block rendering noticeably?

mattkime commented 2 years ago

@miltonhultgren Yes, although these aren't necessarily mutually exclusive paths. What is the use case for loading a data view without the fields? I'd like to get into the details of what you're doing since I'll often learn something useful. Yes, I understand that initially you just need the index pattern and timestamp field but I'd still like to learn more.

It looks like the case that might have taken 3s will now take about 0.3s. Is 0.3s meaningful in this case? I'm unaware of time to load being optimized to this degree elsewhere.

All the data view code assumes the field list exists once a DataView instance has been initiated. This would be a significant change. If we were to rewrite the data views code, I'd definitely defer loading the field list. I'm trying to figure out the priority of making this change.

miltonhultgren commented 2 years ago

What is the use case for loading a data view without the fields?

We have two use cases today, one in Logs and one for a Lens table that shows host metrics. In Logs, we use the data view to resolve which indices to load logs from, and we use the timestamp field as a tie breaker for sorting (I think). In the Logs case we do want the fields, but at a later time, to suggest fields for filters or to change which fields to show from the log document; this doesn't need to block the initial page load.

For the new Lens table, we don't need the fields at all since we simply want to load the right metrics from the right index and no auto completion needs to happen for that table (though, later it might be filtered through unified search).

In the rest of the Metrics UI we follow a similar pattern: initially we only load the metrics from the right index and defer the field resolution until a bit later when it's needed.

So it really just boils down to wanting to defer work for later so that initial render with useful data can happen quicker.
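
Roughly, the two-phase pattern we're after looks like this; a sketch only, where the import path is approximate and the helper functions are illustrative rather than existing APIs:

```ts
// Sketch of the two-phase pattern; the helpers below are illustrative, not existing APIs.
import type { DataView, DataViewsContract } from '@kbn/data-views-plugin/public';

declare function renderLogStream(indexPattern: string, timeField?: string): void;
declare function enableFieldSuggestions(fields: unknown[]): void;
declare function fetchFields(dataView: DataView): Promise<unknown[]>;

async function initLogsPage(dataViews: DataViewsContract, dataViewId: string) {
  // Phase 1: render as soon as the index pattern and timestamp field are known.
  const dataView = await dataViews.get(dataViewId); // today this also resolves all fields
  renderLogStream(dataView.getIndexPattern(), dataView.timeFieldName);

  // Phase 2: resolve fields in the background, only for autocompletion/filtering.
  void fetchFields(dataView).then(enableFieldSuggestions);
}
```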

Is 0.3s meaningful in this case?

No, I don't think so.

This would be a significant change.

Understood, I think we'd do best to wait and see how the optimization performs, especially in CCS setups with slower networks/remotes, and at what percentile we might still see such load times. We'll also need to gather more accurate data on this, preferably from real deployments that are properly sized for the workload (the Edge cluster isn't).

mattkime commented 2 years ago

Sounds good. I'll think about how we might do this as smaller efforts instead of one big push.

dnhatn commented 1 year ago

I have merged https://github.com/elastic/elasticsearch/pull/86323. I think it should unblock the work here.

elasticmachine commented 1 year ago

Pinging @elastic/kibana-data-discovery (Team:DataDiscovery)

kertal commented 1 year ago

So the ask would be to e.g. add a param to dataViewsService.get(dataViewId) that allows getting the data view without resolving all fields, or to create a separate function like getWithoutFields, right? Sounds like a feature request more than a bug.
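
To make the two options concrete, a sketch of what the API surface might look like; neither shape exists today, and the option/method names are placeholders for discussion:

```ts
import type { DataView } from '@kbn/data-views-plugin/common';

// Sketch only: neither shape exists today.
interface DataViewsServiceSketch {
  // Option A: an opt-out parameter on the existing method.
  get(dataViewId: string, options?: { resolveFields?: boolean }): Promise<DataView>;

  // Option B: a dedicated method that never triggers field resolution.
  getWithoutFields(dataViewId: string): Promise<DataView>;
}
```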

StephanErb commented 1 year ago

I thought I'd share a bit of experience from the field: we have now updated our production cluster to 8.6.1 with the latest field caps improvement (https://github.com/elastic/elasticsearch/pull/86323). Unfortunately performance is still not optimal for us: field_caps for metricbeat* is still in the 10-30s range. This runtime appears to be mostly dominated by frozen nodes with 800-1000 shards. The mappings are dynamic, so most indices have slightly different mappings. The mappings have several thousand fields.

I also fear that the problem will get worse with TSDB and synthetic source. With the good compression ratio of TSDB combined with the planned primary shard cap at 200M documents (https://github.com/elastic/elasticsearch/issues/87246), a single frozen node will be holding significantly more shards in the future. I would thus expect performance to deteriorate further.

mattkime commented 1 year ago

@StephanErb

This runtime appears to be mostly dominated by frozen nodes with 800-1000 shards.

Having frozen indices within the metricbeat-* index pattern isn't something we've worked on as placing data on frozen tiers is a choice to lower cost at the expense of speed.

I think the solution should be to make sure the frozen indices are not available to the `metricbeat-*` index pattern. Is this possible? Is something in the way?

StephanErb commented 1 year ago

Having frozen indices within the metricbeat-* index pattern isn't something we've worked on as placing data on frozen tiers is a choice to lower cost at the expense of speed.

I would expect that querying data on frozen nodes leads to a slowdown. However, the mere presence of data on the frozen tier outside of the queried time range should not have a performance impact. At least that's what my team and I have assumed so far.

We have Kubernetes and Prometheus metrics in metricbeat-*. We use an ILM policy that transitions data from hot to warm after 2 days, from warm to cold after 7 days, and finally from cold to frozen after 30 days. Most of our alerts and dashboards look at the last 24 hours of data. Dashboards occasionally also look at 7 and 30 days, as those are default time filters in Kibana. Ranges >30 days are almost never queried.
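
For reference, the lifecycle described above would look roughly like this when created through the 8.x JS client; the policy name, rollover settings, and snapshot repository are placeholders:

```ts
import { Client } from '@elastic/elasticsearch';

async function createExamplePolicy(client: Client) {
  await client.ilm.putLifecycle({
    name: 'metricbeat-example-policy', // placeholder name
    policy: {
      phases: {
        hot: { actions: { rollover: { max_age: '1d', max_primary_shard_size: '50gb' } } },
        warm: { min_age: '2d', actions: {} },
        cold: { min_age: '7d', actions: {} },
        frozen: {
          min_age: '30d',
          // The frozen phase requires a searchable snapshot; repo name is a placeholder.
          actions: { searchable_snapshot: { snapshot_repository: 'my-snapshot-repo' } },
        },
      },
    },
  });
}
```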

Given that a field_caps query does not contain a time range parameter, I can somewhat see where the problem is coming from. However, as frozen nodes are not used for indexing new data, fields on them should be rather static and hopefully cacheable.

javanna commented 1 year ago

Field_caps does support providing a time range filter, and it runs the can_match phase to filter out irrelevant shards.
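
For illustration, a sketch of restricting the request to a recent time window via index_filter; the index pattern, field name, and range are examples:

```ts
import { Client } from '@elastic/elasticsearch';

// Shards whose data lies entirely outside the range (e.g. older data on
// frozen nodes) can then be skipped by the can_match phase.
async function fieldCapsLast24h(client: Client) {
  return client.fieldCaps({
    index: 'metricbeat-*',
    fields: '*',
    index_filter: {
      range: { '@timestamp': { gte: 'now-24h' } },
    },
  });
}
```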

The mappings are dynamic so most indices have slightly different mappings.

I suspect this is the main issue, as the performance improvements build on deduplication of mappings that share the same hash, which is not the case if there are slight differences between the indices.

Field_caps performance does not depend on the number of shards though, but rather on the number of indices with distinct mappings. It would be great to get more feedback here to see what we can improve further. Could you open an SDH around this?

elasticmachine commented 10 months ago

Pinging @elastic/obs-ux-logs-team (Team:obs-ux-logs)

elasticmachine commented 10 months ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

kertal commented 10 months ago

Yes, we intend to do this, the next step in this direction will be https://github.com/elastic/kibana/issues/167750

mattkime commented 4 months ago

@miltonhultgren DataViewLazy has been partially implemented. Can you look and see if it's useful for your needs? Fields are only loaded as requested, potentially saving a lot of overhead compared to regular DataViews.
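
A hypothetical usage sketch of what that deferral could look like from a consumer's perspective; the method names below are assumptions based on this description, declared locally rather than taken from the actual DataViewLazy API:

```ts
// All names here are assumptions for illustration, declared locally so the
// sketch stands on its own; the real DataViewLazy API may differ.
interface DataViewLazySketch {
  getIndexPattern(): string;
  timeFieldName?: string;
  getFields(options: { fieldName: string[] }): Promise<unknown>;
}

declare const dataViews: {
  getDataViewLazy(id: string): Promise<DataViewLazySketch>;
};

async function example(dataViewId: string) {
  // Resolving the lazy data view should not trigger _field_caps.
  const dataView = await dataViews.getDataViewLazy(dataViewId);
  console.log(dataView.getIndexPattern(), dataView.timeFieldName);

  // Fields are only loaded when explicitly requested.
  const fields = await dataView.getFields({ fieldName: ['*'] });
  console.log(fields);
}
```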

miltonhultgren commented 4 months ago

I'm no longer involved in the apps where we used DataViews that led to me opening this issue.

@weltenwort @neptunian Is this something that you guys could look at within the current logs and metrics code bases?

weltenwort commented 4 months ago

Thanks for the pointer. We have https://github.com/elastic/kibana/issues/179128 to track its usage in the log threshold alert.

kertal commented 2 months ago

@mattkime I think we can close this given https://github.com/elastic/kibana/issues/167750?