elastic / kibana


[Fleet] Improve data streams API efficiency #116428

Open · hop-dev opened this issue 3 years ago

hop-dev commented 3 years ago

Kibana version:

7.15.0, 7.16.0, master

Description of the problem including expected versus actual behavior:

Originally pointed out by @joshdover here:

The data stream view can be quite slow to load when there are a lot of streams. We currently get all data streams in one request without pagination and perform an aggregation per data stream.

This issue is to look into ways of improving the performance, current options discussed:

1. Using the data stream name to extract the type, dataset and namespace instead of aggregating

Currently, there is no guarantee that the constant_keyword values in the data match the data stream name. @ruflin suggested we could file a feature request for Elasticsearch to validate the constant_keyword values against the data stream name, which would allow us to rely on this link.

However, we are now looking at adding another aggregation as part of https://github.com/elastic/integrations/issues/768 so there may no longer be a big efficiency gain to be found here.
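
For illustration, here is a minimal sketch of what option 1 could look like, assuming the `<type>-<dataset>-<namespace>` naming scheme (e.g. `logs-nginx.access-default`); the helper name is hypothetical and this is not Fleet's actual implementation:

```ts
// Hypothetical helper: derive type, dataset and namespace from a data stream
// name of the form <type>-<dataset>-<namespace>. Assumes the type never
// contains "-" and the namespace is everything after the last "-".
interface DataStreamParts {
  type: string;
  dataset: string;
  namespace: string;
}

function parseDataStreamName(name: string): DataStreamParts | undefined {
  const firstDash = name.indexOf('-');
  const lastDash = name.lastIndexOf('-');
  // We need at least <type>-<dataset>-<namespace>, i.e. two distinct separators.
  if (firstDash === -1 || lastDash === firstDash) {
    return undefined;
  }
  return {
    type: name.slice(0, firstDash),               // e.g. "logs"
    dataset: name.slice(firstDash + 1, lastDash), // e.g. "nginx.access"
    namespace: name.slice(lastDash + 1),          // e.g. "default"
  };
}

// parseDataStreamName('logs-nginx.access-default')
// => { type: 'logs', dataset: 'nginx.access', namespace: 'default' }
```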

2. Introducing pagination

We could introduce pagination to limit the amount of work we do per request, though this would come with some challenges.
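
As a rough sketch only (the `page`/`perPage` parameters and function shape here are hypothetical, not the real Fleet API surface), pagination would mean slicing the data stream list first and only doing the expensive per-stream work for the requested slice:

```ts
import type { Client } from '@elastic/elasticsearch';

// Hypothetical paginated handler: fetch the full list of data stream names
// (cheap), then run the expensive per-stream aggregations/stats calls only
// for the page that was requested.
async function getDataStreamsPage(esClient: Client, page: number, perPage: number) {
  const { data_streams: allStreams } = await esClient.indices.getDataStream({ name: '*' });
  const start = (page - 1) * perPage;
  const pageOfStreams = allStreams.slice(start, start + perPage);

  // ...per-stream aggregations / stats calls would go here, for pageOfStreams only...

  return {
    total: allStreams.length,
    page,
    perPage,
    items: pageOfStreams.map((ds) => ds.name),
  };
}
```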

3. Combine individual aggregations into one aggregation

I am not sure this is possible. We would need to find a way to use filters and sub-aggregations to get the namespace, dataset and type for each data stream in one query. I believe we would need a filter query to distinguish each data stream, and the only way to distinguish them would be to use the very values we are querying for!
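
Purely as an assumption on my part (not a conclusion from this thread), one way a single combined query might be shaped is a composite aggregation over the `data_stream.*` constant_keyword fields: it returns every type/dataset/namespace combination in one request, with a sub-aggregation for last activity, at the cost of relying on those field values being correct:

```ts
import { Client } from '@elastic/elasticsearch';

// Sketch: one search request that buckets documents by their
// data_stream.type / dataset / namespace constant_keyword values.
async function getAllDataStreamMetadata(esClient: Client) {
  const response = await esClient.search({
    index: '*-*-*', // assumption: indices follow the data stream naming scheme
    size: 0,
    aggs: {
      streams: {
        composite: {
          size: 1000, // page through with after_key if there are more buckets
          sources: [
            { type: { terms: { field: 'data_stream.type' } } },
            { dataset: { terms: { field: 'data_stream.dataset' } } },
            { namespace: { terms: { field: 'data_stream.namespace' } } },
          ],
        },
        aggs: {
          last_activity: { max: { field: '@timestamp' } },
        },
      },
    },
  });
  return response.aggregations;
}
```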

Steps to reproduce:

  1. Set up Fleet & Fleet Server
  2. Create an agent policy with many integrations to create many data streams
  3. Go to /app/fleet/data-streams
  4. Note that the page can be quite slow to load
elasticmachine commented 3 years ago

Pinging @elastic/fleet (Team:Fleet)

joshdover commented 3 years ago

@elastic/kibana-stack-management have you all looked at optimizing your usage of the Data Streams stats API? I noticed that by default, stats are excluded from your Data Streams UI (you have to switch on a toggle in the top right). Curious if there's any history behind this decision and whether we should also consider excluding stats by default or removing them from the list view entirely.

cjcenizal commented 3 years ago

@joshdover We haven't had an opportunity to revisit that functionality since it was first implemented. Because loading the data stream stats requires hitting a separate API (https://github.com/elastic/kibana/pull/75107/files#diff-0db7f035e2e41be22bac202848c325fabf209f626b8a934d09cce5e9e074941bR34), and the stats themselves might take a while to fetch, retrieving the data streams along with their stats can be slow. I recommend pinging the ES Data Management team for more detailed and up-to-date info.
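
For context, here is a minimal sketch of that separate round trip, assuming the standard `_data_stream/_stats` API via the Elasticsearch JS client (this is not the Index Management code; field names follow that API's response):

```ts
import { Client } from '@elastic/elasticsearch';

// Sketch: fetch size and last-activity stats for all data streams in one call
// to the data stream stats API, which is separate from the list/get API.
async function getDataStreamStats(esClient: Client) {
  const { data_streams: stats } = await esClient.indices.dataStreamsStats({ name: '*' });
  return stats.map((s) => ({
    name: s.data_stream,
    backingIndices: s.backing_indices,
    storeSizeBytes: s.store_size_bytes,
    maximumTimestamp: s.maximum_timestamp,
  }));
}
```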

joshdover commented 2 years ago

This continues to be a problem for what I expect to be most Fleet customers. In my test cluster, I have ~60 data streams with ~300 backing indices, and the request to GET /api/fleet/data_streams is timing out in Kibana after 2 minutes, resulting in a 502 error in Cloud, likely from the proxy layer (`backend closed connection`).

I don't think this is anywhere close to a large amount of data (I'm only ingesting data from ~6 integrations on 2 laptops that aren't even always in use).

@jen-huang I'm going to add this to our iteration board to look at in the next testing cycle. I think we should try to get a fix in for the 7.x series as well.

joshdover commented 2 years ago

I did some further digging in our production data here and I'm seeing that about 2.5% of customers who attempted to use this page in the last 7 days were affected by this bug. I haven't dug in deeper, but my guess is that this affects our largest, most mature adopters of Fleet, which is an important segment. While the incidence rate isn't incredibly high, 97.5% isn't exactly a great SLA. I think prioritizing this is the right call.

thunderwood19 commented 2 years ago

@joshdover

Any update on this? I am one of the affected customers who relies heavily on Fleet. If I can help with any logs/testing, I would be more than happy to!

joshdover commented 2 years ago

Hi @thunderwood19, we have this prioritized to be worked on soon, but we have not yet dug in further. In the meantime, I do suggest using the UI in Stack Management > Index Management > Data streams.


Related to this, in https://github.com/elastic/kibana/issues/126067 it was discovered that the user needs the manage cluster privilege in order to access the Data stream stats API. This limits the usability of this page now that we're allowing non-superusers to use Fleet.

I think this requirement gives us further reason to explore decoupling the request to the Data stream stats API from fetching the list of data streams. If we loaded the stats separately, we may be able to show the main list quicker while also providing a more progressive UI for users with lower privileges.
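
A sketch of that decoupling idea, with all names hypothetical (this is not Fleet's code): render the list as soon as it arrives, and only fill in the stats if the privileged stats call succeeds:

```ts
interface DataStreamListItem { name: string; }
interface DataStreamStats { name: string; sizeBytes: number; }

// Hypothetical progressive loader: the list call works for lower-privileged
// users; the stats call needs the manage cluster privilege and may fail.
async function loadDataStreamsPage(deps: {
  fetchList: () => Promise<DataStreamListItem[]>;
  fetchStats: () => Promise<DataStreamStats[]>;
  renderList: (items: DataStreamListItem[]) => void;
  renderStats: (stats: DataStreamStats[]) => void;
  showStatsUnavailable: () => void;
}) {
  const list = await deps.fetchList();
  deps.renderList(list); // show the table immediately

  try {
    const stats = await deps.fetchStats();
    deps.renderStats(stats); // progressively enhance the rows
  } catch {
    deps.showStatsUnavailable(); // e.g. a 403 for non-superusers
  }
}
```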

joshdover commented 2 years ago

@thunderwood19 Have you had a chance to test this on 8.1? We've made some improvements and I'm no longer seeing this issue as widespread in our production data or in my personal cluster on Elastic Cloud.

thunderwood19 commented 2 years ago

> @thunderwood19 Have you had a chance to test this on 8.1? We've made some improvements and I'm no longer seeing this issue as widespread in our production data or in my personal cluster on Elastic Cloud.

Yep! I let my support know yesterday; I can see the data streams via the Fleet GUI just fine now on 8.1.0.

joshdover commented 2 years ago

Fantastic to hear. @jen-huang I'm going to de-prioritize this for now.

joshdover commented 2 years ago

Some improvements are being made in https://github.com/elastic/kibana/pull/130973 to switch to using the terms enum API instead of aggregations for some of the calculations. This increases the request count, but should be a big improvement on overall perf.
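
For illustration only (my sketch, not the change in that PR): the terms enum API can list the distinct values of a keyword field for a data stream without running an aggregation, e.g.:

```ts
import { Client } from '@elastic/elasticsearch';

// Sketch: use the terms enum API to enumerate the namespace values present
// in a data stream instead of running a terms aggregation.
async function getNamespaces(esClient: Client, dataStreamName: string) {
  const { terms } = await esClient.termsEnum({
    index: dataStreamName,
    field: 'data_stream.namespace',
    size: 10,
  });
  return terms;
}
```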

Pagination would still be welcome to avoid the n+1 query problem we have right now.

nimarezainia commented 2 years ago

@joshdover What remains for us to do in this regard? Should we track this for 8.5 (for Fleet scaling)?

joshdover commented 2 years ago

I think we mostly need to do the pagination work at this point. I don't think it's super high priority right now though. It doesn't affect control plane scaling, mostly data plane.