hop-dev opened this issue 3 years ago
Pinging @elastic/fleet (Team:Fleet)
@elastic/kibana-stack-management have you all looked into optimizing your usage of the Data Streams stats API? I noticed that by default, stats are excluded from your Data Streams UI (you have to switch on a toggle in the top right). Curious if there's any history behind this decision and whether we should also consider excluding stats by default or removing them from the list view entirely.
@joshdover We haven't had an opportunity to revisit that functionality since it was first implemented. Loading the data stream stats requires hitting a separate API (https://github.com/elastic/kibana/pull/75107/files#diff-0db7f035e2e41be22bac202848c325fabf209f626b8a934d09cce5e9e074941bR34), and I think the stats themselves can take a while to fetch, so retrieving the data streams along with their stats may be slow. I recommend pinging the ES Data Management team for more detailed and up-to-date info.
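For reference, a rough sketch of what that separate stats call looks like, assuming it is the Elasticsearch data stream stats API exposed through the JS client (the node URL and name pattern below are placeholders):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Sketch of the extra round trip: listing data streams is one call, but store size and
// backing index counts come from a separate stats API call per data stream pattern.
async function fetchDataStreamStats(name: string) {
  const stats = await client.indices.dataStreamsStats({ name });
  // Each entry carries store_size_bytes and backing_indices, the kind of
  // information the UI toggle exposes.
  return stats.data_streams;
}
```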
This continues to be a problem for what I expect to be most Fleet customers. In my test cluster, I have ~60 data streams with ~300 backing indices, and the request to GET /api/fleet/data_streams is timing out in Kibana after 2 minutes, resulting in a 502 error in Cloud, likely from the proxy layer: `backend closed connection`.
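For anyone wanting to reproduce the timing, a minimal sketch (Node 18+; the Kibana URL and API key auth are placeholders for your own deployment):

```ts
// Minimal timing sketch against the Fleet endpoint mentioned above.
// KIBANA_URL and API_KEY are placeholders; adjust auth to whatever your deployment uses.
const KIBANA_URL = process.env.KIBANA_URL ?? 'http://localhost:5601';

async function timeDataStreamsRequest(): Promise<void> {
  const started = Date.now();
  const res = await fetch(`${KIBANA_URL}/api/fleet/data_streams`, {
    headers: { Authorization: `ApiKey ${process.env.API_KEY}` },
  });
  const elapsed = (Date.now() - started) / 1000;
  console.log(`status=${res.status} elapsed=${elapsed.toFixed(1)}s`);
}

timeDataStreamsRequest().catch(console.error);
```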
I don't think this is anywhere close to a large amount of data (I'm only ingesting data from ~6 integrations from 2 laptops that aren't even always in use).
@jen-huang I'm going to add this to our iteration board to look at in the next testing cycle. I think we should try to get a fix in for the 7.x series as well.
I did some further digging in our production data here and I'm seeing about 2.5% of customers who attempted to use this page were affected by this bug in the last 7 days. I haven't dug further, but my guess is this affects our largest, most mature adopters of Fleet, an important segment. While the incidence rate isn't incredibly high, 97.5% isn't exactly a great SLA. I think prioritizing this is the right call.
@joshdover Any update on this? I am one of the affected customers who relies heavily on Fleet. If I can help with any logs/testing, I would be more than happy to!
Hi @thunderwood19 we have this prioritized to be worked on soon but have not yet dug in further. In the meantime, I do suggest using the UI in Stack Management > Index Management > Data streams.
Related to this, in https://github.com/elastic/kibana/issues/126067 it was discovered that the user needs the `manage` cluster privilege in order to access the Data stream stats API. This limits the usability of this page now that we're allowing non-superusers to use Fleet.
I think this requirement gives us further reason to explore decoupling the request to the Data stream stats API from fetching the list of data streams. If we loaded the stats separately, we may be able to show the main list quicker while also providing a more progressive UI for users with lower privileges.
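To make the idea concrete, here is a minimal client-side sketch of what that decoupling could look like; the `?includeStats` parameter and response shape are hypothetical, not the current Fleet API:

```ts
interface DataStreamRow {
  name: string;
  sizeInBytes?: number; // filled in by the second request; may stay empty for low-privilege users
}

// Hypothetical progressive loading: render the cheap list first, then patch in stats.
async function loadDataStreamsProgressively(render: (rows: DataStreamRow[]) => void) {
  // 1. Fast path: list data streams without stats and render immediately.
  const listRes = await fetch('/api/fleet/data_streams?includeStats=false');
  const rows: DataStreamRow[] = (await listRes.json()).data_streams;
  render(rows);

  // 2. Slow path: fetch stats separately; if the user lacks the `manage` cluster
  //    privilege this call can fail without breaking the main list.
  try {
    const statsRes = await fetch('/api/fleet/data_streams?includeStats=true');
    render((await statsRes.json()).data_streams);
  } catch {
    // Leave the size column empty rather than failing the whole page.
  }
}
```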
@thunderwood19 Have you had a chance to test this on 8.1? We've made some improvements and I'm no longer seeing this issue as widespread in our production data or in my personal cluster on Elastic Cloud.
Yep! I let my support know yesterday. I can see the data streams via the Fleet GUI just fine now on 8.1.0.
Fantastic to hear. @jen-huang I'm going to de-prioritize this for now.
Some improvements are being made in https://github.com/elastic/kibana/pull/130973 to switch some of the calculations from aggregations to the terms enum API. This increases the request count, but should be a big improvement on overall performance.
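For context, here is roughly the difference in shape, sketched with the Elasticsearch JS client (field and index names are only examples, not the exact queries in that PR):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Terms aggregation: buckets values by scanning the matching documents.
async function namespacesViaAggregation(index: string) {
  const res = await client.search({
    index,
    size: 0,
    aggs: { namespaces: { terms: { field: 'data_stream.namespace' } } },
  });
  return res.aggregations;
}

// Terms enum API: reads candidate values from the index terms dictionary instead,
// which is generally much cheaper for keyword / constant_keyword fields.
async function namespacesViaTermsEnum(index: string) {
  const res = await client.termsEnum({ index, field: 'data_stream.namespace' });
  return res.terms;
}
```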
Pagination would still be welcome to avoid the n+1 query problem we have right now
@joshdover what remains for us to do in this regard? Should we track this for 8.5 (for Fleet scaling)?
I think we mostly need to do the pagination work at this point. I don't think it's super high priority right now though. It doesn't affect control plane scaling, mostly data plane.
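As a rough sketch of how pagination would bound the n+1 shape (the page/perPage contract below is hypothetical, not an existing Fleet API):

```ts
// Hypothetical pagination sketch: today the per-stream enrichment (stats/aggregations)
// runs once for every data stream in the cluster; with pagination it runs at most
// `perPage` times per request.
interface PageRequest {
  page: number;    // 1-based page index
  perPage: number; // e.g. 20 rows per page
}

async function getDataStreamPage<T>(
  allNames: string[],
  { page, perPage }: PageRequest,
  enrich: (name: string) => Promise<T>
) {
  const start = (page - 1) * perPage;
  const names = allNames.slice(start, start + perPage);
  const items = await Promise.all(names.map((name) => enrich(name)));
  return { total: allNames.length, page, perPage, items };
}
```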
Kibana version:
7.15.0, 7.16.0, master
Description of the problem including expected versus actual behavior:
Originally pointed out by @joshdover here:
The data stream view can be quite slow to load when there are a lot of streams. We currently get all data streams in one request without pagination and perform an aggregation per data stream.
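For context, the per-data-stream work is roughly of this shape (a simplified sketch with the Elasticsearch JS client, not the exact Fleet query): one search per data stream, so N data streams means N extra queries.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Simplified sketch of the kind of per-stream aggregation described above.
async function aggregateDataStream(name: string) {
  const res = await client.search({
    index: name,
    size: 0,
    aggs: {
      type: { terms: { field: 'data_stream.type', size: 1 } },
      dataset: { terms: { field: 'data_stream.dataset', size: 1 } },
      namespace: { terms: { field: 'data_stream.namespace', size: 1 } },
      last_activity: { max: { field: 'event.ingested' } },
    },
  });
  return res.aggregations;
}
```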
This issue is to look into ways of improving the performance. Current options discussed:
1. Using the data stream name to extract the type, dataset and namespace instead of aggregating
Currently, there is no guarantee that the constant_keyword values in the data match the data stream name. @ruflin suggested we could look at filing a feature request for Elasticsearch to validate the constant keywords against the data stream name, allowing us to rely on this link (see the sketch below).
However, we are now looking at adding another aggregation as part of https://github.com/elastic/integrations/issues/768, so there may no longer be a big efficiency gain to be found here.
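A minimal sketch of option 1, assuming the standard <type>-<dataset>-<namespace> naming scheme and that the constant_keyword values actually match the name (which is exactly the guarantee discussed above):

```ts
interface DataStreamParts {
  type: string;      // e.g. "logs" or "metrics"
  dataset: string;   // e.g. "nginx.access"
  namespace: string; // e.g. "default"
}

// Assumes type and dataset contain no "-" (my reading of the naming scheme restrictions),
// so the first two dashes delimit the parts and the namespace keeps any remaining dashes.
function parseDataStreamName(name: string): DataStreamParts | undefined {
  const first = name.indexOf('-');
  const second = name.indexOf('-', first + 1);
  if (first === -1 || second === -1) return undefined; // not in the expected format
  return {
    type: name.slice(0, first),
    dataset: name.slice(first + 1, second),
    namespace: name.slice(second + 1),
  };
}

// parseDataStreamName('logs-nginx.access-default')
// => { type: 'logs', dataset: 'nginx.access', namespace: 'default' }
```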
2. Introducing pagination
We could introduce pagination to limit the work we do, however there would be some challenges:
- Sorting is based on `event.ingested`. We would still have to get the values for all data streams and then sort in memory I believe, so there may not be a massive performance gain.

3. Combine individual aggregations into one aggregation
I am not sure this is possible. We could find a way to use filters and sub-aggregations to get the namespace, dataset and type for each data stream in one query. We would need to be able to distinguish each data stream using a `filter` query I believe, and the only way to distinguish them would be to use the values we are querying for!

Steps to reproduce:
/app/fleet/data-streams