elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana
Other
19.61k stars 8.22k forks source link

[Security Solution][Sourcerer] Addressing sourcerer performance #172534

Open yctercero opened 11 months ago

yctercero commented 11 months ago

TLDR: Centralize discussion on sourcerer improvements.

Summary

Sourcerer refers to the logic used in the security solution UI that determines what indices to query on what pages. The initial design of this component included use of the default security solution data view. This data view includes all the indices specified under the Advanced Settings > Security Solution > Default indices. Out of the box, this list includes very broad index patterns like logs-*.

Product may be able to give more detailed background here, but the initial intent behind creating this all encompassing data view was to allow the user to interact with all aspects of the app without needing to worry about specifying a source index for the data to start streaming in. So as an example - when a user adds an integration, that data is sourced to a logs-INTEGRATION_NAME index that gets picked up by logs-*. Instead of having to update the advanced settings every time a new integration is added, it "just works" out of the box.

Issues

Periodically, and again recently, we have hit performance issues. Often the issues observed are:

Proposed next steps

  1. Profile page load performance - perhaps our baseline should be our customer zero that sits at around 23K fields
  2. Determine the desired UX - What is the intended user experience? Is there a strong argument for maintaining the idea of a default security data view encompassing all indices? If we allow users to somehow persist what index patterns get queried on certain pages, how would we want that experience to look like? Do they define these settings in Advanced Settings?
  3. Determine some possible short term gains - @paulewing suggested that we only pull in the ECS + runtime fields, all other fields can be fetched on demand.

While logs-* may be the offending index pattern at the moment, the issue at large is how we scale. Today, it may be logs-*, but next week it may be another that has ballooned.

e40pud commented 9 months ago

We got this SDH where user is asking whether it is possible to extend data view on Alerts page to include alerts from remote clusters. Right now data view is restricted to .alerts-security.alerts-default index pattern only.

It could be a nice enhancement for the users, so adding this request here.

nkhristinin commented 9 months ago

How to reproduce slow page load, because of sourcerer

https://github.com/elastic/kibana/assets/7609147/86fc3bbf-b9b6-4f67-9be3-a394301941b8

Next steps: analyse if it's possible to render some data while we fetch fields in background

nkhristinin commented 9 months ago

So in the video below, I hardcoded the response from the fields API, and set fields as: []

https://github.com/elastic/kibana/assets/7609147/c6447231-82e9-4644-a7c2-8f714c8499c5

We can see that there are some errors for filters, but otherwise, it's still a huge value, to load alerts rules page, or others.

My proposal:

After some investigation, it will be clear if we can do it page by page, or if there need to migrate all pages in one. Page by page is a less risky approach

nkhristinin commented 9 months ago

Another observation: Looks like internal/data_views/_fields_for_wildcard blocks all security pages on the first load. When we go to another page (client-side navigation) it blocks only the alert page render.

Also if the user goes to another page, and then to another, we can see that the system makes multiple requests to field API, which makes waiting longer, and in some cases, ES can die because of that.

Screenshot 2024-02-01 at 10 22 05
nkhristinin commented 9 months ago
old Another observation regarding [caching](https://github.com/elastic/kibana/pull/168910) `/internal/data_views/fields` API. It uses the `stale-while-revalidate` strategy to cache this request. From [PR](https://github.com/elastic/kibana/pull/168910) looks like it should always get data from the cache: > This means that after the initial field list request, all subsequent requests return the cached response which is very fast. If the cache is determined to be stale then the cache will update in the background and new data will be available on the next request. When I tested in real life sometimes data wasn't getting from the cache, but from real API. The explanation is most likely this (from [here](https://web.dev/articles/stale-while-revalidate#whats_it_mean)): >However, if the stale cached response is old enough that it falls outside the stale-while-revalidate window of time, then it will not fulfil the browser's request. The browser will instead retrieve a response from the network, and use that for both fulfilling the initial request and also populating the local cache with a fresh response. In the video you can see the left side is 8.12 and the right side is 8.13 Both environments have 25k fields and ~ 1000 indices. - we can see that the first request slow for all - The second request is slow for 8.12 and instant for 8.13 (cache) - and then at some point 8.13 get data not from cache https://github.com/elastic/kibana/assets/7609147/bf27b5b4-8e46-49d5-b2f4-0cc02ddb422c https://github.com/elastic/kibana/assets/7609147/2f6bc6c8-44ce-4484-85c9-d066eb9772c9 To make the cache more stable, we can recommend users who have problems with API change `data_views:cache_max_age` in advanced settings (From 5 seconds to 100 for example) ### Conclusion - Fields cache for `/internal/data_views/fields` can dramatically make the user experience better, if they have a lot of fields/indices. - First-page load, and load after cache is stale will be slow. We should probably still investigate if we can make it faster or feel faster for users.
nkhristinin commented 8 months ago

Sourcerer performance overview

This post aims to provide a performance overview of the sourcerer at the moment and summarise my finding.

TLDR: Sourcerer can still cause performance problems and affect page load if the client has a lot of indices/fields. Browser caching can help significantly, but not fix the problem 100%. There are potential relatively quick wins that allow us to unblock page load and reduce the number of API calls. We need to rethink our future use of the sourcerer and fetching fields strategy.

Content:

  1. What is the sourcerer and which performance problem it has
  2. How caching will help
  3. Potential "small" fixes
  4. Other places to look
  5. Future improvements

What is the sourcerer and which performance problem it has

Sourcerer - is a component which provides security solution info about data views, indices, patterns, and fields.

Currently, it uses the data_view plugin, which makes API calls to /internal/data_views/fields (Previously internal/data_views/_fields_for_wildcard)

If users have a lot of data, these API calls can be slow. In extreme cases with 1000 indices and 25k fields for default cloud env - the response takes around 25 seconds.

At the moment, for Infosec this request takes ~ 7 seconds and contains ~27k fields.

This API call blocks the initial load of every security page. So every page reload, or new tab opening will cause a wait for this API response and users will see no data, just the loader.

It's happening, because we are fetching data view when we create a redux store on plugin initialisation.

Further client navigation to other pages doesn't request this API, except the Alerts page, which is render always blocked by this API.

This is happening, because of the filter component.

How caching will help

Recently there was a browser cache implemented for /internal/data_views/fields

It uses stale-while-revalidate under the hood.

Cache-Control: private, max-age=5, stale-while-revalidate=31535995

Here is an example of how the cache works. Is means that in theory, after we have value in the cache we have 3 ways:

In real-life testing, it was working well, when I did have a small response from this API.

In testing, with the response which takes ~6 seconds, I noticed 2 things:

https://drive.google.com/file/d/11SeonxolRrlrsAgaeZTOjqUbDXfVLDVc/view?usp=drive_link

For clients who have very slow responses, or do not have stable revalidation, we can recommend increasing data_views:cache_max_age in Advanced settings as a workaround.

Summary: cache works well for a majority of cases, and will dramatically increase user performance. There are cases with cancellation, or not using cache which we need to investigate.

Potential "small" fixes

  1. Issue It's not blocking the initial page load, meaning not query the whole data view when we create a store. Interestingly enough, this optimisation already was done, and then we migrated to using data view API, and it stopped working.

  2. Issue When we go to some page, and then back to the Alerts page, it makes an additional request to data view API and for some reason, this request is not cached:

    1. Should we make this request?
    2. How to make it cached.

Other places to look

https://github.com/elastic/kibana/assets/7609147/3f327a3e-e678-47fc-a75f-1f2149ada207

Future improvements

There are interesting RFC which aims to be able to load fields async.

We should probably reconsider if we should store on UI the whole list of fields. It's hard to imagine how users need 25k fields at the same time. But keep this in memory, passing as props across applications probably affects the memory consumption of UI, but also ES.

There were several times when my cloud instance died because those API requests consumed a lot of resources. 25 fields, 6 seconds responses, but multiple requests.

Here example of how in UI, to render around 10 fields for an event, we potentially process a 20k array to a map, to be able to get names. It's not yet clear how it affects UI, more performance tests are needed, but potentially it can cause some problems.

kqualters-elastic commented 8 months ago

Regardless of if the api call is cached or not, I don't think this call should block the creation of the redux store. Creating an empty store, and then hitting this api and updating the store after it returns would offer a lot of benefits: easier debugging, show an empty message explaining what's happening and/or a slow message due to number of fields, provide better ux for users by not showing a blank page, a whole host of benefits. I started to do this in https://github.com/elastic/kibana/pull/158637 however there are a massive number of failed tests, and some components are expecting parts of state to always exist, where they wouldn't if we went the route of init after store creation.

kqualters-elastic commented 8 months ago

On the frontend, I suspect there are some useEffects that are only being prevented from running in an infinite loop through use of // eslint-disable-next-line react-hooks/exhaustive-deps , and the necessity of https://github.com/elastic/kibana/blob/main/x-pack/plugins/security_solution/public/common/components/sourcerer/use_pick_index_patterns.tsx#L68 supports that ha. I think there are a few main problems: 1) a lot of information that exists in redux is duplicated, although in a slightly different form, in component state, manipulated via the dispatch of useState, and then sometimes set back in redux via actions fired via useEffect. 2) a lot of functions used in renders return something referentially different every time, like patternListToOptions. useState dispatch in combination with bullet 2 is where I think the tendency to an infinite loop is coming from, and to fix that, we should remove all instances of useState from x-pack/plugins/security_solution/public/common/components/sourcerer/use_pick_index_patterns.tsx , x-pack/plugins/security_solution/public/common/components/sourcerer/index.tsx, possibly more, and only change values in redux via actions, and only storing the minimal needed state. missingPatterns/setMissingPatterns is already a value on the sourcererScope model https://github.com/elastic/kibana/blob/main/x-pack/plugins/security_solution/public/common/components/sourcerer/index.tsx#L70, not sure why it needs to be in state as well, lots of other usages of useState seem duplicated as well. A lot of the useEffects seem to be used as a communication mechanism back to a parent component, overall number of them should be reduced a lot.

e40pud commented 8 months ago

Just wanna bump this one https://github.com/elastic/kibana/issues/172534#issuecomment-1910603248 as we've got another SDH asking for supporting custom alerts indices in Security Data View Alerts.