Open yctercero opened 11 months ago
We got this SDH where user is asking whether it is possible to extend data view on Alerts page to include alerts from remote clusters. Right now data view is restricted to .alerts-security.alerts-default
index pattern only.
It could be a nice enhancement for the users, so adding this request here.
https://github.com/elastic/kibana/assets/7609147/86fc3bbf-b9b6-4f67-9be3-a394301941b8
Next steps: analyse if it's possible to render some data while we fetch fields in background
So in the video below, I hardcoded the response from the fields API, and set fields as: []
https://github.com/elastic/kibana/assets/7609147/c6447231-82e9-4644-a7c2-8f714c8499c5
We can see that there are some errors for filters, but otherwise, it's still a huge value, to load alerts rules page, or others.
My proposal:
After some investigation, it will be clear if we can do it page by page, or if there need to migrate all pages in one. Page by page is a less risky approach
Another observation:
Looks like internal/data_views/_fields_for_wildcard
blocks all security pages on the first load.
When we go to another page (client-side navigation) it blocks only the alert page render.
Also if the user goes to another page, and then to another, we can see that the system makes multiple requests to field API, which makes waiting longer, and in some cases, ES can die because of that.
This post aims to provide a performance overview of the sourcerer at the moment and summarise my finding.
TLDR: Sourcerer can still cause performance problems and affect page load if the client has a lot of indices/fields. Browser caching can help significantly, but not fix the problem 100%. There are potential relatively quick wins that allow us to unblock page load and reduce the number of API calls. We need to rethink our future use of the sourcerer and fetching fields strategy.
Sourcerer - is a component which provides security solution info about data views, indices, patterns, and fields.
Currently, it uses the data_view plugin, which makes API calls to /internal/data_views/fields
(Previously internal/data_views/_fields_for_wildcard
)
If users have a lot of data, these API calls can be slow. In extreme cases with 1000 indices and 25k fields for default cloud env - the response takes around 25 seconds.
At the moment, for Infosec this request takes ~ 7 seconds and contains ~27k fields.
This API call blocks the initial load of every security page. So every page reload, or new tab opening will cause a wait for this API response and users will see no data, just the loader.
It's happening, because we are fetching data view when we create a redux store on plugin initialisation.
Further client navigation to other pages doesn't request this API, except the Alerts page, which is render always blocked by this API.
This is happening, because of the filter component.
Recently there was a browser cache implemented for /internal/data_views/fields
It uses stale-while-revalidate under the hood.
Cache-Control: private, max-age=5, stale-while-revalidate=31535995
Here is an example of how the cache works. Is means that in theory, after we have value in the cache we have 3 ways:
In real-life testing, it was working well, when I did have a small response from this API.
In testing, with the response which takes ~6 seconds, I noticed 2 things:
https://drive.google.com/file/d/11SeonxolRrlrsAgaeZTOjqUbDXfVLDVc/view?usp=drive_link
For clients who have very slow responses, or do not have stable revalidation, we can recommend increasing data_views:cache_max_age
in Advanced settings as a workaround.
Summary: cache works well for a majority of cases, and will dramatically increase user performance. There are cases with cancellation, or not using cache which we need to investigate.
Issue It's not blocking the initial page load, meaning not query the whole data view when we create a store. Interestingly enough, this optimisation already was done, and then we migrated to using data view API, and it stopped working.
Issue When we go to some page, and then back to the Alerts page, it makes an additional request to data view API and for some reason, this request is not cached:
https://github.com/elastic/kibana/assets/7609147/3f327a3e-e678-47fc-a75f-1f2149ada207
There are interesting RFC which aims to be able to load fields async.
We should probably reconsider if we should store on UI the whole list of fields. It's hard to imagine how users need 25k fields at the same time. But keep this in memory, passing as props across applications probably affects the memory consumption of UI, but also ES.
There were several times when my cloud instance died because those API requests consumed a lot of resources. 25 fields, 6 seconds responses, but multiple requests.
Here example of how in UI, to render around 10 fields for an event, we potentially process a 20k array to a map, to be able to get names. It's not yet clear how it affects UI, more performance tests are needed, but potentially it can cause some problems.
Regardless of if the api call is cached or not, I don't think this call should block the creation of the redux store. Creating an empty store, and then hitting this api and updating the store after it returns would offer a lot of benefits: easier debugging, show an empty message explaining what's happening and/or a slow message due to number of fields, provide better ux for users by not showing a blank page, a whole host of benefits. I started to do this in https://github.com/elastic/kibana/pull/158637 however there are a massive number of failed tests, and some components are expecting parts of state to always exist, where they wouldn't if we went the route of init after store creation.
On the frontend, I suspect there are some useEffects that are only being prevented from running in an infinite loop through use of
// eslint-disable-next-line react-hooks/exhaustive-deps
, and the necessity of https://github.com/elastic/kibana/blob/main/x-pack/plugins/security_solution/public/common/components/sourcerer/use_pick_index_patterns.tsx#L68 supports that ha. I think there are a few main problems: 1) a lot of information that exists in redux is duplicated, although in a slightly different form, in component state, manipulated via the dispatch of useState, and then sometimes set back in redux via actions fired via useEffect. 2) a lot of functions used in renders return something referentially different every time, like patternListToOptions. useState dispatch in combination with bullet 2 is where I think the tendency to an infinite loop is coming from, and to fix that, we should remove all instances of useState from x-pack/plugins/security_solution/public/common/components/sourcerer/use_pick_index_patterns.tsx , x-pack/plugins/security_solution/public/common/components/sourcerer/index.tsx, possibly more, and only change values in redux via actions, and only storing the minimal needed state. missingPatterns/setMissingPatterns is already a value on the sourcererScope model https://github.com/elastic/kibana/blob/main/x-pack/plugins/security_solution/public/common/components/sourcerer/index.tsx#L70, not sure why it needs to be in state as well, lots of other usages of useState seem duplicated as well. A lot of the useEffects seem to be used as a communication mechanism back to a parent component, overall number of them should be reduced a lot.
Just wanna bump this one https://github.com/elastic/kibana/issues/172534#issuecomment-1910603248 as we've got another SDH asking for supporting custom alerts indices in Security Data View Alerts
.
TLDR: Centralize discussion on sourcerer improvements.
Summary
Sourcerer refers to the logic used in the security solution UI that determines what indices to query on what pages. The initial design of this component included use of the default security solution data view. This data view includes all the indices specified under the Advanced Settings > Security Solution > Default indices. Out of the box, this list includes very broad index patterns like
logs-*
.Product may be able to give more detailed background here, but the initial intent behind creating this all encompassing data view was to allow the user to interact with all aspects of the app without needing to worry about specifying a source index for the data to start streaming in. So as an example - when a user adds an integration, that data is sourced to a
logs-INTEGRATION_NAME
index that gets picked up bylogs-*
. Instead of having to update the advanced settings every time a new integration is added, it "just works" out of the box.Issues
Periodically, and again recently, we have hit performance issues. Often the issues observed are:
Proposed next steps
While
logs-*
may be the offending index pattern at the moment, the issue at large is how we scale. Today, it may belogs-*
, but next week it may be another that has ballooned.