MindyRS opened this issue 1 month ago
Pinging @elastic/security-solution (Team: SecuritySolution)
Pinging @elastic/security-threat-hunting (Team:Threat Hunting)
Thanks for opening this issue @MindyRS. Took a look into it, and the primary source of slowness for this page is the Events histogram and the Threat intelligence panel on the overview page, as shown in the following video:
https://github.com/user-attachments/assets/a087a69b-6ec7-4c45-9d82-5421854285ab
When investigating the query that populates that view, there isn't much that can be done to improve the performance of the events histogram query itself. Testing with @angorayc, who had investigated this issue before, we noticed that the only thing that improved the query's performance was removing logs-*
from the index patterns. While possible, that is impractical, because users can update the default data view to whatever index patterns they see fit. Alternatively, we can collapse that visualization by default on page render. Will be discussing this with @paulewing.
Here's a screenshot of the search profiler for reference:
As an aside, I noticed that performance is tracked by how long the global loading indicator stays on the page. I'm not sure that's a good measure of performance compared to something like time to interaction for whatever the core workflow on the page might be. For instance, on the hosts page we could ask: when is the events table loaded? The only reason I bring this up is that the slowness we're seeing also happens in other parts of the application, such as the Discover histogram, but while that visualization takes time to load, the rest of the Discover application can still be interacted with.
This is the query we use to render the events histogram. It'd be great to know what the main cause of the performance issue is.
As @michaelolo24 mentioned, the performance issue happens when logs-*
is involved, but sometimes it doesn't seem to be related to logs-*
either. For example, it took 6s for the page to fetch the data, and it indicates that most of the time was spent querying and aggregating on the .alerts-*
data view.
Would like to know if there is a way to figure out whether the performance issue is caused by the data view or by the query we use, and how to improve or avoid the issue.
GET .alerts-*,auditbeat-*,filebeat-*,logs-*,winlogbeat-*/_search
{
"aggs": {
"0": {
"terms": {
"field": "event.dataset",
"order": {
"_count": "desc"
},
"size": 10,
"shard_size": 25
},
"aggs": {
"1": {
"date_histogram": {
"field": "@timestamp",
"fixed_interval": "30m",
"time_zone": "UTC",
"extended_bounds": {
"min": 1725840000000,
"max": 1725926399999
}
}
}
}
}
},
"size": 0,
"_source": {
"excludes": []
},
"query": {
"bool": {
"must": [],
"filter": [
{
"bool": {
"must": [],
"filter": [],
"should": [],
"must_not": []
}
},
{
"bool": {
"should": [
{
"match_phrase": {
"_index": ".alerts-*"
}
},
{
"match_phrase": {
"_index": "auditbeat-*"
}
},
{
"match_phrase": {
"_index": "filebeat-*"
}
},
{
"match_phrase": {
"_index": "logs-*"
}
},
{
"match_phrase": {
"_index": "winlogbeat-*"
}
}
],
"minimum_should_match": 1
}
},
{
"range": {
"@timestamp": {
"format": "strict_date_optional_time",
"gte": "2024-09-09T00:00:00.000Z",
"lte": "2024-09-09T23:59:59.999Z"
}
}
}
],
"should": [],
"must_not": []
}
},
"stored_fields": [
"*"
],
"runtime_mappings": {
"test": {
"type": "keyword"
},
"custom_field": {
"type": "keyword",
"script": {
"source": "emit(doc['@timestamp'].value.getDayOfWeekEnum().toString())"
}
},
"Day": {
"type": "keyword",
"script": {
"source": "emit(doc['@timestamp'].value.getDayOfWeekEnum().toString())"
}
}
},
"script_fields": {}
}
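One way to answer the question of whether a particular index pattern (e.g. .alerts-* vs logs-*) is the bottleneck is to re-run the query above with "profile": true (the Elasticsearch Profile API, which is also what the Search Profiler UI in the screenshot uses) and compare the reported time per index. Below is a minimal sketch of summarizing such a response in Python; the helper name, the shard-id parsing, and the synthetic response fragment are illustrative assumptions, not output from these environments:

```python
import re
from collections import defaultdict

def profile_time_per_index(profile_shards):
    """Sum reported query and aggregation time (in nanoseconds) per index
    from the "shards" list of an Elasticsearch _search profile response."""
    totals = defaultdict(int)
    for shard in profile_shards:
        # Profile shard ids look like "[nodeId][indexName][shardNumber]".
        m = re.match(r"\[[^\]]*\]\[([^\]]*)\]\[\d+\]", shard["id"])
        index = m.group(1) if m else shard["id"]
        for search in shard.get("searches", []):
            for query in search.get("query", []):
                totals[index] += query["time_in_nanos"]
        for agg in shard.get("aggregations", []):
            totals[index] += agg["time_in_nanos"]
    return dict(totals)

# Synthetic example fragment (shapes only, not real timings or index names):
shards = [
    {"id": "[node1][.alerts-default-000001][0]",
     "searches": [{"query": [{"time_in_nanos": 4_000_000}]}],
     "aggregations": [{"time_in_nanos": 2_000_000_000}]},
    {"id": "[node1][logs-endpoint.events-default][0]",
     "searches": [{"query": [{"time_in_nanos": 9_000_000}]}],
     "aggregations": [{"time_in_nanos": 500_000_000}]},
]
print(profile_time_per_index(shards))
```

Comparing these per-index totals across the data view's patterns would show whether the .alerts-* or logs-* indices dominate the time spent; the same query can also be issued against each pattern separately and the top-level `took` values compared.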
Describe the bug:
On both the QA and Production security testing environments, we are exceeding the max page load time of 10 seconds (Today) and 15 seconds (7 days). This is also happening on a similarly configured and tested ESS instance.
These two environments are exercised regularly by synthetic scripts loading a series of dashboards with different time filters. Dashboards for these synthetic runs can be found HERE.
Scenario 1 Steps - runs every 10 minutes, uses the Today time span for all pages
Scenario 2 Steps - runs every hour, uses the Last 7 Days time span
Scenario 3 Steps - runs every 4 hours, uses the Last 30 Days time span
Production Environment Instance (ID bcd2dc79ea2e4f0d801aa769fdee3dc2)
QA Environment Instance (ID cc5b2b7dab2b4b7b98a3a3eae71e8f98)
Browser and Browser OS versions:
Chrome
Elastic Endpoint version:
NA
Steps to reproduce:
Current behavior:
Expected behavior:
Screenshots (if relevant):
Errors in browser console (if relevant):
Provide logs and/or server output (if relevant): ES Logs for QA Environment
Any additional context (logs, chat logs, magical formulas, etc.): Page load times for all UI elements