MindyRS opened this issue 1 month ago
Pinging @elastic/security-solution (Team: SecuritySolution)
Pinging @elastic/security-threat-hunting (Team:Threat Hunting)
Thanks for opening this issue @MindyRS. Took a look into it, and the primary source of slowness for this page is the Events histogram and the Threat intelligence panel on the overview page, as shown in the following video:
https://github.com/user-attachments/assets/a087a69b-6ec7-4c45-9d82-5421854285ab
When investigating the query that populates that view, there isn't much that can be done to improve the performance of the events histogram query itself. Testing with @angorayc, who had investigated this issue before, we noticed that the only thing that improved the query's performance was removing logs-*
from the index patterns. While possible, that is impractical, because users can update the default data view to whatever index patterns they see fit. Alternatively, we can collapse that visualization by default on page render. Will be discussing this with @paulewing.
Here's a screenshot of the search profiler for reference:
As an aside, I noticed that performance is tracked by how long the global loading indicator stays on the page. I'm not sure that's a good measure of performance compared to something like time to interaction for whatever the core workflow on the page might be. For instance, on the hosts page we could ask: when is the events table loaded? The only reason I bring this up is that the slowness we're seeing also happens in other parts of the application, such as the Discover histogram, but while that visualization takes time to load, the rest of the Discover application can still be interacted with.
This is the query we use to render the events histogram. It'd be great to know what the main cause of the performance issue is.
As @michaelolo24 mentioned, the performance issue happens when logs-*
is involved, but sometimes it doesn't seem to be related to logs-*
either. For example, it took 6s for the page to fetch the data, and it indicates that most of the time was spent querying and aggregating on the .alerts-*
data view.
Would like to know if there is a way to figure out whether the performance issue is caused by the data view or by the query we use, and how to improve or avoid the issue.
GET .alerts-*,auditbeat-*,filebeat-*,logs-*,winlogbeat-*/_search
{
"aggs": {
"0": {
"terms": {
"field": "event.dataset",
"order": {
"_count": "desc"
},
"size": 10,
"shard_size": 25
},
"aggs": {
"1": {
"date_histogram": {
"field": "@timestamp",
"fixed_interval": "30m",
"time_zone": "UTC",
"extended_bounds": {
"min": 1725840000000,
"max": 1725926399999
}
}
}
}
}
},
"size": 0,
"_source": {
"excludes": []
},
"query": {
"bool": {
"must": [],
"filter": [
{
"bool": {
"must": [],
"filter": [],
"should": [],
"must_not": []
}
},
{
"bool": {
"should": [
{
"match_phrase": {
"_index": ".alerts-*"
}
},
{
"match_phrase": {
"_index": "auditbeat-*"
}
},
{
"match_phrase": {
"_index": "filebeat-*"
}
},
{
"match_phrase": {
"_index": "logs-*"
}
},
{
"match_phrase": {
"_index": "winlogbeat-*"
}
}
],
"minimum_should_match": 1
}
},
{
"range": {
"@timestamp": {
"format": "strict_date_optional_time",
"gte": "2024-09-09T00:00:00.000Z",
"lte": "2024-09-09T23:59:59.999Z"
}
}
}
],
"should": [],
"must_not": []
}
},
"stored_fields": [
"*"
],
"runtime_mappings": {
"test": {
"type": "keyword"
},
"custom_field": {
"type": "keyword",
"script": {
"source": "emit(doc['@timestamp'].value.getDayOfWeekEnum().toString())"
}
},
"Day": {
"type": "keyword",
"script": {
"source": "emit(doc['@timestamp'].value.getDayOfWeekEnum().toString())"
}
}
},
"script_fields": {}
}
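One way to answer the question of whether a particular index pattern (e.g. .alerts-* vs logs-*) is the bottleneck is to re-run the query above with "profile": true (the Elasticsearch Profile API, which is also what the Search Profiler UI in the screenshot uses) and compare the reported time per index. Below is a minimal sketch of summarizing such a response in Python; the helper name, the shard-id parsing, and the synthetic response fragment are illustrative assumptions, not output from these environments:

```python
import re
from collections import defaultdict

def profile_time_per_index(profile_shards):
    """Sum reported query and aggregation time (in nanoseconds) per index
    from the "shards" list of an Elasticsearch _search profile response."""
    totals = defaultdict(int)
    for shard in profile_shards:
        # Profile shard ids look like "[nodeId][indexName][shardNumber]".
        m = re.match(r"\[[^\]]*\]\[([^\]]*)\]\[\d+\]", shard["id"])
        index = m.group(1) if m else shard["id"]
        for search in shard.get("searches", []):
            for query in search.get("query", []):
                totals[index] += query["time_in_nanos"]
        for agg in shard.get("aggregations", []):
            totals[index] += agg["time_in_nanos"]
    return dict(totals)

# Synthetic example fragment (shapes only, not real timings or index names):
shards = [
    {"id": "[node1][.alerts-default-000001][0]",
     "searches": [{"query": [{"time_in_nanos": 4_000_000}]}],
     "aggregations": [{"time_in_nanos": 2_000_000_000}]},
    {"id": "[node1][logs-endpoint.events-default][0]",
     "searches": [{"query": [{"time_in_nanos": 9_000_000}]}],
     "aggregations": [{"time_in_nanos": 500_000_000}]},
]
print(profile_time_per_index(shards))
```

Comparing these per-index totals across the data view's patterns would show whether the .alerts-* or logs-* indices dominate the time spent; the same query can also be issued against each pattern separately and the top-level `took` values compared.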
Describe the bug:
On both the QA and Production security testing environments, we are exceeding the max page load time of 10 seconds (Today) and 15 seconds (7 days). This is also happening on a similarly configured and tested ESS instance.
These two environments are exercised regularly by synthetic scripts loading a series of dashboards with different time filters. Dashboards for these synthetic runs can be found HERE.
Scenario 1 Steps - runs every 10 minutes, uses the Today time span for all pages
Scenario 2 Steps - runs every hour, uses the Last 7 Days time span
Scenario 3 Steps - runs every 4 hours, uses the Last 30 Days time span
Production Environment Instance (ID bcd2dc79ea2e4f0d801aa769fdee3dc2)
QA Environment Instance (ID cc5b2b7dab2b4b7b98a3a3eae71e8f98)
Browser and Browser OS versions:
Chrome
Elastic Endpoint version:
NA
Steps to reproduce:
Current behavior:
Expected behavior:
Screenshots (if relevant):
Errors in browser console (if relevant):
Provide logs and/or server output (if relevant): ES Logs for QA Environment
Any additional context (logs, chat logs, magical formulas, etc.): Page load times for all UI elements