elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Epic] Gather user/usage telemetry from the Entity Analytics dashboard #145276

Closed machadoum closed 1 year ago

machadoum commented 1 year ago

Goal: Research

To better understand user behavior in Security Solutions, we need to add telemetry to the Entity Analytics page. This ticket collects a list of metrics relevant to Entity Analytics.

Appended Feb 24 @SourinPaul: I'm assuming we can collect cloud adoption telemetry from FullStory. To quantify the user value, let's prioritize collecting a few important usage metrics. This will also help drive subsequent roadmap enhancements.

Index:

Tags:

[core-telemetry] - Telemetry that should be tracked across all kibana pages
[job-metric] - Job metrics that the ML team should collect

Key user actions:

Anomaly job status:

Key features:

UI assets: (as of 8.7)

Feature Usage:

Feature Adoption:

Descoped:

References:

SourinPaul commented 1 year ago

@machadoum updated the ticket body ^^ with telemetry instrumentation ideas.

elasticmachine commented 1 year ago

Pinging @elastic/security-threat-hunting (Team:Threat Hunting)

elasticmachine commented 1 year ago

Pinging @elastic/security-solution (Team: SecuritySolution)

machadoum commented 1 year ago

I added tags to classify which teams are responsible for implementing each metric. We need to reach out to these teams to check which metrics they have already implemented and how we could add the ones they haven't.

SourinPaul commented 1 year ago

@machadoum can you please update this ticket with your findings? Thanks!

machadoum commented 1 year ago

Hi @SourinPaul, This ticket is still in our backlog, but I haven't found the time to work on it. I should have some time for it after 8.7 FF.

machadoum commented 1 year ago

Current Entity Analytics Page Telemetry

The Kibana usage_collection plugin automatically collects Application usage telemetry. It contains the number of clicks and minutes on screen aggregated by 7, 30, and 90 days:

"appId": "security-solutions",
"viewId": "entity-analytics",
"clicks_7_days": 10,
"clicks_30_days": 20,
"clicks_90_days": 100,
"clicks_total": 140,
"minutes_on_screen_7_days": 1.5,
"minutes_on_screen_30_days": 10.0,
"minutes_on_screen_90_days": 11.5,
"minutes_on_screen_total": 32.5


We send UI counter events when the user enables or disables an ML job using the Security Solutions UI (this includes the ML widget). A counter event is a key/value pair.

"key":"siem_job_enabled",
"value":1
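To illustrate the key/value shape above, here is a minimal sketch of how counter events could be tallied. `CounterStore`, `increment`, and `snapshot` are hypothetical names made up for this example, and `siem_job_disabled` is an assumed key; this is not the usage_collection plugin's API.

```typescript
// Hypothetical in-memory tally of UI counter events.
// This is NOT the real usage_collection API, only an illustration.
interface CounterEvent {
  key: string;
  value: number;
}

class CounterStore {
  private counts: Record<string, number> = {};

  // Add `by` to the running total for `key`.
  increment(key: string, by: number = 1): void {
    this.counts[key] = (this.counts[key] || 0) + by;
  }

  // Return the tallies as key/value pairs, sorted by key for stable output.
  snapshot(): CounterEvent[] {
    const keys = Object.keys(this.counts).sort();
    return keys.map((key) => ({ key: key, value: this.counts[key] }));
  }
}

const store = new CounterStore();
store.increment('siem_job_enabled');
store.increment('siem_job_disabled'); // assumed key, not confirmed by the thread
store.increment('siem_job_enabled');
const counterSnapshot = store.snapshot();
```

Only the totals per key survive in a store like this, which is why any dimension you want to filter on later has to be encoded in the key up front.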

We send UI counter events when an error happens during the job installation:

Application usage and counter events are enhanced with a set of properties such as app version, cloud.account-id, cluster-name, license.type, etc. This extra data can be used as a filter or to plot visualizations:

Links: Dashboard with entity analytics telemetry, Usage collection plugin doc, A spreadsheet summarizing security solutions telemetry

machadoum commented 1 year ago

Feasibility of the Requested Telemetry

Custom telemetry that the security solutions team can implement

  • Daily user action to filter Risk Score panels with Risk Classification filters per cluster

We can send UI counter events when users click on "Risk Classification filters". But currently, UI counters are aggregated by 7, 30, and 90 days.


  • User action to launch new investigations by a logged-in user (anonymize the user)

We can send UI counter events when users click to launch new investigations. But as far as I am aware, we don't have user data to aggregate by.


  • User action applying 'Host/ User Names' as global filters. Normalize by a logged-in user.

We can send UI counter events when the user clicks on the Hover filter for a 'Host/ User Names' field. Should we send the event on every page? Or only when the user is on the entity analytics page? We can't filter it afterwards.


  • Time to enable the feature by key features

What is a key feature? Is it the ML job?

When does the time start ticking? When the entity analytics page loads?


Generic telemetry that could be collected by the platform-analytics

The following metrics are generic and could be collected by the Application usage event for every page, not only EA. So let's create feature requests for the @elastic/platform-analytics team.

  • Avg. time spent on the page before key user action is taken
  • Daily engaged users on the EA page (time spent > 5 mins) && 1+ key user action

What is a key user interaction?


  • Daily impressions on the EA page (time spent > 1 mins & < 5 mins)

Application usage already collects page impression and screen time. Is it enough?


  • Total # of unique user sessions (unique logged-in users) per hour
  • Total time spent on the page by each logged-in user per session

Application usage doesn't collect user and session data. It also doesn't aggregate per hour. We should create a feature request to @elastic/platform-analytics.
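If per-session data were available, the engagement metrics requested above could be derived roughly as follows. This is only a sketch: `PageSession` and its fields are hypothetical (Application usage does not collect them today), and the thresholds are copied from the metric definitions in the list.

```typescript
// Hypothetical per-session record; no such data is collected today.
interface PageSession {
  userId: string; // anonymized user identifier (assumed)
  day: string; // e.g. '2023-02-08'
  minutesOnPage: number;
  keyActions: number; // number of key user actions in the session
}

type Engagement = 'engaged' | 'impression' | 'other';

// Thresholds follow the requested metrics:
// engaged = more than 5 minutes and at least one key action,
// impression = between 1 and 5 minutes (exclusive).
function classify(session: PageSession): Engagement {
  if (session.minutesOnPage > 5 && session.keyActions >= 1) return 'engaged';
  if (session.minutesOnPage > 1 && session.minutesOnPage < 5) return 'impression';
  return 'other';
}

// Count distinct engaged users per day.
function dailyEngagedUsers(sessions: PageSession[]): Record<string, number> {
  const seen: Record<string, boolean> = {};
  const perDay: Record<string, number> = {};
  for (const s of sessions) {
    if (classify(s) !== 'engaged') continue;
    const id = s.day + '|' + s.userId;
    if (seen[id]) continue; // same user already counted for this day
    seen[id] = true;
    perDay[s.day] = (perDay[s.day] || 0) + 1;
  }
  return perDay;
}

const sampleSessions: PageSession[] = [
  { userId: 'u1', day: '2023-02-08', minutesOnPage: 6, keyActions: 2 },
  { userId: 'u1', day: '2023-02-08', minutesOnPage: 7, keyActions: 1 },
  { userId: 'u2', day: '2023-02-08', minutesOnPage: 3, keyActions: 0 },
  { userId: 'u3', day: '2023-02-09', minutesOnPage: 10, keyActions: 0 },
];
const engagedPerDay = dailyEngagedUsers(sampleSessions);
```

Note that a session of exactly 5 minutes, or a long session with no key action, falls into neither bucket under these definitions, which may be worth clarifying in the requirements.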


Telemetry that the ML team might collect

  • Daily count of anomaly jobs by status
  • Count of anomaly results by job_id per cluster

Those sound like generic ML telemetry, not specific to the EA page. Let's double-check whether the ML team already collects this data. Otherwise, we can create a task that runs every hour/day and sends counter metrics from the backend for each installed job.

machadoum commented 1 year ago

@ajosh0504 shared with me a dashboard that includes "Daily count of anomaly jobs by status" https://stack-telemetry.elastic.dev/s/machine-learning/app/dashboards#/view/39c6e0b8-8d23-572f-91f8-01f9a80d1b66?_g=(filters%3A!())

SourinPaul commented 1 year ago

Thanks, @machadoum. I have updated the ticket body based on your research findings.

Some additional questions:

User action applying 'Host/ User Names' as global filters. Normalize by a logged-in user. We can send UI counter events when the user clicks on the Hover filter for a 'Host/ User Names' field. Should we send the event on every page? Or only when the user is on the entity analytics page? We can't filter it afterwards.

Are you concerned that we cannot segment user-action counts by the kibana source if we collect on every page?

Application usage doesn't collect user and session data. It also doesn't aggregate per hour. We should create a feature request to https://github.com/orgs/elastic/teams/platform-analytics.

Did you get an ETA from the platform telemetry team for when such user/session telemetry may be available? I wonder if they are already working towards this, given that user assignment to cases has been introduced in the solution. Otherwise, I can follow up.

Note: I appended a new KPI, "Drill downs of anomalies", under the Feature Usage section.

machadoum commented 1 year ago

Let me correct a mistake on my side. If we use ui-counter events, we can segment them by day and hour. The 7-30-90 day aggregation limit only applies to application-usage events. I updated the ticket description to remove this limitation.

machadoum commented 1 year ago

Update: Due to the many limitations of ui-counter events, we decided to move in a different direction and use Event-Based Telemetry (EBT) instead. The main difference between ui-counter events and EBT is that EBT allows sending a series of events without the need to pre-aggregate those events on Kibana's side before sending. That capability will enable analysts to slice and dice the usage during analysis.
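The difference can be sketched as follows. The event name `risk_filter_clicked` and the `selectedSeverity` property are hypothetical and are not taken from the POC; the point is only that a pre-aggregated counter keeps a total, while raw events can be grouped by any property at analysis time.

```typescript
// Hypothetical raw event; field names are assumptions for illustration.
interface RawEvent {
  eventType: string;
  properties: Record<string, string>;
}

// Counter style: aggregate before sending, so only a total per event type is shipped.
function preAggregate(events: RawEvent[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const e of events) {
    totals[e.eventType] = (totals[e.eventType] || 0) + 1;
  }
  return totals;
}

// Event style: ship every event, then group by any property during analysis.
function countByProperty(events: RawEvent[], property: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const e of events) {
    const value = e.properties[property];
    if (value === undefined) continue; // event does not carry this property
    counts[value] = (counts[value] || 0) + 1;
  }
  return counts;
}

const rawEvents: RawEvent[] = [
  { eventType: 'risk_filter_clicked', properties: { selectedSeverity: 'critical' } },
  { eventType: 'risk_filter_clicked', properties: { selectedSeverity: 'high' } },
  { eventType: 'risk_filter_clicked', properties: { selectedSeverity: 'critical' } },
];

const shippedAsCounter = preAggregate(rawEvents);
const slicedLater = countByProperty(rawEvents, 'selectedSeverity');
```

With the counter, the breakdown by severity is gone unless it was baked into the key; with raw events, the same question can be asked after the fact.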

I have added three events to the Entity Analytics page as a POC (https://github.com/elastic/kibana/pull/152338). The data is shipped to ebt-kibana-browser index and can be accessed on staging here: https://telemetry-v2-staging.elastic.dev/.

I created a Lens visualization to illustrate how we can query the data.

If you explore the index, you will notice that event-specific data is stored inside the properties.* field, and general data is available inside the context.* field. We can extend both fields and add any properties we need. Some of the properties that are available by default are context.cloudId, context.license_type, context.session_id, context.userId, context.version, context.viewport_height and more.
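As a rough picture of that layout, the sketch below models a document with `properties` and `context` objects and filters on a dotted field path. The `event_type` field, the sample event names, and the helper functions are assumptions for illustration; only the `context.*` field names come from the list above.

```typescript
// Assumed document shape: event-specific data in `properties`, general data in `context`.
interface EbtDocument {
  event_type: string; // assumed field name
  properties: Record<string, unknown>;
  context: Record<string, unknown>;
}

// Resolve a dotted path such as 'context.license_type' against a document.
function getField(doc: EbtDocument, path: string): unknown {
  let current: unknown = doc;
  for (const part of path.split('.')) {
    if (current === null || typeof current !== 'object') return undefined;
    current = (current as Record<string, unknown>)[part];
  }
  return current;
}

// Keep only the documents whose field at `path` equals `expected`.
function filterByField(docs: EbtDocument[], path: string, expected: unknown): EbtDocument[] {
  return docs.filter((doc) => getField(doc, path) === expected);
}

const sampleDocs: EbtDocument[] = [
  {
    event_type: 'hypothetical_event_a',
    properties: { selectedSeverity: 'high' },
    context: { license_type: 'enterprise', version: '8.7.0' },
  },
  {
    event_type: 'hypothetical_event_b',
    properties: {},
    context: { license_type: 'basic', version: '8.7.0' },
  },
];

const enterpriseDocs = filterByField(sampleDocs, 'context.license_type', 'enterprise');
```

In practice this slicing would be done with a Kibana query or a Lens visualization over the index rather than in application code.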

EBT doc (requires vercel login)

machadoum commented 1 year ago

I am moving it to "done" because we implemented all telemetry that could be collected in the Explore area. @SourinPaul Please validate they satisfy product needs.

These items were NOT implemented: