fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Policy automation telemetry #22917

Open iansltx opened 1 month ago

iansltx commented 1 month ago

lukeheath commented 2 weeks ago

@iansltx Thanks for filing this! Since this involves changes in the statistics we report, I want to run it through the drafting board before prioritizing to the release board.

@noahtalerman All of the information Ian is proposing will be readily available, so this should be a small effort.

noahtalerman commented 2 weeks ago

num scripts run due to a policy failure, num installs initiated due to a policy failure

@iansltx how could collecting the number of runs help us?

I'm assuming this is an absolute count over the entire history of a Fleet instance.

noahtalerman commented 2 weeks ago

Goal

User story
As a Fleet engineer,
I want visibility into how often policy automations are used for installs and script runs
so that I can optimize performance and UX to support customer use cases.

Objective

This smallish task will help ensure we build policy automations in a way that customers using Fleet Premium get the best experience out of those features.

Context

Changes

For telemetry stats, include counts for:

  1. Policies that have install automations
  2. Policies that have script automations
  3. Scripts run due to a policy failure
  4. Installs initiated due to a policy failure

Once #22424 is implemented, all of the above should be trivially queryable from existing database tables.
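
As a rough illustration of the four counts listed above, here is a minimal sketch against an in-memory SQLite database. Everything about the schema is an assumption made for this example: the table names (`policies`, `script_runs`, `software_installs`) and columns (`installer_id`, `script_id`, `policy_id`) are hypothetical and are not taken from Fleet's actual database, and the stat names are likewise made up.

```python
import sqlite3

# Hypothetical schema for illustration only; NOT Fleet's actual tables.
SCHEMA = """
CREATE TABLE policies (
    id INTEGER PRIMARY KEY,
    installer_id INTEGER,  -- non-NULL when the policy has an install automation
    script_id INTEGER      -- non-NULL when the policy has a script automation
);
CREATE TABLE script_runs (
    id INTEGER PRIMARY KEY,
    policy_id INTEGER      -- non-NULL when a policy failure triggered the run
);
CREATE TABLE software_installs (
    id INTEGER PRIMARY KEY,
    policy_id INTEGER      -- non-NULL when a policy failure triggered the install
);
"""

# One COUNT query per proposed statistic (stat names are hypothetical).
QUERIES = {
    "policies_with_install_automation":
        "SELECT COUNT(*) FROM policies WHERE installer_id IS NOT NULL",
    "policies_with_script_automation":
        "SELECT COUNT(*) FROM policies WHERE script_id IS NOT NULL",
    "scripts_run_by_policy_failure":
        "SELECT COUNT(*) FROM script_runs WHERE policy_id IS NOT NULL",
    "installs_initiated_by_policy_failure":
        "SELECT COUNT(*) FROM software_installs WHERE policy_id IS NOT NULL",
}


def collect_policy_automation_stats(conn):
    """Run each count query and return the results keyed by stat name."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in QUERIES.items()}


def demo_policy_automation_stats():
    """Populate a small sample data set and collect the four counts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO policies (installer_id, script_id) VALUES (?, ?)",
        [(1, None), (None, 7), (2, 8), (None, None)],
    )
    conn.executemany(
        "INSERT INTO script_runs (policy_id) VALUES (?)",
        [(2,), (3,), (None,)],
    )
    conn.executemany(
        "INSERT INTO software_installs (policy_id) VALUES (?)",
        [(1,), (None,), (None,)],
    )
    return collect_policy_automation_stats(conn)
```

In this sketch a policy with both automations is counted once under each of the first two stats, and runs or installs started manually (no `policy_id`) are excluded from the last two.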

Product

Engineering

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

Manual testing steps

  1. Add some policy automations for both script runs and installs
  2. Fail those policies for some hosts
  3. Trigger stats collection and confirm that the reported stats include the expected values

Confirmation

  1. [ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. [ ] QA (@____): Added comment to user story confirming successful completion of QA.

iansltx commented 2 weeks ago

@noahtalerman Yep, absolute count.

This would inform whether e.g. we're seeing customers use policy automations for patch management or other activities that are expected to fail as part of bringing a fleet of machines into spec. This would have implications for e.g. prioritizing #22920 or other UX improvements where policy failures are routine/expected rather than exceptional.

We could implement policy failure count telemetry on a rolling "since X days ago" basis as well and get similar information. Just a matter of matching how the telemetry is aggregated/displayed, and I'm thinking that running totals would be easier to manage from a metrics display perspective (but a bit heavier on the DB for collection of the data).
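
To make the two aggregation options concrete, here is a minimal sketch that computes the same count either as an all-time running total or over a rolling window. It assumes a hypothetical `script_runs` table with `policy_id` and `created_at` columns, stored in an in-memory SQLite database; none of these names come from Fleet's schema, and the reference date is arbitrary.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def count_policy_triggered_runs(conn, now, since_days=None):
    """Count script runs triggered by a policy failure.

    since_days=None returns the all-time running total; otherwise only
    runs created within the last `since_days` days are counted.
    """
    sql = "SELECT COUNT(*) FROM script_runs WHERE policy_id IS NOT NULL"
    params = []
    if since_days is not None:
        cutoff = now - timedelta(days=since_days)
        # Timestamps are stored as ISO 8601 strings in a single UTC format,
        # so string comparison matches chronological order.
        sql += " AND created_at >= ?"
        params.append(cutoff.isoformat())
    return conn.execute(sql, params).fetchone()[0]


def demo_running_vs_rolling():
    """Return (running_total, last_30_days) for a small sample data set."""
    now = datetime(2024, 11, 1, tzinfo=timezone.utc)  # arbitrary reference time
    conn = sqlite3.connect(":memory:")
    # Hypothetical table for illustration only; NOT Fleet's actual schema.
    conn.execute(
        "CREATE TABLE script_runs ("
        "id INTEGER PRIMARY KEY, policy_id INTEGER, created_at TEXT)"
    )
    rows = [
        (1, now - timedelta(days=1)),     # policy-triggered, recent
        (1, now - timedelta(days=10)),    # policy-triggered, recent
        (2, now - timedelta(days=45)),    # policy-triggered, outside the window
        (None, now - timedelta(days=2)),  # manual run, never counted
    ]
    conn.executemany(
        "INSERT INTO script_runs (policy_id, created_at) VALUES (?, ?)",
        [(policy_id, ts.isoformat()) for policy_id, ts in rows],
    )
    running_total = count_policy_triggered_runs(conn, now)
    last_30_days = count_policy_triggered_runs(conn, now, since_days=30)
    return running_total, last_30_days
```

With this sample data the running total covers all three policy-triggered runs, while the 30-day window drops the one from 45 days ago. The only difference between the two modes is the extra `created_at` filter, so the choice mostly comes down to how the numbers are aggregated and displayed downstream.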