Policy automation telemetry

iansltx commented 1 month ago

@noahtalerman: A Fleet contributor requested this because they want to understand how the new policy automation features are used. With that usage data, they can propose performance optimizations and UX improvements.
- @noahtalerman: Eventually Fleet could report num policies that have install automations, num policies that have script automations, num scripts run due to a policy failure, num installs initiated due to a policy failure.

lukeheath commented 2 weeks ago

@iansltx Thanks for filing this! Since this involves changes in the statistics we report, I want to run it through the drafting board before prioritizing to the release board.

@noahtalerman All of the information Ian is proposing will be readily available, so this should be a small effort.

noahtalerman commented 2 weeks ago

> num scripts run due to a policy failure, num installs initiated due to a policy failure

@iansltx how could collecting the number of runs help us?

I'm assuming this is an absolute count over the entire history of a Fleet instance.

noahtalerman commented 2 weeks ago

Goal

User story
As a Fleet engineer,
I want visibility into how often policy automations are used for installs and script runs
so that I can optimize performance and UX to support customer use cases.

Objective

This smallish task will help ensure we build policy automations in a way that gives customers using Fleet Premium the best experience with those features.

Context

Product designer: _____

Changes

For telemetry stats, include counts for:

Policies that have install automations
Policies that have script automations
Scripts run due to a policy failure
Installs initiated due to a policy failure

Once #22424 is implemented, all of the above should be trivially queryable from existing database tables.
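As a rough illustration of the four counts listed above, here is a minimal in-memory sketch in Go. It is not Fleet's actual schema or stats code: the `Policy`, `Execution`, and `PolicyAutomationStats` types, their fields, and `collectStats` are all hypothetical names chosen for this example. The one assumption carried over from the issue is that each run can be linked back to the policy that triggered it.

```go
package main

import "fmt"

// Policy is a hypothetical, simplified stand-in for a policy record.
// A nil pointer means no automation of that kind is configured.
type Policy struct {
	ID          uint
	InstallerID *uint // hypothetical: installer to run when the policy fails
	ScriptID    *uint // hypothetical: script to run when the policy fails
}

// Execution is a hypothetical record of one install or script run.
// PolicyID is nil when the run was not triggered by a policy failure.
type Execution struct {
	Kind     string // "install" or "script"
	PolicyID *uint
}

// PolicyAutomationStats holds the four counts proposed in this issue.
type PolicyAutomationStats struct {
	PoliciesWithInstallAutomation int
	PoliciesWithScriptAutomation  int
	PolicyTriggeredScriptRuns     int
	PolicyTriggeredInstalls       int
}

// collectStats tallies the four counts from in-memory records.
func collectStats(policies []Policy, execs []Execution) PolicyAutomationStats {
	var s PolicyAutomationStats
	for _, p := range policies {
		if p.InstallerID != nil {
			s.PoliciesWithInstallAutomation++
		}
		if p.ScriptID != nil {
			s.PoliciesWithScriptAutomation++
		}
	}
	for _, e := range execs {
		if e.PolicyID == nil {
			continue // manual run, not caused by a policy failure
		}
		switch e.Kind {
		case "script":
			s.PolicyTriggeredScriptRuns++
		case "install":
			s.PolicyTriggeredInstalls++
		}
	}
	return s
}

// ptr returns a pointer to v, for building example records.
func ptr(v uint) *uint { return &v }

func main() {
	policies := []Policy{
		{ID: 1, InstallerID: ptr(10)},
		{ID: 2, ScriptID: ptr(20)},
		{ID: 3}, // no automations
	}
	execs := []Execution{
		{Kind: "install", PolicyID: ptr(1)},
		{Kind: "script", PolicyID: ptr(2)},
		{Kind: "script", PolicyID: ptr(2)},
		{Kind: "script"}, // manual run, excluded from the counts
	}
	fmt.Printf("%+v\n", collectStats(policies, execs))
}
```

In a real implementation these would presumably be aggregate queries against the database rather than loops over in-memory slices; the sketch only pins down what each count includes and excludes.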

Product

[ ] Once shipped, requester has been notified

Engineering

[ ] Load testing: Check that the proposed queries don't meaningfully contribute to load on a production workload. Having a query that completes in <100ms should be sufficient here.
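One way to check the <100ms target during load testing is to time each stats query against a fixed budget. The sketch below assumes a hypothetical `withinBudget` helper and a `queryBudget` constant; the query itself is a placeholder function, not one of Fleet's datastore methods.

```go
package main

import (
	"fmt"
	"time"
)

// queryBudget mirrors the <100ms target mentioned in the load testing item.
const queryBudget = 100 * time.Millisecond

// withinBudget runs query once and reports how long it took, whether it
// succeeded in strictly less than budget, and any error it returned.
func withinBudget(budget time.Duration, query func() error) (time.Duration, bool, error) {
	start := time.Now()
	err := query()
	elapsed := time.Since(start)
	return elapsed, err == nil && elapsed < budget, err
}

func main() {
	// fakeQuery is a placeholder for a real stats query against a loaded database.
	fakeQuery := func() error {
		time.Sleep(5 * time.Millisecond)
		return nil
	}
	elapsed, ok, err := withinBudget(queryBudget, fakeQuery)
	fmt.Printf("elapsed=%v withinBudget=%v err=%v\n", elapsed, ok, err)
}
```

A single timing is noisy, so in practice the check would likely be repeated several times against a production-sized dataset and judged on the slowest or a high-percentile run.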

ℹ️ Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

Requires load testing: Yes
Risk level: Low

Manual testing steps

Add some policy automations for both script runs and installs
Fail those policies for some hosts
Trigger stats collection and confirm that the reported stats include the expected values

Confirmation

[ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
[ ] QA (@____): Added comment to user story confirming successful completion of QA.

iansltx commented 2 weeks ago

@noahtalerman Yep, absolute count.

This would inform whether e.g. we're seeing customers use policy automations for patch management or other activities that are expected to fail as part of bringing a fleet of machines into spec. This would have implications for e.g. prioritizing #22920 or other UX improvements where policy failures are routine/expected rather than exceptional.

We could implement policy failure count telemetry on a rolling "since X days ago" basis as well and get similar information. Just a matter of matching how the telemetry is aggregated/displayed, and I'm thinking that running totals would be easier to manage from a metrics display perspective (but a bit heavier on the DB for collection of the data).
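To make the two aggregation options concrete, here is a small Go sketch that compares a running total with a rolling "since X days ago" count over the same set of run timestamps. The function names and the example data are hypothetical; how Fleet would actually store and query these timestamps is not specified here.

```go
package main

import (
	"fmt"
	"time"
)

// countTotal returns the running total: every policy-triggered run on record.
func countTotal(runs []time.Time) int {
	return len(runs)
}

// countSince returns the number of runs in the window [now - days, now].
func countSince(runs []time.Time, now time.Time, days int) int {
	cutoff := now.AddDate(0, 0, -days)
	n := 0
	for _, at := range runs {
		if !at.Before(cutoff) && !at.After(now) {
			n++
		}
	}
	return n
}

// exampleNow and exampleRuns are made-up data for illustration only.
var exampleNow = time.Date(2024, time.November, 30, 0, 0, 0, 0, time.UTC)

var exampleRuns = []time.Time{
	exampleNow.AddDate(0, 0, -90),
	exampleNow.AddDate(0, 0, -45),
	exampleNow.AddDate(0, 0, -10),
	exampleNow.AddDate(0, 0, -1),
}

func main() {
	fmt.Println("running total:", countTotal(exampleRuns))
	fmt.Println("last 30 days: ", countSince(exampleRuns, exampleNow, 30))
}
```

The running total only ever grows, which keeps it simple to chart, while the rolling count reflects recent activity and depends on the window chosen.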
