Sharp edges cleanup for automated software installations based on policy failures

iansltx commented 2 months ago

Goal

User story
As an IT admin,
I want to have a smoother policy-based software install experience
so that I can be confident that my hosts are having software installed properly in response to failing policies.

Context

Requestor(s): @iansltx, ticket requested by @noahtalerman in this Slack thread
Product designer: _____

Cleanup items:

Version compare query footgun

Garden-variety version_compare queries, which we use as the example of how to build an "if this isn't up to date, install or update it" workflow, will false-pass on e.g. Firefox .deb builds. The version string is suffixed with e.g. ~build2, which version_compare() picks up, treating e.g. 129.0~build2 as >= 129.0.2. A host on that build then stays on the older version when the intent was to bump it to a newer one.

Given that this is the case on something as common as Firefox, we need to figure out a query that works for an instance like that and include that in obvious places in our documentation so admins have a quick fallback when the simpler query doesn't work correctly.
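To make the intended comparison concrete, here is a minimal Python sketch of the normalization a fallback query would need: drop the build suffix, then compare only the dotted numeric components. This is not osquery's version_compare() algorithm and not a proposed policy query. The helper names numeric_prefix and is_at_least are hypothetical, and the same idea would still have to be expressed in the policy's SQL.

```python
import re


def numeric_prefix(version: str) -> tuple[int, ...]:
    """Return the leading dotted-numeric part of a version as a tuple of ints.

    "129.0~build2" -> (129, 0); anything after the numeric prefix is ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric version prefix in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_at_least(installed: str, minimum: str) -> bool:
    """True if the installed version's numeric part is >= the minimum's."""
    a, b = numeric_prefix(installed), numeric_prefix(minimum)
    # Pad the shorter tuple with zeros so 129.0 compares as 129.0.0.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

With this normalization, is_at_least("129.0~build2", "129.0.2") is False, so the host would correctly fail the policy and get the update.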

Installation success doesn't trigger immediate policy query re-evaluation

Currently there is no connection client-side between a software install and the query whose result triggered the software install. This means that once a software install to mitigate a policy failure occurs, the policy will stay red until the normal osquery update cadence happens, even though the install status gets updated sooner from what I can tell (rendering the Fleet UI inconsistent).

Proposed mitigation here is to run the policy query post-install and phone home with that result, which would update policy compliance status. We can't just set the policy to passing on successful install because it's entirely possible to install a package that doesn't make the query pass.
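A rough sketch of that flow, in Python for brevity rather than as fleetd code: after a successful install, re-run the policy query and report the actual result instead of assuming a pass. Every name here (PolicyResult, recheck_policy_after_install, run_query, report) is hypothetical. The sketch also assumes the usual Fleet convention that a policy passes when its query returns at least one row.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class PolicyResult:
    policy_id: int
    passing: bool


def recheck_policy_after_install(
    policy_id: int,
    policy_query: str,
    install_succeeded: bool,
    run_query: Callable[[str], Sequence[dict]],  # hypothetical: runs SQL locally
    report: Callable[[PolicyResult], None],      # hypothetical: phones home
) -> Optional[PolicyResult]:
    """Re-evaluate the triggering policy right after an install attempt."""
    if not install_succeeded:
        # Nothing changed on the host; leave the regular cadence in charge.
        return None
    rows = run_query(policy_query)
    # Assumption: a policy passes when its query returns at least one row.
    result = PolicyResult(policy_id=policy_id, passing=len(rows) > 0)
    report(result)
    return result
```

The result comes from the query, not from the install status, so a package that installs cleanly but doesn't satisfy the policy still reports as failing.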

For what it's worth, having arbitrary post-install queries might be useful for other reasons, but I don't think exposing those would be in-scope here.

GitOps silently adds software-less policies if you indent the package wrong

Right now applying GitOps with an install_software policy key that is present but empty will silently apply the policy without an install action attached. This means that if e.g. I accidentally specify package_path at the wrong indentation level, I don't get any feedback that I did anything wrong at any point in the GitOps run (I made this mistake while QA'ing policy automation for software installation).

Proposed resolution is to be loud about these failures, as we should have enough information to determine intent here: if someone specs install_software: they want to know if they forgot an attachment.
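As an illustration of what "loud" could look like, here is a small Python sketch of the check, operating on a policy that has already been parsed from YAML into a dict. It is not fleetctl's actual validation code. The function name validate_policy and the error wording are made up for the example; only the install_software and package_path keys come from the description above.

```python
from typing import Any


def validate_policy(policy: dict[str, Any]) -> list[str]:
    """Return error messages for a policy whose install_software is unusable."""
    errors: list[str] = []
    if "install_software" not in policy:
        # No install automation requested; nothing to check.
        return errors
    name = policy.get("name", "<unnamed>")
    spec = policy["install_software"]
    # A present-but-empty key parses as None; a mis-indented package_path
    # ends up as a sibling key on the policy instead of inside the spec.
    if not isinstance(spec, dict) or not spec.get("package_path"):
        message = f"policy {name!r}: install_software is set but has no package_path"
        if "package_path" in policy:
            message += " (package_path found at the policy level; check indentation)"
        errors.append(message)
    return errors
```

A GitOps run could collect these messages across all policies and fail before applying anything, so the indentation mistake surfaces on the first attempt.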

Changes

Product

[ ] UI changes: No changes
[ ] CLI (fleetctl) usage changes: See third subheading
[ ] YAML changes: No changes
[ ] REST API changes: No changes
[ ] Fleet's agent (fleetd) changes: See second subheading
[ ] Activity changes: None
[ ] Permissions changes: None
[ ] Changes to paid features or tiers: None
[ ] Other reference documentation changes: Updates to "is current version installed" osquery examples
[ ] Once shipped, requester has been notified

Engineering

[ ] Feature guide changes: Include alternate queries that will e.g. work with Linux Firefox on 129.0~build2
[ ] Load testing: Test policy based software installs and confirm that the post-install queries land quickly at scale

ℹ️ Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

Requires load testing: Probably
Risk level: Low

Manual testing steps

TODO

Testing notes

Confirmation

[ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
[ ] QA (@____): Added comment to user story confirming successful completion of QA.

iansltx commented 1 month ago

The caveat on immediate policy update on successful execution/install also applies to script runs at this point. I think we'll need fleetd changes to allow evaluating the associated policy query immediately on script run, including tying the script back to the policy so we know which query to execute.

iansltx commented 1 month ago

Much more of a corner case: if an installer or script is changed, any installs/executions already queued (for the old script or software title) when the change happens won't be cleared. Script executions also snapshot the script at the time it was queued, so changing the script itself later (only doable via GitOps) won't change what gets executed. The exposure is small, since the window between when we're notified of a policy failure and when the script is shipped to the host to run is short (on the order of seconds).
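The snapshot behavior can be modeled in a few lines of Python. This is a toy in-memory illustration, not Fleet's queue or schema, and ScriptQueue, QueuedExecution, and their methods are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedExecution:
    host_id: int
    script_name: str
    contents: str  # copied when the execution is queued, not when it runs


class ScriptQueue:
    def __init__(self) -> None:
        self.scripts: dict[str, str] = {}
        self.pending: deque = deque()

    def save_script(self, name: str, contents: str) -> None:
        # Editing a script does not touch executions that are already queued.
        self.scripts[name] = contents

    def enqueue(self, host_id: int, name: str) -> None:
        # Snapshot the current contents into the queued execution.
        self.pending.append(QueuedExecution(host_id, name, self.scripts[name]))

    def next_execution(self) -> QueuedExecution:
        return self.pending.popleft()
```

Queuing an execution, then saving a new version of the script, and then dequeuing yields the old contents, which is the stale-execution window described above.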
