fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com
Other
3.17k stars 434 forks source link

Sharp edges cleanup for automated software installations based on policy failures #22302

Open iansltx opened 2 months ago

iansltx commented 2 months ago

Goal

User story
As an IT admin,
I want to have a smoother policy-based software install experience
so that I can be confident that my hosts are having software installed properly in response to failing policies.

Context

Cleanup items:

Version compare query footgun

Garden-variety version_compare queries, which is what we use as an example of how to build an "if this isn't up to date, install or update it" workflow, will false-pass on e.g. Firefox .deb builds. This is because the version string is suffixed with e.g. build2, which version_compare() will pick up and treat as if e.g. 129.0build2 is >= 129.0.2, leaving a user with that build on an older version when the intent was to bump to a newer one.

Given that this is the case on something as common as Firefox, we need to figure out a query that works for an instance like that and include that in obvious places in our documentation so admins have a quick fallback when the simpler query doesn't work correctly.

Installation success doesn't trigger immediate policy query re-evaluation

Currently there is no connection client-side between a software install and the query whose result triggered the software install. This means that once a software install to mitigate a policy failure occurs, the policy will stay red until the normal osquery update cadence happens, even though the install status gets updated sooner from what I can tell (rendering the Fleet UI inconsistent).

Proposed mitigation here is to run the policy query post-install and phone home with that result, which would update policy compliance status. We can't just set the policy to passing on sucessful install because it's entirely possible to install a package that doesn't make the query pass.

For what it's worth, having arbitrary post-install queries might be useful for other reasons, but I don't think exposing those would be in-scope here.

GitOps silently adds software-less policies if you indent the package wrong

Right now applying GitOps with an install_software policy key that is present but empty will silently apply the policy without an install action attached. This means that if e.g. I accidentally specify package_path at the wrong indentation level, I don't get any feedback that I did anything wrong at any point in the GitOps run (I made this mistake while QA'ing policy automation for software installation).

Proposed resolution is to be loud about these failures, as we should have enough information to determine intent here: if someone specs install_software: they want to know if they forgot an attachment.

Changes

Product

Engineering

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

Manual testing steps

TODO

Testing notes

Confirmation

  1. [ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. [ ] QA (@____): Added comment to user story confirming successful completion of QA.
iansltx commented 1 month ago

The caveat on immediate policy update on successful execution/install also applies to script runs at this point. I think we'll need fleetd changes to allow evaluating the associated policy query immediately on script run, including tying the script back to the policy so we know which query to execute.

iansltx commented 1 month ago

Much more of a corner case, but if an installer or script is changed, any installs/executions already queued (for the old script or software title) when the change happens won't be cleared. Script executions also snapshot the script at the time it was queued, so changing the script itself later (only doable via GitOps) won't change what gets executed. This is a bit of a corner case though, as there's a short window between when we're notified of a policy failure and when the script is shipped to the host to run (on the order of seconds).