Enable extension-based remediation workflows for hosts that may be offline when queries are run

ddribeiro commented 2 months ago

Gong snippet: https://us-65885.app.gong.io/call?id=7283736297840441495&highlights=%5B%7B%22type%22%3A%22SHARE%22%2C%22from%22%3A1126%2C%22to%22%3A3694%7D%5D

Goals

Have an option in Fleet to prevent arbitrary scripts from being run
Ability to have a service account that can execute scripts but can't add new ones.
Ability to parameterize scripts (#19582)

Problem

customer-figali currently has a workflow using vanilla osquery in which they perform remediation actions on their hosts using osquery extensions. This workflow depends on targeting a host with a query and knowing that, if the host is offline, it will eventually run the query when it comes back online.

They would like to use Fleet's live queries to trigger this workflow, but if a host is offline when the live query is run, it will not run that query when it comes back online.

What have you tried?

customer-figali would like to use Fleet to run the queries but are running into the following issues:

In its current form, live queries will not work since an offline host will not run the query after the live query window closes. The existing workflow needs assurances that a query will eventually run when a host comes back online.
Scheduled queries would be difficult to manage, as the queries used in the workflow are dynamic and are expected to change from host to host. The queries also need to be targeted at specific hosts. Running the query against all hosts on a team would mean remediation steps are performed on hosts that were not meant to be targeted.

Potential solutions

There could be 2 potential paths to a solution:

1. Fleet already has a queueing system that it uses for script execution. If an admin runs a script on a host that is offline, Fleet will queue that script and run it when the host comes online. The same functionality could be extended to live queries. This option would allow customer-figali to generally keep their existing workflow and move it to Fleet.
2. customer-figali would migrate their workflow to use scripts and take advantage of the queueing system for scripts that exists in Fleet today.
   - Using this method, they would invoke osqueryi in the script, which would run the query and trigger the remediation extension.
   - This method would require some additional work in Fleet to address concerns customer-figali has about allowing their tools to run arbitrary code on their hosts.
   - Their Fleet instance would need to contain an allowlist of trusted scripts. Scripts can be added by anybody, most likely via a GitOps user.
   - Scripts would need to have the ability to be parameterized (#19582).
   - Execution of the scripts can only be performed by a service account that has the ability to run scripts but not add new ones.

Note: This method is dependent on customer-figali testing to make sure invoking osqueryi using scripts works as expected
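To make the first path more concrete, here is a minimal sketch of what "queue a live query until the host checks in" could look like. It is only an illustration of the idea, not Fleet's actual queueing code: the `OfflineQueryQueue` class, its method names, and the host IDs are all hypothetical.

```python
from __future__ import annotations

from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class PendingQuery:
    query_id: int
    sql: str


class OfflineQueryQueue:
    """Hypothetical per-host queue; an assumption, not Fleet's implementation."""

    def __init__(self) -> None:
        self._pending: dict[str, deque[PendingQuery]] = defaultdict(deque)
        self._next_id = 1

    def enqueue(self, host_id: str, sql: str) -> int:
        """Queue a query for one specific host, online or not."""
        query_id = self._next_id
        self._next_id += 1
        self._pending[host_id].append(PendingQuery(query_id, sql))
        return query_id

    def on_check_in(self, host_id: str) -> list[PendingQuery]:
        """Return everything queued for this host in FIFO order and clear its queue."""
        return list(self._pending.pop(host_id, deque()))
```

The point of the sketch is that queries are stored per host rather than per team, so a query targeted at one offline host is delivered only to that host, and only once, when it next checks in.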

What is the expected workflow as a result of your proposal?

As a result of this proposal, `customer-figali` would either:

1. Use live queries to trigger a query on a host (and have assurance that it will eventually run if the host is currently offline) and adapt their existing workflow to Fleet.
2. Modify their existing workflow to be script-based and take advantage of the queueing capability that already exists in Fleet.
   - The customer would have a set list of allowed scripts that can be customized with parameters.
   - The scripts would be triggered with an account that only has permissions to run those trusted scripts.
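For the script-based path, the combination of an allowlist, parameters, and a run-only account could be sketched roughly as follows. Everything here is an assumption for illustration: the `Account` fields, the `remediate` script name, the `remediation_actions` table, and the parameter names do not come from Fleet or from the customer's extension. The only real piece is that `osqueryi --json "<query>"` runs a single query and prints JSON.

```python
import re
from dataclasses import dataclass
from string import Template

# Hypothetical allowlist: script name -> command template with named parameters.
ALLOWED_SCRIPTS = {
    "remediate": Template(
        'osqueryi --json "SELECT * FROM $table WHERE id = $item_id;"'
    ),
}

# Only plain identifiers and numbers are accepted as parameter values.
SAFE_PARAM = re.compile(r"[A-Za-z0-9_]+")


@dataclass(frozen=True)
class Account:
    name: str
    can_run_scripts: bool
    can_add_scripts: bool


def render_script(account: Account, script_name: str, params: dict) -> str:
    """Fill in an allowlisted script; reject unknown scripts and unsafe values."""
    if not account.can_run_scripts:
        raise PermissionError(f"{account.name} may not run scripts")
    template = ALLOWED_SCRIPTS.get(script_name)
    if template is None:
        raise ValueError(f"script {script_name!r} is not on the allowlist")
    for key, value in params.items():
        if not SAFE_PARAM.fullmatch(str(value)):
            raise ValueError(f"unsafe value for parameter {key!r}")
    return template.substitute(params)


def add_script(account: Account, name: str, body: str) -> None:
    """Add a script to the allowlist; run-only accounts are refused."""
    if not account.can_add_scripts:
        raise PermissionError(f"{account.name} may not add scripts")
    ALLOWED_SCRIPTS[name] = Template(body)
```

A run-only service account would then be something like `Account("svc", can_run_scripts=True, can_add_scripts=False)`: it can call `render_script` for allowlisted names but gets a `PermissionError` from `add_script`.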

ddribeiro commented 2 months ago

cc: @mikermcneil @noahtalerman

JoStableford commented 2 months ago

Linked to Unthread ticket:

Inquiry about running live queries via REST API (#2789)

bgirardeau-figma commented 2 months ago

Interested in this ticket, I'd definitely prefer option (1) and am not sure I would use option (2):

Fleet already has a queueing system that it uses for script execution. If an admin runs a script on a host that is offline, Fleet will queue that script and run it when the host comes online. The same functionality could be extended to live queries. This option would allow customer-figali to generally use their same workflow and move it to Fleet.

The advantage of this is to be able to leverage native osquery live querying functionality, which already provides a way to run scoped queries. I would like it to be easy for customers to leverage the full capabilities built in to osquery, without having to enable Fleet to run arbitrary code on devices (through scripts) at all.

If Fleet is going to offer the live query capability at all (which I think it should and appreciate!), it makes sense to me to fold it into Fleet's queuing system as a new type of scheduled action -- the idea being that not all scheduled actions on a host need to be expressed as a Bash script.

Generally nervous about increasing reliance on Bash scripts overall, which seem harder to write robustly and correctly than SQL queries to osquery (or native osquery extension code which can be written in Go, for example).
