fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Enable extension based remediation workflows for hosts that may be offline when queries are run #22221

Open ddribeiro opened 1 week ago

ddribeiro commented 1 week ago

Gong snippet: https://us-65885.app.gong.io/call?id=7283736297840441495&highlights=%5B%7B%22type%22%3A%22SHARE%22%2C%22from%22%3A1126%2C%22to%22%3A3694%7D%5D

Goals

Problem

customer-figali currently has a workflow using vanilla osquery in which they perform remediation actions on their hosts using osquery extensions. This workflow depends on targeting a host with a query and knowing that, if the host is offline, it will eventually run the query when it comes back online.

They would like to use Fleet's live queries to trigger this workflow, but if a host is offline when the live query is run, it will not run that query when it comes back online.

What have you tried?

customer-figali would like to use Fleet to run the queries but are running into the following issues:

  1. In their current form, live queries will not work, since an offline host will not run the query after the live query window closes. The existing workflow needs assurance that a query will eventually run when a host comes back online.
  2. Scheduled queries would be difficult to manage, as the queries used in the workflow are dynamic and are expected to change from host to host. The queries also need to be targeted at specific hosts. Running the query against all hosts on a team means remediation steps would be performed on hosts that were not meant to be targeted.

Potential solutions

There could be 2 potential paths to a solution:

  1. Fleet already has a queueing system that it uses for script execution. If an admin runs a script on a host that is offline, Fleet will queue that script and run it when the host comes online. The same functionality could be extended to live queries. This option would allow customer-figali to keep largely the same workflow and move it to Fleet.

  2. customer-figali would migrate their workflow to use scripts and take advantage of the queuing system for scripts that exists in Fleet today.

    • Using this method they would invoke osqueryi in the script, which would run the query and trigger the remediation extension.
    • This method would require some additional work in Fleet to address concerns customer-figali has about allowing their tools to run arbitrary code on their hosts.
    • Their Fleet instance would need to contain an allowlist of trusted scripts. Scripts could be added to the allowlist by anybody, most likely via a GitOps user.
    • Scripts would need to have the ability to be parameterized (#19582)
    • Execution of the scripts can only be performed by a service account that has the ability to run scripts but not add new ones.

Note: This method depends on customer-figali testing to make sure that invoking osqueryi from scripts works as expected.
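
To illustrate the behavior being asked for in option 1, here is a minimal Python sketch of a per-host pending queue: a query is recorded for one specific host and handed over the next time that host checks in. This is not Fleet's implementation. The class, method names, host IDs, and the `remediation_action` table are all hypothetical and exist only for illustration.

```python
from collections import defaultdict, deque


class PendingQueryQueue:
    """Hypothetical per-host queue; not Fleet's actual implementation."""

    def __init__(self):
        # host_id -> FIFO of SQL strings waiting for that host to check in
        self._pending = defaultdict(deque)

    def enqueue(self, host_id, sql):
        """Record a query for one specific host, whether or not it is online."""
        self._pending[host_id].append(sql)

    def on_check_in(self, host_id):
        """Return, in order, every query queued for the host and clear its queue."""
        queued = self._pending.pop(host_id, deque())
        return list(queued)


# Hypothetical usage: "remediation_action" stands in for a table provided
# by the customer's osquery extension.
queue = PendingQueryQueue()
queue.enqueue("host-a", "SELECT * FROM remediation_action WHERE id = 1;")
queue.enqueue("host-a", "SELECT * FROM remediation_action WHERE id = 2;")
queue.enqueue("host-b", "SELECT * FROM remediation_action WHERE id = 3;")

# host-a comes back online and receives only its own queued queries.
delivered = queue.on_check_in("host-a")
```

The key property is that queries are keyed by host, so a query aimed at one host is never delivered to another, and nothing is lost while the host is offline.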

What is the expected workflow as a result of your proposal?

As a result of this proposal, customer-figali would either:

  1. Use live queries to trigger a query on a host (with assurance that it will eventually run if the host is currently offline) and adapt their existing workflow to Fleet.
  2. Modify their existing workflow to be script-based and take advantage of the queuing capability that already exists in Fleet.

    • The customer would have a set list of allowed scripts that can be customized with parameters.
    • The scripts would be triggered with an account that only has permissions to run those trusted scripts.
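
The script-based option combines three controls: an allowlist of trusted scripts, parameters filled in at run time, and separate permissions for adding and running scripts. A rough Python sketch of how they could fit together is below. The role names, the script name, and every function here are assumptions made for illustration, not Fleet features or APIs. The only real tool referenced is `osqueryi` with its `--json` flag.

```python
import shlex

# Hypothetical roles: one may add scripts to the allowlist, the other may only run them.
ADD_ROLES = {"gitops"}
RUN_ROLES = {"script-runner"}

# Hypothetical allowlist: script name -> command template with named parameters.
allowed_scripts = {}


def add_script(role, name, template):
    """Add a trusted script template; only roles in ADD_ROLES may do this."""
    if role not in ADD_ROLES:
        raise PermissionError(f"role {role!r} may not add scripts")
    allowed_scripts[name] = template


def render_script(role, name, **params):
    """Build the command line for an allowlisted script with quoted parameters."""
    if role not in RUN_ROLES:
        raise PermissionError(f"role {role!r} may not run scripts")
    if name not in allowed_scripts:
        raise KeyError(f"script {name!r} is not on the allowlist")
    # Quote every parameter so a value cannot inject extra shell commands.
    quoted = {key: shlex.quote(value) for key, value in params.items()}
    return allowed_scripts[name].format(**quoted)


add_script("gitops", "run-remediation-query", "osqueryi --json {query}")
command = render_script("script-runner", "run-remediation-query", query="SELECT 1;")
```

With this split, the service account can choose parameters for an existing script but cannot introduce new code, which is the concern raised about arbitrary code execution.
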
ddribeiro commented 1 week ago

cc: @mikermcneil @noahtalerman