Orbit extensions autoupdate -- potential gotchas

sharvilshah commented 1 year ago

Goal

Given the complexity around Orbit now that it has a channel to Fleet to authenticate and enroll, we would like to list/document potential gotchas with the extensions autoupdate work. The main goal is to be aware of potential issues and mitigate them as best we can.

Scenario

Due to a bad config or a bad extension, osquery tries to load an extension and fails. Orbit restarts osquery, which fails again, and the host ends up in a crash loop.

Unlike the flags update, which checks for updates every 30 seconds, the extensions update check interval might be 1 hour or even longer. Thus hosts could remain offline/unrecoverable for an extended period of time.

Potential mitigation: Orbit to incorporate safety checks early, on encountering the extensions_autoload flag. This will include borrowing the safety checks that osquery does -- whether the path exists, whether the binary has safe permissions (note this might mean different things on different OSes), whether the ".ext" of the extension is correct, etc. Other substantial checks could include whether the extension file is a binary, and whether it's ELF/Mach-O based on the host OS (this could be a future enhancement, I don't know how difficult it is to determine binary headers in Go).

TBD: How does this work with the --allow_unsafe flag? Do we obey the flag and document that it can be potentially dangerous?
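On the question of reading binary headers in Go: a minimal sketch of the checks listed above, using only the standard library. The names checkExtension, detectFormat and expectedFormat are hypothetical and not part of Orbit. The permission test (not writable by group or others) is a simplified stand-in for osquery's own checks, and the Windows naming convention is not handled here.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"os"
	"runtime"
	"strings"
)

// binaryFormat names the executable container detected from a file's magic bytes.
type binaryFormat string

const (
	formatELF     binaryFormat = "elf"
	formatMachO   binaryFormat = "macho"
	formatPE      binaryFormat = "pe"
	formatUnknown binaryFormat = "unknown"
)

// detectFormat classifies a file by its leading magic bytes.
func detectFormat(header []byte) binaryFormat {
	switch {
	case bytes.HasPrefix(header, []byte{0x7f, 'E', 'L', 'F'}):
		return formatELF
	case bytes.HasPrefix(header, []byte{0xcf, 0xfa, 0xed, 0xfe}), // 64-bit Mach-O
		bytes.HasPrefix(header, []byte{0xce, 0xfa, 0xed, 0xfe}), // 32-bit Mach-O
		bytes.HasPrefix(header, []byte{0xca, 0xfe, 0xba, 0xbe}): // universal binary (Java class files share this magic)
		return formatMachO
	case bytes.HasPrefix(header, []byte("MZ")):
		return formatPE
	}
	return formatUnknown
}

// expectedFormat maps a GOOS value to the binary format native to that OS.
func expectedFormat(goos string) binaryFormat {
	switch goos {
	case "linux":
		return formatELF
	case "darwin":
		return formatMachO
	case "windows":
		return formatPE
	}
	return formatUnknown
}

// checkExtension runs basic sanity checks on an extension before it is handed to osquery.
func checkExtension(path, goos string) error {
	if !strings.HasSuffix(path, ".ext") {
		return fmt.Errorf("%s: missing .ext suffix", path)
	}
	info, err := os.Stat(path)
	if err != nil {
		return fmt.Errorf("stat extension: %w", err)
	}
	if !info.Mode().IsRegular() {
		return fmt.Errorf("%s: not a regular file", path)
	}
	// Assumption: "safe" on POSIX means not writable by group or others.
	if goos != "windows" && info.Mode().Perm()&0o022 != 0 {
		return fmt.Errorf("%s: writable by group or others", path)
	}
	f, err := os.Open(path)
	if err != nil {
		return fmt.Errorf("open extension: %w", err)
	}
	defer f.Close()
	header := make([]byte, 4)
	n, err := io.ReadFull(f, header)
	if err != nil && !errors.Is(err, io.EOF) && !errors.Is(err, io.ErrUnexpectedEOF) {
		return fmt.Errorf("read header: %w", err)
	}
	if got, want := detectFormat(header[:n]), expectedFormat(goos); got != want {
		return fmt.Errorf("%s: detected format %q, expected %q", path, got, want)
	}
	return nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: checkext <path-to-extension>")
		return
	}
	if err := checkExtension(os.Args[1], runtime.GOOS); err != nil {
		fmt.Println("unsafe extension:", err)
		os.Exit(1)
	}
	fmt.Println("extension passed basic checks")
}
```

The standard library's debug/elf and debug/macho packages can parse the full headers if a stricter check (for example, matching the CPU architecture) turns out to be needed.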

Scenario

An extension (either a new extension, or a new version of an existing extension), on startup, uses excessive CPU/memory (say to load something, or to load a bpf probe), causing it to be killed and reloaded, and entering a kill/reload loop. Note that extensions can be more than just "table extensions"; they can be logger plugins, config plugins, etc.

If the osquery process doesn't exit -- meaning osquery kills/reloads the extension until retries are exhausted -- Orbit wouldn't even know that extensions are getting killed.

Potential mitigation: Document and educate folks on the various extensions flags (timeouts, etc.) so they can tune this better, and make use of the osquery_extensions table.

Scenario

We don't know how people deploy their TUF servers.

Potential mitigation: We use our TUF server as a reference implementation.

I will update this as I keep digging.

sharvilshah commented 1 year ago

@zwass @lucasmrod @roperzh would love your input whenever you get a chance.

lucasmrod commented 1 year ago

Unlike the flags update, which checks for updates every 30 seconds, the extensions update check interval might be 1 hour or even longer. Thus hosts could remain offline/unrecoverable for an extended period of time.

Just like with flags, we'd do an "initial check of extensions".
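A minimal sketch of that pattern: run the check once immediately, then repeat it on the longer interval. The name pollExtensions and the stop channel are hypothetical; how Orbit actually schedules its flags check is not shown in this thread.

```go
package main

import (
	"fmt"
	"time"
)

// pollExtensions runs check once immediately and then once per interval
// until stop is closed.
func pollExtensions(interval time.Duration, stop <-chan struct{}, check func()) {
	check() // initial check, so a restarted Orbit does not wait a full interval

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			check()
		}
	}
}

func main() {
	stop := make(chan struct{})
	go func() {
		// Stop the demo after a few ticks.
		time.Sleep(350 * time.Millisecond)
		close(stop)
	}()

	runs := 0
	pollExtensions(100*time.Millisecond, stop, func() {
		runs++
		fmt.Println("checking for extension updates, run", runs)
	})
}
```

With the initial check in place, a host stuck in a crash loop would pick up a fixed extension the next time Orbit starts, rather than after the full interval.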

Potential mitigation: Orbit to incorporate safety checks early, on encountering the extensions_autoload flag. This will include borrowing the safety checks that osquery does -- whether the path exists, whether the binary has safe permissions (note this might mean different things on different OSes), whether the ".ext" of the extension is correct, etc. Other substantial checks could include whether the extension file is a binary, and whether it's ELF/Mach-O based on the host OS (this could be a future enhancement, I don't know how difficult it is to determine binary headers in Go).

Yes, Orbit should do the obvious sanity checks. E.g. when auto-updating targets in Orbit, we run them with --help to verify that they execute successfully on the host (see here). We'd run a list of checks on such extensions before allowing them to download.
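A minimal sketch of such a run check, with a timeout so that a hanging binary cannot block the update. probeBinary is a hypothetical name and this is not Orbit's actual implementation; whether a given extension exits cleanly when passed --help is an assumption that would need verifying per extension.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// probeBinary runs path with args and returns an error if the binary cannot
// be started, exits non-zero, or does not finish within timeout.
func probeBinary(path string, timeout time.Duration, args ...string) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	out, err := exec.CommandContext(ctx, path, args...).CombinedOutput()
	if err == nil {
		return nil
	}
	if errors.Is(ctx.Err(), context.DeadlineExceeded) {
		return fmt.Errorf("probe %s: timed out after %s", path, timeout)
	}
	return fmt.Errorf("probe %s: %w (output: %q)", path, err, out)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: probe <path-to-binary>")
		return
	}
	// Assumption: the binary exits with status 0 when given --help.
	if err := probeBinary(os.Args[1], 5*time.Second, "--help"); err != nil {
		fmt.Println("binary failed the run check:", err)
		os.Exit(1)
	}
	fmt.Println("binary ran successfully")
}
```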

TBD: How does this work with the --allow_unsafe flag? Do we obey the flag and document that it can be potentially dangerous?

Is there any reason for us to obey such a dangerous flag? At first, we could make Orbit ignore it and not allow such a thing, then wait for users to ask for it and hear their feedback as to why they need it.

If the osquery process doesn't exit -- meaning osquery kills/reloads the extension until retries are exhausted -- Orbit wouldn't even know that extensions are getting killed.

Orbit (or Fleet) could send the following query to osqueryd:

SELECT * FROM osquery_registry
    WHERE active = true
    AND internal = false
    AND registry = 'table';

to know if extensions are working as expected.
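A minimal sketch of how the result of that query could be compared against the tables Orbit expects its extensions to provide. The registryRow type and missingTables function are hypothetical, and fetching the rows from osqueryd is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// registryRow holds the osquery_registry columns used by the query above.
type registryRow struct {
	Registry string
	Name     string
	Active   bool
	Internal bool
}

// missingTables returns the expected table names that are not present in
// rows as active, non-internal table plugins, sorted alphabetically.
func missingTables(expected []string, rows []registryRow) []string {
	loaded := make(map[string]bool)
	for _, r := range rows {
		if r.Registry == "table" && r.Active && !r.Internal {
			loaded[r.Name] = true
		}
	}
	var missing []string
	for _, name := range expected {
		if !loaded[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Example rows; the table names are made up for illustration.
	rows := []registryRow{
		{Registry: "table", Name: "example_ext_table", Active: true},
		{Registry: "table", Name: "processes", Active: true, Internal: true},
	}
	expected := []string{"example_ext_table", "another_ext_table"}
	fmt.Println("missing extension tables:", missingTables(expected, rows))
}
```

Since the query filters on registry = 'table', a check like this would not notice a failing logger or config plugin extension; those would need their own registry values.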

lucasmrod commented 1 year ago

From https://osquery.readthedocs.io/en/stable/development/osquery-sdk/

There are two ways to run an extension: load the extension at an arbitrary time after shell or daemon execution, or request an "autoload" of extensions. The auto-loading method has several advantages, such as allowing dependencies on external config plugins and inheriting the same process monitoring as is applied to the osquery core worker processes.

Do we want to stick with auto-loaded extensions? Or can we consider non-auto-loaded extensions (like the way Orbit registers its extension tables)? One advantage of the latter is that if the extension fails, it might not bring osquery down (to be confirmed).
