Specify host identifier used by fleetd

ksatter commented 1 year ago

Goal

User story
As an Endpoint Engineer,
I want Orbit (one component of the fleetd agent) to enroll using the same host identifier as osquery (another component of fleetd)
so that I don't see duplicate hosts enroll to Fleet.

Changes

Product

[ ] CLI usage changes: Add --host-identifier flag to fleetctl package command. Sets host identifier used by all components of fleetd (osquery and Orbit). Options are uuid and instance. Default is uuid.
- This populates a new environment variable for fleetd: ORBIT_HOST_IDENTIFIER. This way, the user can update this env variable via an automation tool (e.g., Chef) and force the host to re-enroll without having to deploy a new package.
[ ] fleetd changes:
- [ ] fleetd is forced to use the --host-identifier value set during fleetctl package no matter what value is set in an osquery flagfile.
- [ ] fleetd is forced to use the --extensions-autoload value set by itself (orbit always sets this to /opt/orbit/extensions.autoload) no matter what value is set in an osquery flagfile or Fleet YAML.
[ ] API changes: Add validation to command_line_flags in config and team Fleet YAML so that Fleet returns an error if --host-identifier or --extensions-autoload are set.
- Use this error message: The [insert unsupported flag here] flag isn't supported. Please remove this flag.
[ ] UI changes: Make sure the new validation above is presented as an error notification on Organization settings > Agent options and Team details > Agent options pages.
[ ] Outdated documentation changes: Document the --host-identifier flag, its options, and the default in the fleetd configuration options section.
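
A minimal sketch of the command_line_flags validation described above, for illustration only. The function name validate_command_line_flags is hypothetical and this is not Fleet's server-side implementation; it only shows the intended behavior: reject host-identifier and extensions-autoload (in either dash or underscore spelling) and return the error message from this story.

```shell
# Hypothetical helper, not Fleet's implementation.
# Returns 1 and prints the error message on the first unsupported flag.
validate_command_line_flags() {
  for _flag in "$@"; do
    # Drop leading dashes and any "=value"; treat "_" and "-" alike.
    _name=$(printf '%s' "$_flag" | sed -e 's/^--//' -e 's/=.*$//' | tr '_' '-')
    case "$_name" in
      host-identifier|extensions-autoload)
        printf "The --%s flag isn't supported. Please remove this flag.\n" "$_name"
        return 1
        ;;
    esac
  done
  return 0
}
```

For example, validate_command_line_flags verbose=true host_identifier=instance would print the error for --host-identifier and return a non-zero status, while a list without either flag passes.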

Context

Today, Orbit (a component of fleetd) always uses the host's uuid as its host identifier. This isn't configurable.
A customer is setting the host identifier used by osquery (another component of fleetd) to instance. They're doing this using an osquery flagfile.
When osquery and Orbit use different host identifiers, more than one host (duplicates) enrolls to Fleet for the same machine.

QA

Risk assessment

Requires load testing: TODO
Risk level: Low / High TODO
Risk description: TODO

Manual testing steps

Unless stated otherwise, you need to use the Fleet version of PR #15570 (or main if it was already merged).
All test scenarios must be tested on the three OSs.
All test scenarios should be tested with MDM enabled and disabled.

First you will need to create a new local TUF repository.

# Use the PR's branch

git checkout 14879-fleetd-host-identifier

#
# You may need to modify the URLs depending on your host and VMs setup
#
# Note that Fleet Desktop is disabled (customer expecting this change doesn't use Fleet Desktop)
#

rm -rf test_tuf

SYSTEMS="macos windows linux" \
PKG_FLEET_URL=https://host.docker.internal:8080 \
PKG_TUF_URL=http://host.docker.internal:8081 \
DEB_FLEET_URL=https://host.docker.internal:8080 \
DEB_TUF_URL=http://host.docker.internal:8081 \
RPM_FLEET_URL=https://host.docker.internal:8080 \
RPM_TUF_URL=http://host.docker.internal:8081 \
MSI_FLEET_URL=https://host.docker.internal:8080 \
MSI_TUF_URL=http://host.docker.internal:8081 \
GENERATE_PKG=1 \
GENERATE_DEB=1 \
GENERATE_RPM=1 \
GENERATE_MSI=1 \
ENROLL_SECRET=uBtn8G3ONN9J2Ouib/t/yE0sa7w2iYNk \
INSECURE=1 \
./tools/tuf/test/main.sh

Testing scenarios for QA:

A. Test the new feature with two VMs that have the same hardware UUID and serial number. You can simulate this with the following steps:

Install a package that was generated with --host-identifier=instance, e.g. for macOS:

./build/fleetctl package --type=pkg --fleet-desktop --fleet-url=https://host.docker.internal:8080 --enroll-secret=uBtn8G3ONN9J2Ouib/t/yE0sa7w2iYNk --insecure --debug --update-roots=$(./build/fleetctl updates roots --path ./test_tuf) --update-interval=10s --disable-open-folder --update-url=http://host.docker.internal:8081 --host-identifier=instance

Check that the host is enrolled in the Fleet UI.
Uninstall the package.
Install the same package again.
You should now see two hosts with the same data in Fleet. (Because they now use instance as their identifier instead of the hardware UUID/serial.)

B. Test generating packages without --host-identifier=instance (basically default behavior). Test both Fleet with MDM enabled and disabled.

C. Test packages generated without --host-identifier=instance against the latest released Fleet (4.41.1). Hosts should enroll without issues.

D. Test customers upgrading the flag on already installed fleetd instances. (Meaning they won't re-install the package and instead set the orbit flag manually or via config management like Chef.)

This should be tested on Linux and Windows. Not on macOS. Make sure to have Fleet Desktop disabled.

Upgrade Fleet from fleet-v4.41.0 to this version (fleet-v4.42.0).
Install a package that was generated without --host-identifier with an old version of Orbit (e.g. latest released Orbit). You can do this by generating the TUF repository with main.sh on fleet-v4.41.1.
Check that the host is enrolled in the Fleet UI. MySQL: select id, osquery_host_id, hostname, uuid, hardware_serial from fleet.hosts; osquery_host_id should match uuid.
Publish this new version of orbit to your local TUF (GOOS=linux GOARCH=amd64 go build -o orbit-linux ./orbit/cmd/orbit && ./tools/tuf/test/push_target.sh linux orbit orbit-linux 43). Orbit should auto-update.
Stop the fleetd service: (e.g. on Linux sudo systemctl stop orbit)
Delete the host in Fleet.
Go to the orbit configuration and add the following variable to it: ORBIT_HOST_IDENTIFIER set to instance (on Linux it's /etc/default/orbit).
Start the service (on Linux sudo systemctl start orbit).
Host should be enrolled again but with the osquery identifier as osquery_host_id: MySQL: select id, osquery_host_id, hostname, uuid, hardware_serial from fleet.hosts; osquery_host_id should match what osquery reports in instance_id when running the query select instance_id from osquery_info;.
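
The configuration edit in the Linux steps above could be scripted roughly as follows. This is only a sketch: the function name set_orbit_host_identifier is hypothetical, and it assumes the orbit env file is a plain list of KEY=VALUE lines. It replaces any existing ORBIT_HOST_IDENTIFIER line instead of appending a second one, and only accepts the two values this story defines (uuid and instance).

```shell
# Hypothetical helper: set ORBIT_HOST_IDENTIFIER in an orbit env file.
# Usage: set_orbit_host_identifier <env-file> <uuid|instance>
# Assumes the env file holds one KEY=VALUE assignment per line.
set_orbit_host_identifier() {
  _env_file=$1
  _value=$2
  case "$_value" in
    uuid|instance) ;;
    *) echo "unsupported host identifier: $_value" >&2; return 1 ;;
  esac
  _tmp=$(mktemp) || return 1
  # Keep every line except a previous ORBIT_HOST_IDENTIFIER assignment.
  if [ -f "$_env_file" ]; then
    grep -v '^ORBIT_HOST_IDENTIFIER=' "$_env_file" > "$_tmp" || true
  fi
  echo "ORBIT_HOST_IDENTIFIER=$_value" >> "$_tmp"
  # Write back in place so the file keeps its owner and permissions.
  cat "$_tmp" > "$_env_file" || { rm -f "$_tmp"; return 1; }
  rm -f "$_tmp"
}
```

Run it as root between stopping and starting the orbit service, passing the orbit env file from the step above and the value instance.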

For Windows:

Use the Services app to stop/start the fleetd service
Use the Registry app to add the --host-identifier=instance option to orbit's invocation. (On my VM it's registry key: Computer\HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\Fleet osquery)

E. Test enrolling vanilla osquery against Fleet.

--

Other things to take a look at:

MDM manual and DEP enroll should be smoke tested.
We modified how orbit loads the hardware UUID on Windows. Orbit installation and enroll should be smoke tested on Windows 10 and 11.
Test fleetd's extensions by labels feature.

Testing notes

Confirmation

[ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
[x] QA (@xpkoala): Added comment to user story confirming successful completion of QA.

noahtalerman commented 1 year ago

@zayhanlon and @ksatter heads up, this story will be air guitar'd during the next design sprint.

noahtalerman commented 1 year ago

Notes from internal call: https://docs.google.com/document/d/1o8446eiAk-z2Bm_GSz0mZdb7CPTF0eymv-CtIm_VVxU/edit#heading=h.8oggoi13xg7y

More info in customer thread in Slack (internal): https://fleetdm.slack.com/archives/C03AE5T2EQ0/p1698348948683799

ksatter commented 1 year ago

@noahtalerman It could be possible to manage this through Agent options, but it would require changes to the enrollment process. I thought of two potential scenarios for that:

By default, fleetd gathers all of the values for different identifiers and passes them in the enroll request. When the request is received, Fleet checks config and uses the appropriate value
Do a pre-enrollment check-in. Grab osquery and fleetd flags at this point, then send an enroll request with the appropriate data

noahtalerman commented 1 year ago

Option 1

[ ] CLI usage changes: Add --host-identifier flag to fleetctl package command. Sets host identifier used by all components of fleetd (osquery and Orbit).
- Options are provided, uuid, hostname, or instance. Default is uuid. (options match osquery_host_identifier)
[ ] Outdated documentation changes: Document the --host-identifier flag, its options, and the default in the fleetd configuration options section.

Option 2

fleetd picks up whatever osquery has set as host-identifier. Or, when you specify host-identifier, it sets it for both fleetd and osquery.

Michael: This way, we could continue using the same workflow: give teams osquery flag file w/ host-identifier specified. If this won't work, fall back to option 1.

How? Fleetd starts osquery and runs this query to get host_identifier:

select value, instance_id from osquery_flags JOIN osquery_info where name = 'host_identifier';
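
To make the idea concrete: the query returns the configured host_identifier mode (value) and osquery's instance_id, and fleetd would then pick the matching identifier for its own enroll request. A rough sketch of that selection, assuming only the uuid and instance modes are handled; the function name pick_host_identifier and its arguments are hypothetical and not part of orbit:

```shell
# Hypothetical helper: choose the identifier Orbit would enroll with.
# Usage: pick_host_identifier <mode> <hardware-uuid> <instance-id>
#   <mode> is the "value" column from the query above.
pick_host_identifier() {
  case "$1" in
    instance) printf '%s\n' "$3" ;;
    uuid|"")  printf '%s\n' "$2" ;;   # empty mode falls back to the uuid default
    *) echo "unsupported host_identifier: $1" >&2; return 1 ;;
  esac
}
```

Other osquery host_identifier modes (e.g. hostname) are rejected here on purpose, since this sketch only covers the two options the story ends up supporting.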

Noah: If we can, I think the flagfile, if present, is the source of truth for osquery flags for the customer. The osquery flagfile overrides all osquery flags set remotely (agent_options.command_line_flags in Fleet YAML). We don't document this. Using Fleet YAML is best practice.

Noah: Can we make this work? Zach: Yes but using an osquery flagfile may be breaking the extensions loading because Fleet needs to be writing to a flagfile. Might be a reason to do option 1.

TODO Noah: Ask support if setting an osquery flagfile breaks managing extensions remotely.

Why? We're not certain that Orbit and osquery having the same host identifier will resolve the problem with receiving extensions.

noahtalerman commented 1 year ago

Future problem:

As an Endpoint Engineer, I want to be able to specify --tls_client_cert, --tls_client_key, and --watchdog_** flags via fleetctl package command or Fleet YAML so that I don't have to use an osquery flagfile.

UPDATE: The Fleet YAML already supports --fleet-tls-client-certificate, --fleet-certificate, and --watchdog_** flags.

The customer can use these instead of osquery flagfile.

The customer can set different values for these flags in each team in Fleet.

Are teams granular enough?

(2023-12-01)

Today, the customer is setting these flags in the osquery flagfile and updating them remotely via Chef.

My guess is that they can't use Fleet YAML to update these remotely because different hosts need different values for these flags. Targeting based on teams isn't sufficient (not granular enough) because the customer uses teams for a rollout use case (staging and production).

The likely solution to this problem is to allow different agent options based on label.

noahtalerman commented 12 months ago

@lucasmrod TODOs are in this Google doc: https://docs.google.com/document/d/187PA5ctmIFjD8-HLkAEYXjf0OOL61bvGhNnu-2VZa24/edit

noahtalerman commented 11 months ago

[ ] fleetd changes:

[ ] fleetd is forced to use the --host_identifier value set during fleetctl package no matter what value is set in an osquery flagfile.

[ ] fleetd is forced to use the --extensions-autoload value set by agent_options.extensions in Fleet YAML no matter what value is set in an osquery flagfile.

@lucasmrod here's how I summarized the fleetd changes based on discussion in this Google doc.

What do you think?

lucasmrod commented 11 months ago

fleetd is forced to use the --extensions-autoload value set by agent_options.extensions in Fleet YAML no matter what value is set in an osquery flagfile.

Should be something like:

fleetd is forced to use the --extensions-autoload value set by itself (orbit always sets this to /opt/orbit/extensions.autoload) no matter what value is set in an osquery flagfile or Fleet YAML.

lucasmrod commented 11 months ago

No migration needed for existing hosts. User will have to reinstall package to use this feature.

Once released, users that are hitting this 2-hosts-as-1 bug will have to:

Remove the hosts with the 2-hosts-as-1-bug from Fleet.
Re-generate the package with the new fleetctl version (with fleetctl package --host_identifier=instance [...]) and re-install such package on the hosts.

noahtalerman commented 11 months ago

Remove the hosts with the 2-hosts-as-1-bug from Fleet.

@lucasmrod the customer has already deleted the Orbit enrolled host record in Fleet.

If I'm understanding correctly, they will have to delete the osquery enrolled host too?

If they want to, can they delete the osquery enrolled host after they install the new fleetd w/ --host-identifier flag? (step 2)

JoStableford commented 11 months ago

Related to a Slack conversation

lucasmrod commented 11 months ago

@noahtalerman Am changing the env var from HOST_IDENTIFIER to ORBIT_HOST_IDENTIFIER (all orbit variables have the ORBIT_ prefix).

lucasmrod commented 11 months ago

fleetd is forced to use the --host_identifier value set during fleetctl package no matter what value is set in an osquery flagfile.

Am also editing --host_identifier to --host-identifier.

lucasmrod commented 11 months ago

@xpkoala QA steps were added to the description.

fleet-release commented 10 months ago

Unified identifiers, No more duplicates in sight. Fleet's path becomes clear.
