Closed: lucasmrod closed this 2 days ago
@zwass @lukeheath This fix is to reduce the impact of the P0 #22687, so that an expired root signature does not bring existing fleetd instances down (and only impacts package generation).
@lucasmrod, I removed the :reproduce label since I assume you did it. Put it back if it's needed.
Ideally, fleetd should report this to the Fleet server, and the server should have some way to notify admins that it's happening.
Hey team! Please add your planning poker estimate with Zenhub @lucasmrod @mostlikelee
@zwass would it make sense to open another ticket for this, to reduce the scope here?
@lucasmrod Thanks for filing this!
@sharon-fdm Assigning over to you to prioritize. Let's keep the scope to only the fix for now, and log an error to the orbit logs. Notifying the server is a good idea, but would add enough scope to delay the fix.
@lucasmrod We received a suggestion to consider how the agent can continue to operate if the TUF server is unavailable altogether using a last known good configuration. I wanted to share the thought with you to consider as you're working on the immediate reboot loop issue.
The current code actually does make some attempt to proceed if it can't update the metadata: https://github.com/fleetdm/fleet/blob/c4c8efb5b18268868bb98b9143927c43bce2ef95/orbit/cmd/orbit/orbit.go#L498-L500
Unfortunately, the error that we saw in this incident happens later, and we return it up the stack, which causes Orbit to exit.
2024-10-07T12:27:33-07:00 INF update metadata. using saved metadata error="update metadata: tuf: failed to decode root.json: expired at 2024-10-06 17:47:49 +0000 UTC"
2024-10-07T12:27:33-07:00 ERR run orbit failed error="target orbit lookup: lookup orbit: expired at 2024-10-06 17:47:49 +0000 UTC"
From there, the code calls a function (which even has some handling for one kind of error), which calls another function, which in turn calls the function that gets the error from the TUF library when the metadata is expired.
The real trick here will be deciding the right place to handle errors, which errors to handle, and how to handle them, so that we don't end up masking issues where the system is not properly initialized.
@xpkoala Added QA notes.
Tested all the paths outlined and saw no issues with the connection staying alive. Tested against
Expired TUF, no fear,
Fleetd starts, issues it clears,
Ensured uptime, dear.
fleetd version: all.
This bug is related to the P0 #22687.
💥 Actual behavior
Currently, if Fleet's TUF root.json signature is expired, fleetd won't start up (it exits early, over and over, because it runs as a service). The following logs repeated over and over:
This caused hosts that restarted after the expiration event to be offline.
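On a Linux host, one convenient way to watch the restart loop is to follow the orbit unit's journal (the unit name matches the systemctl restart command used in the repro steps below); this is only an observation aid, not part of the original report:
# follow fleetd (orbit) logs while the service crash-loops
sudo journalctl -u orbit -f
# confirm systemd keeps restarting the unit
sudo systemctl status orbit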
🌷 Expected behavior
If Fleet's TUF root signature is expired, fleetd should log an error and continue its execution (not exit).
🧑‍💻 Steps to reproduce
Reproduce bug
Follow the steps below to generate a local TUF repository with an expired root.json and reproduce the issue (to reproduce, main or whatever branch you have checked out should not have the fleetd fix for this issue).
Change the following variable in ee/fleetctl/updates.go:
from:
keyExpirationDuration = 10 * 365 * 24 * time.Hour
to:
keyExpirationDuration = 30 * time.Minute
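If your local TUF tooling doesn't rebuild fleetctl for you, rebuild it after changing the constant so the shortened expiration is actually used when generating the repository. Assuming the repo's standard Makefile target (an assumption, adjust to however you normally produce ./build/fleetctl):
# rebuild fleetctl so the new keyExpirationDuration takes effect
make fleetctl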
Build the local TUF repository and generate packages:
SYSTEMS="macos windows linux" \ PKG_FLEET_URL=https://host.docker.internal:8080 \ PKG_TUF_URL=http://host.docker.internal:8081 \ DEB_FLEET_URL=https://host.docker.internal:8080 \ DEB_TUF_URL=http://host.docker.internal:8081 \ MSI_FLEET_URL=https://host.docker.internal:8080 \ MSI_TUF_URL=http://host.docker.internal:8081 \ GENERATE_PKG=1 \ GENERATE_DEB=1 \ GENERATE_MSI=1 \ ENROLL_SECRET=... \ FLEET_DESKTOP=1 \ USE_FLEET_SERVER_CERTIFICATE=1 \ ./tools/tuf/test/main.sh
./build/fleetctl package --type=pkg --fleet-url=https://host.docker.internal:8080 --enroll-secret=<...> --update-roots=$(./build/fleetctl updates roots --path ./test_tuf) --disable-open-folder --update-interval=1m --debug --update-url=http://host.docker.internal:8081 --enable-scripts
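To get a host into the affected state, install the generated package on a test machine before the root expires. For example, on macOS (the fleet-osquery.pkg filename is an assumption; use whatever filename fleetctl package actually produced):
# install the generated macOS package
sudo installer -pkg fleet-osquery.pkg -target /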
sudo systemctl restart orbit
sudo launchctl unload /Library/LaunchDaemons/com.fleetdm.orbit.plist && sudo launchctl load /Library/LaunchDaemons/com.fleetdm.orbit.plist
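On Windows, the equivalent is restarting the fleetd service from an elevated prompt; the service name below is an assumption, so check services.msc for the exact name on your host:
REM restart fleetd on Windows (service name assumed)
net stop "Fleet osquery"
net start "Fleet osquery"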
Test that restarting fleetd works OK.
Then push an update to the edge channel for the three components and check that they are auto-updated (a sketch of the push command is below):
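A sketch of pushing one component (orbit here) to the edge channel of the local repository with fleetctl; the flag names, paths, and version below are from memory and should be treated as assumptions, so double-check them against ./build/fleetctl updates add --help:
# push a locally built orbit to the edge channel of the local TUF repo
./build/fleetctl updates add \
  --path ./test_tuf \
  --target ./build/orbit \
  --platform macos \
  --name orbit \
  --version 1.2.3 \
  -t edge
# repeat for osqueryd and desktop, and for the other platforms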
B. TUF server down
The TUF server being down should cause no issues for fleetd (just errors in the fleetd logs):
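How you bring the TUF server down depends on how main.sh is serving it on your machine; a generic approach (assuming a local process is listening on port 8081, per the TUF URLs above) is to find and stop that listener:
# find whatever is serving the local TUF repository on port 8081
sudo lsof -nP -iTCP:8081 -sTCP:LISTEN
# stop it (replace <pid> with the PID from the output above, or stop the container if it's dockerized)
kill <pid>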
To restart the TUF server:
Then restart fleetd; again there should be no issues, just errors in the fleetd logs.
C. Test expirations
C.1.1 Timestamp role expired
We can reuse any running local TUF repository and run the following to expire the timestamp role signature:
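One way to do this (a sketch, not the exact command from the original steps): shorten the timestamp expiration constant in ee/fleetctl/updates.go (the constant's name is an assumption), rebuild fleetctl, re-sign the timestamp, and wait for it to lapse. The updates timestamp subcommand re-signs the timestamp role for the repository at --path:
# re-sign the timestamp role of the local repository (after shortening its expiration and rebuilding fleetctl)
./build/fleetctl updates timestamp --path ./test_tuf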
C.1.2 Install fleetd package with expired timestamp
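For example, on an Ubuntu host, install the generated .deb (the filename pattern is an assumption; use whatever main.sh / fleetctl package produced):
# install the generated Linux package
sudo dpkg -i fleet-osquery_*.deb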
fleetd should install successfully and work as usual (just errors in the logs).
C.2.1 Snapshot role expired
We'll rebuild the TUF repository to avoid error-prone steps when testing snapshot expiration. You will need to re-install packages on the three OSs.
By rebuild I mean start from scratch:
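A sketch of what starting from scratch means here, assuming the repository lives in ./test_tuf (the --path used earlier) and that you stop whatever is serving it first:
# stop the local TUF server (however it was started), then wipe the repository
rm -rf ./test_tuf
# also remove previously generated packages so stale ones don't get installed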
Run the same main.sh invocation but with the following additional variable:
This means the snapshot role will expire after 10m (10m to give time for the packages to be generated).
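To confirm when the snapshot metadata actually expires, you can read its expires field; the URL and path below assume the repository is reachable from the host on port 8081 and that metadata is served at the repository root, so adjust as needed:
# print the snapshot role's expiration timestamp
curl -s http://localhost:8081/snapshot.json | jq -r '.signed.expires'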
Wait for the snapshot role to expire; agents should continue to work (just error logs about the snapshot being expired).
Restart fleetd and make sure it's up and running after the restart, with just the TUF error in the logs.
C.2.2 Snapshot role is fixed
Fix the snapshot expiration by rotating the root key:
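A sketch of the rotation, assuming fleetctl's updates rotate subcommand takes the repository --path and the role name (verify with ./build/fleetctl updates rotate --help); the same invocation applies to the targets and root cases below:
# rotate the root key; per these steps, this also refreshes the expired metadata
./build/fleetctl updates rotate --path ./test_tuf root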
Agents should restart after some time and then continue to work as usual (no more TUF error logs).
C.3.1 Targets role expired
We'll rebuild the TUF repository to avoid error-prone steps when testing targets expiration. You will need to re-install packages on the three OSs.
By rebuild I mean start from scratch:
Run the same main.sh invocation but with the following additional variable:
This means the targets role will expire after 10m (10m to give time for the packages to be generated).
Wait for the targets role to expire; agents should continue to work (just error logs about the targets being expired).
Restart fleetd and make sure it's up and running after the restart, with just the TUF error in the logs.
C.3.2 Targets role is fixed
Fix the targets expiration by rotating the root key:
Agents should restart after some time and then continue to work as usual (no more TUF error logs).
C.4.1 Root role expired
We'll rebuild the TUF repository to avoid error-prone steps when testing root expiration. You will need to re-install packages on the three OSs.
By rebuild I mean start from scratch:
Run the same main.sh invocation but with the following additional variable:
This means the root role will expire after 10m (10m to give time for the packages to be generated).
Wait for the root role to expire; agents should continue to work (just error logs about the root being expired).
Restart fleetd and make sure it's up and running after the restart, with just the TUF error in the logs.
C.4.2 Root role is fixed
Fix the root expiration by rotating the root key:
Agents should restart after some time and then continue to work as usual (no more TUF error logs).
C.5.1 Test with everything expired
We'll rebuild the TUF repository to avoid error-prone steps when testing expiration of all roles. You will need to re-install packages on the three OSs.
By rebuild I mean start from scratch:
Run the same main.sh invocation but with the following additional variables:
This means all roles will expire after 10m (10m to give time for the packages to be generated).
Wait for all roles to expire; agents should continue to work (just error logs about the roles being expired).
Restart fleetd and make sure it's up and running after the restart, with just the TUF error in the logs.