Closed: lucasmrod closed this 2 days ago
@zwass @lukeheath This fix is to reduce the impact of the P0 #22687, so that an expired root signature does not bring existing fleetd instances down (and only impacts package generation).
@lucasmrod, I removed the :reproduce label since I assume you did it. Put it back if it's needed.
Ideally, fleetd should report this to the Fleet server, and the server should have some way to notify admins that it's happening.
Hey team! Please add your planning poker estimate with Zenhub @lucasmrod @mostlikelee
@zwass would it make sense to open another ticket for this, to reduce the scope here?
@lucasmrod Thanks for filing this!
@sharon-fdm Assigning over to you to prioritize. Let's keep the scope to only the fix for now, and log an error to the orbit logs. Notifying the server is a good idea, but would add enough scope to delay the fix.
@lucasmrod We received a suggestion to consider how the agent can continue to operate if the TUF server is unavailable altogether using a last known good configuration. I wanted to share the thought with you to consider as you're working on the immediate reboot loop issue.
The current code actually does make some attempt to proceed if it can't update the metadata: https://github.com/fleetdm/fleet/blob/c4c8efb5b18268868bb98b9143927c43bce2ef95/orbit/cmd/orbit/orbit.go#L498-L500
Unfortunately, the error that we saw in this incident happens later, and we return it up the stack, which causes Orbit to exit.
2024-10-07T12:27:33-07:00 INF update metadata. using saved metadata error="update metadata: tuf: failed to decode root.json: expired at 2024-10-06 17:47:49 +0000 UTC"
2024-10-07T12:27:33-07:00 ERR run orbit failed error="target orbit lookup: lookup orbit: expired at 2024-10-06 17:47:49 +0000 UTC"
From there, the code calls a function (which even has some handling for one kind of error), which calls another function, which in turn calls the function that gets the error from the TUF library when the metadata is expired.
The real trick here will be deciding the right place to handle errors, which errors to handle, and how to handle them, so that we don't end up masking issues where the system is not properly initialized.
@xpkoala Added QA notes.
Tested all the paths outlined and saw no issues with the connection staying alive. Tested against
Expired TUF, no fear,
Fleetd starts, issues it clears,
Ensured uptime, dear.
fleetd version: all.
This bug is related to the P0 #22687.
💥 Actual behavior
Currently, if Fleet's TUF root.json signature is expired, fleetd won't start up (it exits early, over and over, because it runs as a service). The following logs repeated over and over:
This caused hosts that restarted after the expiration event to be offline.
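On a Linux host, one convenient way to watch the restart loop is to follow the orbit unit's journal (the unit name matches the systemctl restart command used in the repro steps below); this is only an observation aid, not part of the original report:
# follow fleetd (orbit) logs while the service crash-loops
sudo journalctl -u orbit -f
# confirm systemd keeps restarting the unit
sudo systemctl status orbit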
🌷 Expected behavior
If Fleet's TUF root signature is expired, fleetd should log an error and continue its execution (not exit).
🧑‍💻 Steps to reproduce
Reproduce bug
Follow the steps below to generate a local TUF repository with an expired root.json and reproduce the issue (to reproduce, main or whatever branch you have checked out should not have the fleetd fix for this issue).
Change the following variable in ee/fleetctl/updates.go:
from:
keyExpirationDuration = 10 * 365 * 24 * time.Hour
to:
keyExpirationDuration = 30 * time.Minute
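If your local TUF tooling doesn't rebuild fleetctl for you, rebuild it after changing the constant so the shortened expiration is actually used when generating the repository. Assuming the repo's standard Makefile target (an assumption, adjust to however you normally produce ./build/fleetctl):
# rebuild fleetctl so the new keyExpirationDuration takes effect
make fleetctl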
Build the local TUF repository and generate packages:
SYSTEMS="macos windows linux" \ PKG_FLEET_URL=https://host.docker.internal:8080 \ PKG_TUF_URL=http://host.docker.internal:8081 \ DEB_FLEET_URL=https://host.docker.internal:8080 \ DEB_TUF_URL=http://host.docker.internal:8081 \ MSI_FLEET_URL=https://host.docker.internal:8080 \ MSI_TUF_URL=http://host.docker.internal:8081 \ GENERATE_PKG=1 \ GENERATE_DEB=1 \ GENERATE_MSI=1 \ ENROLL_SECRET=... \ FLEET_DESKTOP=1 \ USE_FLEET_SERVER_CERTIFICATE=1 \ ./tools/tuf/test/main.sh
./build/fleetctl package --type=pkg --fleet-url=https://host.docker.internal:8080 --enroll-secret=<...> --update-roots=$(./build/fleetctl updates roots --path ./test_tuf) --disable-open-folder --update-interval=1m --debug --update-url=http://host.docker.internal:8081 --enable-scripts
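To get a host into the affected state, install the generated package on a test machine before the root expires. For example, on macOS (the fleet-osquery.pkg filename is an assumption; use whatever filename fleetctl package actually produced):
# install the generated macOS package
sudo installer -pkg fleet-osquery.pkg -target /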
sudo systemctl restart orbit
sudo launchctl unload /Library/LaunchDaemons/com.fleetdm.orbit.plist && sudo launchctl load /Library/LaunchDaemons/com.fleetdm.orbit.plist
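On Windows, the equivalent is restarting the fleetd service from an elevated prompt; the service name below is an assumption, so check services.msc for the exact name on your host:
REM restart fleetd on Windows (service name assumed)
net stop "Fleet osquery"
net start "Fleet osquery"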
Test that restarting fleetd works OK.
Then push an update to the edge channel for the three components and check that they are auto-updated (a sketch of the push command is below):
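A sketch of pushing one component (orbit here) to the edge channel of the local repository with fleetctl; the flag names, paths, and version below are from memory and should be treated as assumptions, so double-check them against ./build/fleetctl updates add --help:
# push a locally built orbit to the edge channel of the local TUF repo
./build/fleetctl updates add \
  --path ./test_tuf \
  --target ./build/orbit \
  --platform macos \
  --name orbit \
  --version 1.2.3 \
  -t edge
# repeat for osqueryd and desktop, and for the other platforms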
B. TUF server down
The TUF server being down should cause no issues for fleetd (just errors in the fleetd logs):
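How you bring the TUF server down depends on how main.sh is serving it on your machine; a generic approach (assuming a local process is listening on port 8081, per the TUF URLs above) is to find and stop that listener:
# find whatever is serving the local TUF repository on port 8081
sudo lsof -nP -iTCP:8081 -sTCP:LISTEN
# stop it (replace <pid> with the PID from the output above, or stop the container if it's dockerized)
kill <pid>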
To restart the TUF server:
Then restart fleetd; again there should be no issues, just errors in the fleetd logs.
C. Test expirations
C.1.1 Timestamp role expired
We can reuse any running local TUF repository and run the following to expire the timestamp role signature:
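One way to do this (a sketch, not the exact command from the original steps): shorten the timestamp expiration constant in ee/fleetctl/updates.go (the constant's name is an assumption), rebuild fleetctl, re-sign the timestamp, and wait for it to lapse. The updates timestamp subcommand re-signs the timestamp role for the repository at --path:
# re-sign the timestamp role of the local repository (after shortening its expiration and rebuilding fleetctl)
./build/fleetctl updates timestamp --path ./test_tuf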
C.1.2 Install fleetd package with expired timestamp
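For example, on an Ubuntu host, install the generated .deb (the filename pattern is an assumption; use whatever main.sh / fleetctl package produced):
# install the generated Linux package
sudo dpkg -i fleet-osquery_*.deb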
fleetd should install successfully and work as usual (just errors in the logs).
C.2.1 Snapshot role expired
We'll rebuild the TUF repository to avoid error-prone steps when testing snapshot expiration. You will need to re-install packages on the three OSs.
By rebuild I mean start from scratch:
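A sketch of what starting from scratch means here, assuming the repository lives in ./test_tuf (the --path used earlier) and that you stop whatever is serving it first:
# stop the local TUF server (however it was started), then wipe the repository
rm -rf ./test_tuf
# also remove previously generated packages so stale ones don't get installed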
Run the same main.sh invocation but with the following additional variable:
This means the snapshot role will expire after 10m (10m to give time for the packages to be generated).
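To confirm when the snapshot metadata actually expires, you can read its expires field; the URL and path below assume the repository is reachable from the host on port 8081 and that metadata is served at the repository root, so adjust as needed:
# print the snapshot role's expiration timestamp
curl -s http://localhost:8081/snapshot.json | jq -r '.signed.expires'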
Wait for the snapshot role to expire; agents should continue to work (just error logs about the snapshot being expired).
Restart fleetd and make sure it's up and running after the restart, with just the TUF error in the logs.
C.2.2 Snapshot role is fixed
Fix the snapshot expiration by rotating the root key:
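A sketch of the rotation, assuming fleetctl's updates rotate subcommand takes the repository --path and the role name (verify with ./build/fleetctl updates rotate --help); the same invocation applies to the targets and root cases below:
# rotate the root key; per these steps, this also refreshes the expired metadata
./build/fleetctl updates rotate --path ./test_tuf root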
Agents should restart after some time and then continue to work as usual (no more TUF error logs).
C.3.1 Targets role expired
We'll rebuild the TUF repository to avoid error-prone steps when testing targets expiration. You will need to re-install packages on the three OSs.
By rebuild I mean start from scratch:
Run the same main.sh invocation but with the following additional variable:
This means the targets role will expire after 10m (10m to give time for the packages to be generated).
Wait for the targets role to expire; agents should continue to work (just error logs about the targets being expired).
Restart fleetd and make sure it's up and running after the restart, with just the TUF error in the logs.
C.3.2 Targets role is fixed
Fix the targets expiration by rotating the root key:
Agents should restart after some time and then continue to work as usual (no more TUF error logs).
C.4.1 Root role expired
We'll rebuild the TUF repository to avoid error-prone steps when testing root expiration. You will need to re-install packages on the three OSs.
By rebuild I mean start from scratch:
Run the same main.sh invocation but with the following additional variable:
This means the root role will expire after 10m (10m to give time for the packages to be generated).
Wait for the root role to expire; agents should continue to work (just error logs about the root being expired).
Restart fleetd and make sure it's up and running after the restart, with just the TUF error in the logs.
C.4.2 Root role is fixed
Fix the root expiration by rotating the root key:
Agents should restart after some time and then continue to work as usual (no more TUF error logs).
C.5.1 Test with everything expired
We'll rebuild the TUF repository to avoid error-prone steps when testing expiration of all roles. You will need to re-install packages on the three OSs.
By rebuild I mean start from scratch:
Run the same main.sh invocation but with the following additional variables:
This means all roles will expire after 10m (10m to give time for the packages to be generated).
Wait for all roles to expire; agents should continue to work (just error logs about the roles being expired).
Restart fleetd and make sure it's up and running after the restart, with just the TUF error in the logs.