Configure Orbit (component of fleetd) enrollment retry backoff

chuckyz commented 7 months ago

Goal

User story
As an endpoint operator deploying Fleet's agent (fleetd) on thousands of hosts,
I want to configure Orbit's (component of fleetd) enrollment retry backoff for when enrollment fails
so that I can reduce the amount of stress on the Fleet server.

Changes

Product

[x] fleetd changes: https://github.com/fleetdm/fleet/pull/17368
[ ] Outdated documentation changes: No documentation needed.

Engineering

[ ] Database schema migrations: TODO
[ ] Load testing: TODO

ℹ️ Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

Context

Requestor(s): _____

QA

Risk assessment

Requires load testing: TODO
Risk level: Low / High TODO
Risk description: TODO

Manual testing steps

Step 1
Step 2
Step 3

Testing notes

Confirmation

[ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
[ ] QA (@____): Added comment to user story confirming successful completion of QA.

noahtalerman commented 6 months ago

Hey @chuckyz, thanks for opening the PR with the improvement. I updated this issue to use Fleet's standard user story template.

I moved your original issue description below. Please let me know if I'm missing anything in the updated description!

Problem

When Orbit is mass deployed and an issue during that deployment causes the enroll step to retry, the retries happen at a constant interval. In some cases this interval is too short, which puts a lot of stress on the server cluster.

Potential solutions

Increase FLEETD_ENROLL_RETRY_INTERVAL
Add a basic backoff mechanism into the retry package.

noahtalerman commented 6 months ago

Hey @sharon-fdm and @lucasmrod heads up, since there's an open PR for this user story, I pulled this user story into the release board.

This way, we can track the progress of getting the PR reviewed and merged in the upcoming sprint.

cc @lukeheath

lukeheath commented 6 months ago

@lucasmrod @sharon-fdm Should this be in the "In review" column?

sharon-fdm commented 6 months ago

@lukeheath This is a PR from the community that needs some modifications. Work on it hasn't started yet.

lucasmrod commented 6 months ago

Hi @chuckyz!

Next week I will be working on this during the current sprint. I may have to make a separate PR because I can't push to your fork (unless you have the time and are planning on making the requested changes).

Let me know what works for you.

lucasmrod commented 6 months ago

@noahtalerman I didn't add any new configuration. The default will now be to always do exponential backoff whenever there are enroll failures in fleetd (https://github.com/fleetdm/fleet/pull/17368/files#r1512885699).

(I don't see a reason to not do backoff when there are fleetd enroll failures.)

Let me know if it makes sense.

noahtalerman commented 6 months ago

I didn't add any new configuration. The default will now be to always do exponential backoff whenever there are enroll failures in fleetd

@lucasmrod nice! Not adding any new configuration and instead updating the behavior for everyone (default) is always a win.

Makes sense that we should back off by default.

@chuckyz what do you think? Heads up, Lucas opened a fresh PR here: #17368

If you get the chance, would love your feedback.

lucasmrod commented 6 months ago

Following are the scenarios to test for QA:

@xpkoala/@sabrinabuckets

All tests must be performed on all three OSs (macOS, Windows, and Linux).

Scenarios:

A. Test a package with an invalid enroll secret:

SYSTEMS="macos windows linux" \
PKG_FLEET_URL=https://localhost:8080 \
PKG_TUF_URL=http://localhost:8081 \
DEB_FLEET_URL=https://host.docker.internal:8080 \
DEB_TUF_URL=http://host.docker.internal:8081 \
RPM_FLEET_URL=https://host.docker.internal:8080 \
RPM_TUF_URL=http://host.docker.internal:8081 \
MSI_FLEET_URL=https://host.docker.internal:8080 \
MSI_TUF_URL=http://host.docker.internal:8081 \
GENERATE_PKG=1 \
GENERATE_DEB=1 \
GENERATE_RPM=1 \
GENERATE_MSI=1 \
ENROLL_SECRET=INVALID_ENROLL_SECRET \
FLEET_DESKTOP=1 \
USE_FLEET_SERVER_CERTIFICATE=1 \
DEBUG=1 \
./tools/tuf/test/main.sh

Expected result: You should see enroll failures and retries with a backoff: 10s, 20s, 40s, 80s, 160s, and then it starts over.

B. After (A) is done, push a dummy update to orbit. It should auto-update even if it hasn't enrolled in Fleet. (It may take up to 5 minutes to auto-update.)

# Dummy change to change the output of `sudo orbit version`
sed -i '' 's/fmt.Println("orbit "/fmt.Println("orbit2 "/' ./orbit/cmd/orbit/orbit.go

# Use GOARCH=arm64 instead if you are on an M1 (Apple Silicon) Mac
GOOS=darwin GOARCH=amd64 go build -o orbit-darwin ./orbit/cmd/orbit
./tools/tuf/test/push_target.sh macos orbit orbit-darwin 42

# To verify that it auto-updated successfully, run:
sudo orbit version

C. Smoke test packages with a valid enroll secret (fleetd should enroll successfully).

D. After testing (C), delete the three hosts from Fleet and they should re-enroll successfully.

E. After (D) is done, push a dummy update to orbit and it should auto-update successfully.

# Dummy change to change the output of `sudo orbit version`
sed -i '' 's/fmt.Println("orbit2 "/fmt.Println("orbit3 "/' ./orbit/cmd/orbit/orbit.go

# Use GOARCH=arm64 instead if you are on an M1 (Apple Silicon) Mac
GOOS=darwin GOARCH=amd64 go build -o orbit-darwin ./orbit/cmd/orbit
./tools/tuf/test/push_target.sh macos orbit orbit-darwin 42

# To verify that it auto-updated successfully, run:
sudo orbit version

zayhanlon commented 1 month ago

@noahtalerman was this supposed to be closed out?

noahtalerman commented 1 month ago

Hey @zayhanlon, yes. Looking at the date this was moved to the drafting board (Apr 4), I think this one got lost in the ZenHub boards.

@lucasmrod, did this story release a new config or did we update the default behavior for everyone?

If we added a new config, what's the config? I think we want to make sure we updated any reference docs (server config options, API, GitOps, Agent options).

noahtalerman commented 1 month ago

I just realized that Lucas is OOO.

did this story release a new config or did we update the default behavior for everyone?

If we added a new config, what's the config? I think we want to make sure we updated any reference docs (server config options, API, GitOps, Agent options).

Hey @sharon-fdm do you know the answer to the above?

lucasmrod commented 4 weeks ago

See https://github.com/fleetdm/fleet/issues/16594#issuecomment-1978928746.

No new configuration.

By default, upon enroll failures, fleetd will use the following retry intervals (exponential backoff): 10s, 20s, 40s, 80s, 160s and then return the failure (max attempts = 6) thus executing no more than ~6 enroll request failures every ~5 minutes.

noahtalerman commented 4 weeks ago

No new configuration.

By default, upon enroll failures, fleetd will use the following retry intervals (exponential backoff): 10s, 20s, 40s, 80s, 160s and then return the failure (max attempts = 6) thus executing no more than ~6 enroll request failures every ~5 minutes.

Thanks Lucas! Closing this issue.

cc @zayhanlon

fleet-release commented 4 weeks ago

Orbit's steady pulse, Tamed by thoughtful code and care, Servers breathe easier.
