Closed chuckyz closed 4 weeks ago
Hey @chuckyz, thanks for opening the PR with the improvement. I updated this issue to use Fleet's standard user story template.
I moved your original issue description below. Please let me know if I'm missing anything in the updated description!
When Orbit is mass deployed and an issue during that deployment causes the enroll step to retry, the retries happen at a fixed interval. In some cases that interval is too short, which puts a lot of stress on the server cluster.
FLEETD_ENROLL_RETRY_INTERVAL
retry
package.
Hey @sharon-fdm and @lucasmrod heads up, since there's an open PR for this user story, I pulled this user story into the release board.
This way, we can track the progress of getting the PR reviewed and merged in the upcoming sprint.
cc @lukeheath
@lucasmrod @sharon-fdm Should this be in the "In review" column?
@lukeheath This is a PR from the community that needs some modifications. Work on it hasn't started yet.
Hi @chuckyz!
Next week I will be working on this during the current sprint. I may have to make a separate PR because I can't push to your fork (unless you have the time and are planning on making the requested changes).
Let me know what works for you.
@noahtalerman I didn't add any new configuration. The default will now be to always do exponential backoff whenever there are enroll failures in fleetd (https://github.com/fleetdm/fleet/pull/17368/files#r1512885699).
(I don't see a reason to not do backoff when there are fleetd enroll failures.)
Let me know if it makes sense.
@lucasmrod nice! Not adding any new configuration and instead updating the behavior for everyone (default) is always a win.
Makes sense that we should back off by default.
@chuckyz what do you think? Heads up, Lucas opened a fresh PR here: #17368
If you get the chance, would love your feedback.
Following are the scenarios to test for QA:
@xpkoala/@sabrinabuckets
All tests must be performed on all three OSs (macOS, Windows, and Linux).
Scenarios:
A. Test a package with an invalid enroll secret:
SYSTEMS="macos windows linux" \
PKG_FLEET_URL=https://localhost:8080 \
PKG_TUF_URL=http://localhost:8081 \
DEB_FLEET_URL=https://host.docker.internal:8080 \
DEB_TUF_URL=http://host.docker.internal:8081 \
RPM_FLEET_URL=https://host.docker.internal:8080 \
RPM_TUF_URL=http://host.docker.internal:8081 \
MSI_FLEET_URL=https://host.docker.internal:8080 \
MSI_TUF_URL=http://host.docker.internal:8081 \
GENERATE_PKG=1 \
GENERATE_DEB=1 \
GENERATE_RPM=1 \
GENERATE_MSI=1 \
ENROLL_SECRET=INVALID_ENROLL_SECRET \
FLEET_DESKTOP=1 \
USE_FLEET_SERVER_CERTIFICATE=1 \
DEBUG=1 \
./tools/tuf/test/main.sh
Expected result: You should see enroll failures and retries with a backoff: 10s, 20s, 40s, 80s, 160s, and then it starts over.
B. After (A) is done, push a dummy update to orbit. It should auto-update even if it hasn't enrolled in Fleet. (It may take up to 5 minutes for it to auto-update.)
# Dummy change to change the output of `sudo orbit version`
sed -i '' 's/fmt.Println("orbit "/fmt.Println("orbit2 "/' ./orbit/cmd/orbit/orbit.go
# Use GOARCH=arm64 instead if you are on an M1 Mac
GOOS=darwin GOARCH=amd64 go build -o orbit-darwin ./orbit/cmd/orbit
./tools/tuf/test/push_target.sh macos orbit orbit-darwin 42
# To verify that it auto-updated successfully, you can run:
sudo orbit version
C. Smoke test packages with a valid enroll secret (fleetd should enroll successfully).
D. After testing (C), delete the three hosts from Fleet and they should re-enroll successfully.
E. After (D) is done, push a dummy update to orbit and it should auto-update successfully.
# Dummy change to change the output of `sudo orbit version`
sed -i '' 's/fmt.Println("orbit2 "/fmt.Println("orbit3 "/' ./orbit/cmd/orbit/orbit.go
# Use GOARCH=arm64 instead if you are on an M1 Mac
GOOS=darwin GOARCH=amd64 go build -o orbit-darwin ./orbit/cmd/orbit
./tools/tuf/test/push_target.sh macos orbit orbit-darwin 42
# To verify that it auto-updated successfully, you can run:
sudo orbit version
@noahtalerman was this supposed to be closed out?
Hey @zayhanlon, yes. Looking at the date this was moved to the drafting board (Apr 4), I think this one got lost in the ZenHub boards.
@lucasmrod, did this story release a new config or did we update the default behavior for everyone?
If we added a new config, what's the config? I think we want to make sure we updated any reference docs (server config options, API, GitOps, Agent options).
I just realized that Lucas is OOO.
Hey @sharon-fdm do you know the answer to the above?
See https://github.com/fleetdm/fleet/issues/16594#issuecomment-1978928746.
No new configuration.
By default, upon enroll failures, fleetd will use the following retry intervals (exponential backoff): 10s, 20s, 40s, 80s, 160s and then return the failure (max attempts = 6) thus executing no more than ~6 enroll request failures every ~5 minutes.
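For illustration, the schedule described above can be sketched in shell. This only reproduces the arithmetic (double the wait after each failed attempt, stop after the last attempt); it is not the fleetd implementation, and `enroll_backoff_intervals` and its two parameters are hypothetical names made up for this example.

```shell
#!/bin/sh
# Illustrative sketch only: prints the waits between enroll attempts.
# The function name and its parameters are hypothetical, not fleetd code.
enroll_backoff_intervals() {
  initial=${1:-10}      # first retry interval, in seconds
  max_attempts=${2:-6}  # total enroll attempts in one cycle
  interval=$initial
  attempt=1
  # N attempts are separated by N-1 waits.
  while [ "$attempt" -lt "$max_attempts" ]; do
    echo "$interval"
    interval=$((interval * 2))
    attempt=$((attempt + 1))
  done
}

# Sum the waits to get the approximate length of one retry cycle.
total=0
for s in $(enroll_backoff_intervals); do
  total=$((total + s))
done

echo "waits (s): $(enroll_backoff_intervals | tr '\n' ' ')"
echo "cycle length: ${total}s"
```

With the defaults, the waits are 10, 20, 40, 80, and 160 seconds, which add up to 310 seconds. That is where the "~6 enroll request failures every ~5 minutes" figure comes from.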
Thanks Lucas! Closing this issue.
cc @zayhanlon
Orbit's steady pulse, Tamed by thoughtful code and care, Servers breathe easier.