Mark "verifying" or "verified" macos profiles as "failed" if osquery cannot confirm they are there

zhumo commented 1 year ago

Goal

User story
As an IT admin,
I want to know whether my config profile has been confirmed by osquery
so that I can be certain that my fleet has the correct config profiles.

Each distributed interval, run check the "expected" set of profiles against the "actual" set of profiles found by osquery
If any expected profile is not found among the actual profiles, set that profile to the "failed" status
When a profile is marked as "failed" in this manner, apply the error message as shown in the attached designs. Note that there are different error messages for a failed "verifying" profile vs. a failed "verified" profile.

Changes

This issue's estimation includes completing:

[ ] UI changes: https://www.figma.com/file/hdALBDsrti77QuDNSzLdkx/%F0%9F%9A%A7-Fleet-EE-(dev-ready%2C-scratchpad)?type=design&node-id=18262-227052
[ ] CLI usage changes: TODO
[ ] REST API changes: TODO
[ ] Permissions changes: TODO
[ ] Database schema migrations: TODO
[ ] Outdated documentation changes: TODO
[ ] Scope transparency changes? TODO
[ ] Breaking changes requiring major version bump? TODO
[ ] Changes to paid features or tiers? TODO
[ ] QA complete?
[ ] ...

ℹ️ Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

Context

Requestor(s): _____

QA

Risk assessment

[ ] Requires load testing TODO

Risk level: Low / High TODO

Risk description: TODO

Automated:

Fleet: Cover / Will not cover
QAWolf: Cover / Will not cover

Manual testing steps

Step 1
Step 2
Step 3

Testing notes

Confirmation

[ ] Engineer (@____): Added comment to user story confirming succesful completion of QA.
[ ] QA (@____): Added comment to user story confirming succesful completion of QA.

zhumo commented 1 year ago

@georgekarrv @gillespi314 @roperzh this is the story you should use for estimation tomorrow. We down-scoped the work.

zhumo commented 1 year ago

@georgekarrv @gillespi314 @roperzh hey, quick note here: I updated the requirements to reflect that there should be two separate error messages for a failed "verifying" profile vs. a failed "verified" profile. Hopefully that doesn't change the estimation too much.

gillespi314 commented 1 year ago

Dropping a link to this thread regarding edge cases for future reference and as something to be considered in the context of designing the retry feature.

tl;dr In some cases when MDM is turned on for a host or a host switches teams, unfortunate timing of the osquery detail query may cause a profile to get marked as failed. This can happen if the query runs during the window after the install profile command has been acknowledged by the host (i.e. Fleet status verifying) but before the profile is fully installed on the host. A few factors mitigate the impact of this edge case: In practice, this window should be quite narrow. And if it does occur, it is quite possible that the profiles will be in fact installed and the device is in the desired state even though it appears to be failed in Fleet (something that could be confirmed manually by the admin running a live query).

zhumo commented 1 year ago

thanks @gillespi314. In the event that happens, is it the case that when the distributed itnerval runs again, it'll check all expected profiles vs. all seen profiles and then re-mark them as failing or not? So then the second time the distributed interval runs, the profile will be properly marked?

gillespi314 commented 1 year ago

In the event that happens, is it the case that when the distributed interval runs again, it'll check all expected profiles vs. all seen profiles and then re-mark them as failing or not? So then the second time the distributed interval runs, the profile will be properly marked?

As currently implemented, the status won't change once it reaches the failed state. It's something we could potentially implement as an in-between step short of redelivering failed profiles.

noahtalerman commented 1 year ago

Thanks @gillespi314.

the status won't change once it reaches the failed state...we could potentially implement as an in-between step short of redelivering failed profiles.

Got it. I think we'll want to implement something to properly mark the profile.

I added this to the redeliver story so that we have something tracked:

At each distributed interval, check all expected profiles v. all seen profiles and re-mark them as "Failed" or "Verified"

This means that "Pending" profiles will be moved to "Failed" if they're missing. "Failed" profiles will be moved to "Verified" if they're present.

Does that make sense to you?

cc @zhumo

gillespi314 commented 1 year ago

@noahtalerman Yes, that lines up with what I was thinking too.

sabrinabuckets commented 1 year ago

Able to get a device into Failed status in case one (profile installed successfully but unable to verify): Screenshot 2023-06-23 at 10.34.49 AM.png Screenshot 2023-06-23 at 10.37.03 AM.png

Will proceed with attempting to force case two (verified but since found missing).

sabrinabuckets commented 1 year ago

@gillespi314 in my above comment, I was working with a device that I enrolled & accidentally transferred teams before the verification completed, so I was assuming that was the cause of the failure (possibly related to 12452?). However, I have had two machine in a row reach Failed status—despite successful profile delivery & disk encryption flow—without any intervention during enrollment. Is that something already being accounted for & I should hold off further testing, or does that sound like a new issue?

sabrinabuckets commented 1 year ago

Testing the second Failed state—previously Verified but found missing—proved to be difficult. Apple has changed the behavior of config profiles to be unremovable. even on manually enrolled devices, with the exception of the MDM enrollment profile. Attempting to remove a custom profile via the UI and the command line both failed, and removing the MDM profile only triggered the re-enroll prompt.

However, @roperzh was able to point me to a command that could be run via fleetctl that forced a profile removal, and I was able to verify the error message: Screenshot 2023-06-27 at 4.14.18 PM.png

This secondary Failed state is unlikely to affect many users, given the difficulty with removal.

zhumo commented 1 year ago

C&C: @noahtalerman to check that the profile status docs are updated with this information. https://fleetdm.com/docs/using-fleet/mdm-custom-macos-settings#step-3-confirm-the-setting-is-enforced

noahtalerman commented 1 year ago

check that the profile status docs are updated with this information. https://fleetdm.com/docs/using-fleet/mdm-custom-macos-settings#step-3-confirm-the-setting-is-enforced

This PR to the docs adds this information: #12806

fleet-release commented 1 year ago

Profiles checked each beat, MacOS fleet now complete, No error, just neat.

fleetdm / fleet