Error in MDM checkout command handler short circuits database cleanups and audit log

gillespi314 commented 1 year ago

🧑‍💻 Expected behavior

When a device checks out (unenrolls) from Fleet MDM, we want to log that activity and cleanup related entries in the database. As part of the log entry, we want to indicate whether the device is checking out of automatic (DEP) or manual MDM. If there is an error getting the manual/automatic MDM status from the DB, the cleanups should still proceed and the activity log should be updated with an indication of missing information.

💥 Actual behavior

An error reading manual/automatic MDM status from the DB short circuits the cleanups and activity logging.

More info

Consider restructuring the checkout method to handle these scenarios gracefully. https://github.com/fleetdm/fleet/blob/3b942030c989483ec7f53069e4f9b2b930151858/server/service/apple_mdm.go#L1006-L1026

xpkoala commented 1 year ago

@gillespi314 Does this need reproducing from me or is there enough information for work to start on this?

gillespi314 commented 1 year ago

@xpkoala no need for reproduction

lukeheath commented 1 year ago

@gillespi314 Assigning you to determine with @lucasmrod and @roperzh to determine:

How do we want to handle this edge case?
Is this a bug we should address now, or tech debt we can defer to later?

xpkoala commented 1 year ago

@lukeheath added to the product board. This might have been lost a week or two ago.

mikermcneil commented 1 year ago

Thanks for reporting this @gillespi314!

Tech debt is one of those words that can mean different things to different people, so just dropping in some food for thought re: prioritization:

What's the impact (worst case scenario) to users of leaving this unresolved? If not broken, would it feel unexpected? How hard would someone have to be looking for it to feel unexpected?
What about the worst case scenario to contributors? Will it lead to misunderstandings and bugs? Will it lead to introducing misinformed changes? What about if there's a production incident? Does this make it harder to understand what's going on in the database? Would someone dealing with an urgent issue misunderstand the state of the database because of this?
If there's sufficient user impact or violation of expectations, it's a bug and worth prioritizing an unplanned bug fix ASAP in the current sprint.
If there's sufficient contributor impact / risk, then it's related to #contributor-experience and is worth prioritizing for the next sprint.

From what I can read here, it sounds like not fixing this means that user databases are getting corrupted with data that will have to be cleaned up in a future database migration job. The longer we wait to fix it, the more permutations of corrupted data there are that could trip us up and that we'll eventually have to fix.

To me, it sounds like it might already be a bug with user impact as well, or it might not be, but that we don't actually know for sure, because we haven't let this run and get corrupted in every possible way, and then done manual QA with automated tests.

If so, then It's probably easier to just fix it than to find out definitively, right?

Up to @lukeheath

mikermcneil commented 1 year ago

Taking a pass at this:

As a user in Fleet who had recently fiddled around in (or accidentally messed up) the MySQL database, I don't want Fleet to silently fail to track activity and pretend nothing went wrong.

lukeheath commented 1 year ago

Closing out in favor #9915. Thanks @gillespi314

fleet-release commented 1 year ago

Error resolved, swift Cleanups and logs prosper Cloud city gleams bright

fleetdm / fleet