Add a unified way to surface important notifications to users

mna commented 1 year ago

Problem

As we add more interactions with third-party systems (e.g. Apple MDM) and more asynchronous work (cron jobs, MDM commands, etc.), the need for a way to surface important failures (or even more generally results) to the users is growing and we don't have a unified way to do this currently.

For example, we may validate at startup that the provided Apple Business Manager token is valid, but it may become invalid soon after, or Apple's terms may have changed and need to be accepted by the user before the token starts working again, etc. Many such failure scenarios require user intervention and possibly a configuration change, and may happen at any time after the Fleet instances start up.

Requirements

TBD. This is a reference ticket to keep track of this feature.

Potential Solutions

See comments below for a transcription of the slack thread for reference.

mna commented 1 year ago

Slack retranscription (from me):

Following a comment that Tomás made here (https://github.com/fleetdm/fleet/pull/8730#discussion_r1028317296), and a growing number of validations made at Fleet startup (on fleet serve, so on each fleet instance, e.g. https://github.com/fleetdm/fleet/issues/8725#issuecomment-1332294940), I wanted to raise a larger discussion about this.

When running fleet manually at the command line (interactively, user in front of the screen), those validations make sense and are decently user-friendly: if you provide e.g. an Apple Business Manager token that is expired, you get an immediate error message and Fleet is prevented from running. Same for e.g. an expired/revoked Apple APNs certificate. It's all good, you change the flag's value to point to a valid certificate/token/etc. and start it again.

However, in automated deployments similar to the ones we do in AWS for load testing/dogfood, it is not so obvious that this is the best experience - your fleet instances will fail to start and will likely enter a loop where AWS tries to get N instances up, and they all fail repeatedly due to e.g. an expired Apple BM token in their startup configuration. Not to mention the person in charge of the deployment might not be the person that can fix the thing that is failing.

Even worse (and to the initial point that Tomás was raising), as we keep adding validations at startup involving third-party network requests (e.g. pinging an Apple API to ensure the provided certificate is valid), we may encounter network slowdowns that make the check take too much time, resulting in AWS thinking the Fleet instance didn't come up, killing the instance and trying to spin up another one. Similarly, it could fail to start due to a transient network failure, even though the certificate/token/etc. that we wanted to validate is valid.

And even if there were no issues at startup, that validation is only partially useful - minutes later, the token that was valid at startup could be expired, or invalidated, the certificate could be revoked, etc. We have no good, unified way to capture those "runtime" issues and bring them up to the user's attention (basically, I think it just ends up in the logs). We have a ticket planned to show a banner when Apple BM terms have changed and the user needs to accept them (similar to how we show a banner if the fleet license is expired), but this is special-casing one of the many potential failures (https://github.com/fleetdm/fleet/issues/8537).

I wonder if we should plan a Notification system where all such issues can bubble up and be brought to the user's attention. I can think of many failures that could benefit from that, in addition to the Apple's terms changing:

Apple BM server token expiring
Apple BM server token being invalidated due to a change of password of the token's account
Apple APNs/SCEP certificate expiring/being revoked
MDM enrollment's default team not existing (we validate that the team exists when setting the default team, but nothing prevents the team from being deleted/renamed, and we match it by name), resulting in failure to add newly enrolled hosts to a team

That's just thinking about recent MDM-related stuff, I'm sure there are failure modes in the cron jobs that would be worth bringing forth in a notification system.

With that in place, I think we could limit the startup validations to just the required dependencies of Fleet itself (i.e. mysql, redis, ...) and anything third-party could be lazily validated on first use and raised as a notification, avoiding any network-related issues at startup (whether it's the slowdown, the transient errors, etc.). And I guess it would be easier to bring the issue to the right person that way (e.g. the person that can renew the Apple BM token).
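To make the "lazily validated on first use and raised as a notification" idea concrete, here is a minimal Go sketch. Everything in it is hypothetical and only for illustration: `Notifier`, `Notification`, `UseThirdParty` and `ErrTokenExpired` are not existing Fleet types, and the in-memory map stands in for whatever storage a real notification system would use.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrTokenExpired is a hypothetical validation error used for illustration.
var ErrTokenExpired = errors.New("Apple BM token expired")

// Notification is a hypothetical record of an issue that needs user attention.
type Notification struct {
	Key     string // stable identifier, e.g. "apple_bm_token"
	Message string
}

// Notifier keeps active notifications in memory; a real system would
// presumably persist them so every instance and the UI can read them.
type Notifier struct {
	mu    sync.Mutex
	items map[string]Notification
}

func NewNotifier() *Notifier {
	return &Notifier{items: make(map[string]Notification)}
}

// Raise records (or overwrites) the active notification for key.
func (n *Notifier) Raise(key, msg string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.items[key] = Notification{Key: key, Message: msg}
}

// Clear removes the notification for key, if any.
func (n *Notifier) Clear(key string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	delete(n.items, key)
}

// Get returns the active notification for key, if any.
func (n *Notifier) Get(key string) (Notification, bool) {
	n.mu.Lock()
	defer n.mu.Unlock()
	item, ok := n.items[key]
	return item, ok
}

// UseThirdParty validates a third-party dependency lazily, right before use.
// A failed validation is raised as a notification and returned to the caller,
// but it never stops the process.
func UseThirdParty(n *Notifier, key string, validate, use func() error) error {
	if err := validate(); err != nil {
		n.Raise(key, err.Error())
		return fmt.Errorf("%s unavailable: %w", key, err)
	}
	n.Clear(key) // the dependency is healthy again, drop any stale notification
	return use()
}

func main() {
	n := NewNotifier()
	err := UseThirdParty(n, "apple_bm_token",
		func() error { return ErrTokenExpired },
		func() error { return nil },
	)
	fmt.Println("call error:", err)
	if item, ok := n.Get("apple_bm_token"); ok {
		fmt.Println("notification:", item.Message)
	}
}
```

Keying notifications by a stable identifier means a repeated failure overwrites the previous entry instead of piling up, and a later successful validation clears it.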

From @chiiph :

it sounds like we'll want to have a "health check" cron that can populate a table with last checks that we can display in fleetctl. We can check as many things as we want here

that will also help a lot with debugging customer's deployments

as we add more options and moving parts, we'll need to improve introspection in a way like this

unless anybody is drastically opposed to this, this sounds like something that should go on a ticket, and we can work on it as part of the overall "holistic underlying tech stuff for mdm"
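A rough Go sketch of what such a "health check" pass could look like, under stated assumptions: `Check`, `CheckResult`, `RunChecks`, `RunOnce` and the two example checks are hypothetical names, and the returned map stands in for the table of last checks. A cron job would presumably call `RunChecks` on a schedule and persist the results.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// CheckResult is one row of the hypothetical "last checks" table.
type CheckResult struct {
	Name      string
	OK        bool
	Detail    string // error text when the check failed
	CheckedAt time.Time
}

// Check is a named health check; Run should honor ctx cancellation.
type Check struct {
	Name string
	Run  func(ctx context.Context) error
}

// RunChecks runs every check with its own timeout, so one slow third-party
// call cannot stall the others, and returns the latest result per check name.
func RunChecks(ctx context.Context, checks []Check, timeout time.Duration) map[string]CheckResult {
	results := make(map[string]CheckResult, len(checks))
	for _, c := range checks {
		cctx, cancel := context.WithTimeout(ctx, timeout)
		err := c.Run(cctx)
		cancel()

		res := CheckResult{Name: c.Name, OK: err == nil, CheckedAt: time.Now()}
		if err != nil {
			res.Detail = err.Error()
		}
		results[c.Name] = res
	}
	return results
}

// RunOnce is a convenience wrapper for the example: one pass, 5s per check.
func RunOnce(checks []Check) map[string]CheckResult {
	return RunChecks(context.Background(), checks, 5*time.Second)
}

// exampleChecks are stand-ins; real checks would ping MySQL, Apple APIs, etc.
var exampleChecks = []Check{
	{Name: "mysql", Run: func(ctx context.Context) error { return nil }},
	{Name: "apple_bm_token", Run: func(ctx context.Context) error {
		return errors.New("token expired")
	}},
}

func main() {
	results := RunOnce(exampleChecks)
	for _, c := range exampleChecks {
		res := results[c.Name]
		fmt.Printf("%-16s ok=%-5t %s\n", res.Name, res.OK, res.Detail)
	}
}
```

Because failures are recorded as rows rather than returned as fatal errors, the same data could feed fleetctl output, a UI banner, or debugging of a customer deployment.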

noahtalerman commented 1 year ago

Hey @mna is this work required to complete the "Accept new terms for Apple Business Manager" epic? I'm trying to understand if this is something we need to work on now or later.

mna commented 1 year ago

@noahtalerman No this is not required, it's more an improvement to think about in the longer term (a way to asynchronously notify users of different "background" issues).

noahtalerman commented 1 year ago

more an improvement to think about in the longer term

Got it!

roperzh commented 9 months ago

because we don't have this in place, we prevent the server from starting in some scenarios instead. This caused an outage for us in Dogfood because:

ABM certs were expired
A deploy was triggered, which caused fleet to restart
Fleet couldn't restart because the cert validation failed

mna commented 9 months ago

As mentioned by @rfairburn in Slack (https://fleetdm.slack.com/archives/C019WG4GH0A/p1704764862351069?thread_ts=1704759755.318459&cid=C019WG4GH0A), a quick(er) win could be to just send an email notification (possibly a few days before the expiration for things that have specific expiration dates like certs) and still prevent starting if it ends up expiring. This would just require an email to be configured for such critical notifications (and SMTP settings). Could be a first step towards the "notifications dashboard", or could end up being good enough for a while.

I mention this because there are surely some subtle and complex things to address if we want Fleet to start even when some settings are invalid (e.g. if MDM certs are invalid and we still start Fleet but with MDM disabled, would that remove profiles, result in weird errors in some pages, etc.?).
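For the "notify a few days before the expiration" part, a minimal Go sketch of the expiry classification could look like this. It only relies on the `NotAfter` field of the standard library's `x509.Certificate`; `ExpiryStatus`, `CheckExpiry` and `StatusInDays` are hypothetical names, the 30-day window in `main` is an arbitrary assumption, and actually sending the email is left out.

```go
package main

import (
	"crypto/x509"
	"fmt"
	"time"
)

// ExpiryStatus classifies a certificate relative to a point in time.
type ExpiryStatus int

const (
	StatusOK ExpiryStatus = iota
	StatusExpiringSoon
	StatusExpired
)

func (s ExpiryStatus) String() string {
	switch s {
	case StatusExpiringSoon:
		return "expiring soon"
	case StatusExpired:
		return "expired"
	default:
		return "ok"
	}
}

// CheckExpiry reports the status of cert at time now, and how much validity
// remains. A certificate within warnBefore of its NotAfter date is flagged
// as expiring soon so a notification can go out ahead of time.
func CheckExpiry(cert *x509.Certificate, now time.Time, warnBefore time.Duration) (ExpiryStatus, time.Duration) {
	remaining := cert.NotAfter.Sub(now)
	switch {
	case remaining <= 0:
		return StatusExpired, remaining
	case remaining <= warnBefore:
		return StatusExpiringSoon, remaining
	default:
		return StatusOK, remaining
	}
}

// StatusInDays is a convenience wrapper for the example: it builds a
// certificate that expires daysLeft days after a fixed reference time and
// checks it against a warning window of warnDays days.
func StatusInDays(daysLeft, warnDays int) ExpiryStatus {
	const day = 24 * time.Hour
	now := time.Date(2024, time.January, 1, 0, 0, 0, 0, time.UTC)
	cert := &x509.Certificate{NotAfter: now.Add(time.Duration(daysLeft) * day)}
	status, _ := CheckExpiry(cert, now, time.Duration(warnDays)*day)
	return status
}

func main() {
	for _, days := range []int{90, 10, -1} {
		fmt.Printf("%d days left: %s\n", days, StatusInDays(days, 30))
	}
}
```

A scheduled job could run this check and send the email whenever the status is anything other than `StatusOK`, while startup could keep refusing to run only on `StatusExpired`.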

nonpunctual commented 8 months ago

20240209 Robert Fairburn

All, here is the list of expiry dates for certs in cloud for MDM stuff: robert@Roberts-MacBook-Pro scripts % ./check_mdm_certs.py | sort -k2 ... Please note we are X days away from the APNS cert expiring. I am glad I did this today and didn't delay.

For customers that are self-hosted, they should be getting notifications from Apple at 60d & 30d before renewals, however, the problem is that often an email address is used to set up the account that isn't monitored. This is actually considered a best practice so that certs are not tied to email addresses that are retired if someone leaves an organization (which I have seen firsthand).

I would like to see these notification features prioritized if possible so customers get more awareness of cert renewals to avoid re-enrollment.

@noahtalerman I would like to understand how we see the customer impact of what I am describing as "re-enrollment" which to me means: devices which were managed are no longer managed. You described the impact as follows:

Noah Talerman [23 minutes ago] I did not explain that they would have to turn MDM off and on if they forgot. Ah, ok. Maybe we shoot them a reminder message in Slack that includes this? I think we should make sure to communicate the consequences of forgetting to renew APNs: all end users will have to go into System Settings > Profiles to turn MDM off and then go to Fleet Desktop to turn MDM back on.

That doesn't sound as bad to me as "devices are no longer managed". I am not at all meaning to overreact to this, but, for Jamf customers, if we were promising device management & then somehow (eg, a cert expiration) Jamf was responsible for devices becoming unmanaged that would be catastrophically bad customer experience.

Customers did have to re-enroll devices on a massive scale because of cert expiry. I don't think it was ever directly Jamf's fault but the customers certainly felt that it was because they weren't made aware of how important cert renewals were.

also see: https://github.com/fleetdm/fleet/issues/11544

noahtalerman commented 8 months ago

Hey @nonpunctual, the "in-Fleet" notifications are covered by the separate issue you linked to here: #11544

This issue is in the current design sprint. The plan is to work on it in the next engineering sprint.

I think we can close this issue (#8935) as a duplicate? @mna please correct me if I'm wrong.

mna commented 8 months ago

@noahtalerman Yeah, this issue is more broad/general-purpose than certs expiration, but certs are probably the most important notifications to surface right now, and we can open more targeted issues when/if we have other bits and pieces that we want to notify the users about. I'll close it, feel free to reopen if you disagree following my comment.

fleet-release commented 8 months ago

Notifications arrive, Guiding users through changes, Fleet's light in the cloud.
