Add critical policy and resolution data to device health API

dherder commented 7 months ago

Goal

User story
As an endpoint operator,
I want to get a count of failing critical policies and resolution steps in Fleet's device health API (`GET /hosts/:id/health`)
so that I can block end users' access to third-party tools if they're failing at least 1 critical policy and show them the resolution steps.

Context

Requestor(s): @dherder
Product designer: @noahtalerman

Changes

Product

[ ] REST API changes: API design is included in the PR to the REST API docs: https://github.com/fleetdm/fleet/pull/16982
[ ] Outdated documentation changes: Covered by the PR to the REST API docs: https://github.com/fleetdm/fleet/pull/16982
[ ] Changes to paid features or tiers: The new failing_critical_policies and critical properties are only available in Fleet Premium.

Engineering

[ ] Database schema migrations: TODO
[ ] Load testing: TODO

ℹ️ Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

Requires load testing: TODO
Risk level: Low / High TODO
Risk description: TODO

Manual testing steps

Step 1
Step 2
Step 3

Testing notes

Confirmation

[ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
[ ] QA (@____): Added comment to user story confirming successful completion of QA.

harrisonravazzolo commented 7 months ago

+1 🙇🏼

dherder commented 6 months ago

bringing back to Feature Fest as per @mikermcneil

harrisonravazzolo commented 6 months ago

Ideally, the endpoint would look something like this when hitting /api/v1/fleet/hosts/{deviceID}/health

Note the device_attestation value now in the payload.

{
  "host_id": 1,
  "health": {
    "updated_at": "2023-09-16T18:52:19Z",
    "os_version": "MacOS 14.1.2",
    "disk_encryption_enabled": true,
    "device_attestation": "passing",
    "failing_policies": [
      {
        "id": 123,
        "name": "Google Chrome is not up to date"
      }
    ],
    "vulnerable_software": [
      {
        "id": 321,
        "name": "Firefox.app",
        "version": "116.0.3"
      }
    ]
  }
}

As we want this to be self-service remediation, we would use the failing_policies and vulnerable_software to craft the Slack message to the end user, like such:
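A minimal sketch of that idea, assuming the payload shape proposed in this comment: it turns `failing_policies` and `vulnerable_software` into plain message text. The function name `build_remediation_message` and the message wording are hypothetical, and posting to Slack is left out.

```python
def build_remediation_message(payload: dict) -> str:
    """Build plain-text remediation guidance from a device health payload.

    Assumes the payload shape proposed in this thread, not Fleet's final API.
    """
    health = payload.get("health", {})
    lines = []
    # One line per failing policy, by name.
    for policy in health.get("failing_policies", []):
        lines.append(f"Failing policy: {policy['name']}")
    # One line per vulnerable software title, with its installed version.
    for software in health.get("vulnerable_software", []):
        lines.append(f"Vulnerable software: {software['name']} {software['version']}")
    if not lines:
        return "Your device is healthy."
    return "Your device needs attention:\n" + "\n".join(f"- {line}" for line in lines)
```

The returned string could then be passed to whatever messaging step the workflow already uses.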

dherder commented 6 months ago

@noahtalerman can we do this first (and then later do the suggestion @harrisonravazzolo notes in the above comment): If any of the "critical" policies are failing, the count of failing critical policies will be included at the top level of the device health response in a new key.

harrisonravazzolo commented 6 months ago

Yes, there are a couple of iterations of this flowing in my head - ask @mikermcneil about the 'weighted' concept per policy we talked about.

But as @dherder stated, as a start, any policy marked as critical that is failing makes the device_attestation value go from passing to failing. Or a boolean value, or whatever the smarter-than-me people think is the best verbiage.

noahtalerman commented 6 months ago

Hey @harrisonravazzolo for the first iteration, would a failing_policies_count work for your use case?

cc @dherder

harrisonravazzolo commented 6 months ago

Hey @noahtalerman - not really, I can explain.

We already have this value in the /hosts endpoint that I am currently leveraging for the device trust flows but it requires a bit of computation on my side to calculate a threshold of policies I want to determine if the device is considered passing or failing in my environment. Like I know we have 40 policies (this can change often), so off this number I need to see, per device, how many are failing and which ones. Not all policies are critical.

What I'm looking for is a new key in the health endpoint that is calculated by policies we mark as 'critical' - no critical failures = pass, critical failures = fail

For example, let's say that a zero-day in Chrome is patched and I want to create a basic policy that checks if Chrome.app > 121.0.6167.184. Create the policy, mark it as critical and now that is part of the calculation to the new key value returned.

If I wanted to do this now, I would have to hard-code this new policy into my automation to say: go fetch the device health for this host, iterate through the list of failing policies, do a lookup, and if this value exists in the failing policies, do this. It's a lot of computation that I think Fleet should be able to do easily to allow this to scale up.
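For illustration, the hard-coded workaround described above might look roughly like this. It is a sketch only: the `CRITICAL_POLICY_NAMES` set and the function name are hypothetical, and the payload is assumed to nest `failing_policies` under `health` as in the example earlier in the thread.

```python
# Hypothetical, hand-maintained set of policy names the automation treats as
# critical. Every new critical policy has to be added here by hand.
CRITICAL_POLICY_NAMES = {"Google Chrome is not up to date"}


def is_failing_hardcoded_critical(payload: dict) -> bool:
    """Return True if any failing policy matches the hard-coded critical set."""
    failing = payload.get("health", {}).get("failing_policies", [])
    return any(policy.get("name") in CRITICAL_POLICY_NAMES for policy in failing)
```

The maintenance burden is the set itself: the automation has to change whenever a critical policy is added or renamed.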

noahtalerman commented 6 months ago

> What I'm looking for is a new key in the health endpoint that is calculated by policies we mark as 'critical' - no critical failures = pass, critical failures = fail

@harrisonravazzolo ah, ok! Thanks.

If I'm understanding correctly, if a host is failing > 0 critical policies it's considered "unhealthy." At this point, the end user is blocked from third-party tools until they resolve the critical policies.

Would a failing_critical_policies_count work? If > 0, then the end user is blocked.

I imagine it would also be useful to add a critical property to each policy in the failing_policies array (GET /hosts/:id/health) so you can show the resolution steps for these policies to the user.
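A rough sketch of how a consumer could use these two proposed properties. The assumptions are that `failing_critical_policies_count` sits alongside `failing_policies` under `health`, and that each policy carries a `resolution` string. The `resolution` field name and the function name `evaluate_access` are hypothetical.

```python
def evaluate_access(payload: dict) -> tuple[bool, list[str]]:
    """Return (blocked, resolution_steps) for a device health payload.

    Field placement and the "resolution" key are assumptions for illustration.
    """
    health = payload.get("health", {})
    # Block the end user when at least one critical policy is failing.
    blocked = health.get("failing_critical_policies_count", 0) > 0
    # Collect resolution text only for failing policies marked critical.
    steps = [
        policy["resolution"]
        for policy in health.get("failing_policies", [])
        if policy.get("critical") and policy.get("resolution")
    ]
    return blocked, steps
```

With this shape the automation no longer needs its own list of critical policies. Marking a policy as critical in Fleet is enough to change the outcome.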

harrisonravazzolo commented 6 months ago

@noahtalerman I think this would work as a launching-off point!

I think for this use case it's best to keep it in the /health endpoint, as most of the other hosts endpoints return too much data.

I also like your suggestion of adding the critical property to the array, would be very helpful for presenting the resolution steps for sure.

noahtalerman commented 6 months ago

Hey @dherder I moved your original issue description here:

As a user using the device health API endpoint in my Okta workflow, I want to have access to the number of failing policies at the top level of the JSON so that I don't have to use as many Okta steps to determine whether or not to allow access

Problem

Today, I can only get the policy failure count from the hosts endpoint, not the device health endpoint. I want to make a single call to the device health endpoint and get the "Percent policy failure" and "Count policy failure" per host.

https://fleetdm.com/docs/rest-api/rest-api#get-hosts-device-health-report

If anything from the "critical" policies are failing, the count of critical policies will be included at the top level of the device health response in a new key.

noahtalerman commented 6 months ago

> I think for this use case it's best to keep it in the /health endpoint, as most of the other hosts endpoints return too much data.
>
> I also like your suggestion of adding the critical property to the array, would be very helpful for presenting the resolution steps for sure.

@harrisonravazzolo re including the info in the /health endpoint: agreed 💯

I also think this endpoint should return the resolution instructions so that you can show them to the user.

This pull request includes the proposed API changes: https://github.com/fleetdm/fleet/pull/16982

What do you think? Does this work for you?

harrisonravazzolo commented 6 months ago

[GIF]

noahtalerman commented 6 months ago

I reviewed this air guitar with @mikermcneil.

Let's move this user story forward with the formal drafting process leading to engineering.

@sharon-fdm heads up, I assigned this user story to you and moved it over to settled. I think it's ready for specs + estimation.

sharon-fdm commented 6 months ago

Sounds good @noahtalerman. I'll catch up on the thread here.

sharon-fdm commented 6 months ago

@noahtalerman long conversation here. What I understand is that this is the TL;DR:

As a user using the device health API endpoint in my Okta workflow, I want to have access to the number of failing policies at the top level of the JSON so that I don't have to use as many Okta steps to determine whether or not to allow access

Problem

Today, I can only get the policy failure count from the hosts endpoint, not the device health endpoint. I want to make a single call to the device health endpoint and get the "Percent policy failure" and "Count policy failure" per host.

https://fleetdm.com/docs/rest-api/rest-api#get-hosts-device-health-report

If anything from the "critical" policies are failing, the count of critical policies will be included at the top level of the device health response in a new key.

noahtalerman commented 6 months ago

> I want to make a single call to the device health endpoint and get the "Percent policy failure" and "Count policy failure" per host.

@sharon-fdm not quite. Instead, I want to make a single call to the device health endpoint and get the count of failing critical policies and the count of failing policies per host.
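As a sketch of that single call, the helpers below build the request and read both counts from the response body. The path comes from earlier in this thread. The `failing_policies_count` and `failing_critical_policies_count` keys are the names floated in the discussion, and their placement under `health` is an assumption. The function names, base URL, and token are hypothetical.

```python
import json
import urllib.request


def build_health_request(base_url: str, host_id: int, api_token: str) -> urllib.request.Request:
    """Build a GET request for one host's device health report."""
    url = f"{base_url.rstrip('/')}/api/v1/fleet/hosts/{host_id}/health"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_token}"})


def read_policy_counts(response_body: str) -> tuple[int, int]:
    """Return (failing_policies_count, failing_critical_policies_count).

    Assumes both counts live under "health"; missing keys default to 0.
    """
    health = json.loads(response_body).get("health", {})
    return (
        health.get("failing_policies_count", 0),
        health.get("failing_critical_policies_count", 0),
    )
```

In a real workflow the request would be sent with `urllib.request.urlopen` and the decoded body passed to `read_policy_counts`.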

For user stories, the comment section is used for ideating during drafting/design. We don't clean it up when a story is "Settled."

When a story is "Settled," please use the issue description for the summary of what we want to change: https://github.com/fleetdm/fleet/issues/16206#issue-2089022060

mostlikelee commented 5 months ago

waiting on comments in draft PR

noahtalerman commented 5 months ago

Hey @harrisonravazzolo, heads up, this improvement won't make it into the upcoming 4.48 release.

Plan is to ship this in the 4.49 release.

cc @sharon-fdm @dherder @Patagonia121

noahtalerman commented 5 months ago

> this feature won't make it into the upcoming 4.48 release.
>
> Plan is to ship this in the 4.49 release.

Also, FYI @spokanemac for the release article.

noahtalerman commented 4 months ago

Hey @dherder and @Patagonia121, heads up, this customer request was shipped in 4.49 🎉

Docs are still TODO. PR is here: https://github.com/fleetdm/fleet/pull/16982

rachaelshaw commented 4 months ago

New PR here: https://github.com/fleetdm/fleet/pull/18715 (to avoid messing with PR open time KPI)

rachaelshaw commented 4 months ago

Docs are merged

fleet-release commented 4 months ago

Policy data clear, Device health API secure, Fleet's path shines bright here.
