fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Add ability to prioritize vulnerable software #4512

Closed noahtalerman closed 2 years ago

noahtalerman commented 2 years ago

Problem

I'm a user managing thousands of macOS, Windows, and/or Linux hosts and I'm overwhelmed with completing the following goal:

It's hard to make progress on the goal because I don't know where I should start. What vulnerable software should be updated/removed first?

Goals

  1. Know which vulnerable software should be updated/removed first.
    • We'll accomplish this by adding the ability to know the probability of exploit (reported by FIRST.org/epss), CVSS base score (reported by NVD), and whether or not there's a known exploit (reported by CISA) for each vulnerability.
    • Also, we'll order the Software table by probability of exploit.
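A minimal sketch of the ordering described above, assuming each software item carries a list of vulnerabilities with the three data points. The `Vulnerability` and `Software` types, their field names, and the choice to rank a software item by its highest EPSS probability are all illustrative assumptions, not Fleet's actual data model. The CVE IDs in the usage note are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Vulnerability:
    cve: str
    epss_probability: Optional[float] = None  # FIRST.org EPSS, 0.0 to 1.0 (may be missing)
    cvss_score: Optional[float] = None        # NVD CVSS base score, 0.0 to 10.0
    cisa_known_exploit: bool = False          # reported by CISA as having a known exploit


@dataclass
class Software:
    name: str
    version: str
    vulnerabilities: List[Vulnerability] = field(default_factory=list)


def max_epss(sw: Software) -> float:
    """Highest EPSS probability among the item's CVEs; -1.0 if none is scored."""
    scores = [
        v.epss_probability
        for v in sw.vulnerabilities
        if v.epss_probability is not None
    ]
    return max(scores, default=-1.0)


def order_by_probability_of_exploit(items: List[Software]) -> List[Software]:
    """Most likely to be exploited first; unscored software sorts last."""
    return sorted(items, key=max_epss, reverse=True)
```

For example, given three items whose only CVEs (placeholders `CVE-0000-0001`, `CVE-0000-0002`, `CVE-0000-0003`) have EPSS probabilities of 0.1, 0.9, and no score, the function returns them in the order 0.9, 0.1, unscored.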

Figma

Child issues

docs

noahtalerman commented 2 years ago

Potential way to determine priority of updating installed software...

noahtalerman commented 2 years ago

Feedback from Josh Brower:

mikermcneil commented 2 years ago

Feedback from customers:

Ryan: Hey just a follow-up from the meeting... I had a quick thought on how we might want to approach the vulnerability ranking system. Since we're essentially working with y'all to help define heuristics, I was thinking the below might be a good place to start. This could be scaled to your other customers as they could define their own metrics, weights, etc... 1) Define risk categories (Low, Medium, High, Critical) 2) Define metrics

Tony Gauda 4 hours ago Ryan- thanks for the suggestion. @Noah Talerman @mikermcneil FYI

mikermcneil < 1 minute ago I like it! Thanks Ryan. Up to @Noah Talerman on how to integrate this into our strategy and wireframes next. Looking forward to reviewing together. One thing that's coming to mind for me: from a UX perspective, I'd like to see us come up with variations (or some way of clearly distinguishing) "Low, Medium, High, Critical" risk scores versus what those words mean in CVSS-land.

I'll paste the above discussion in the issue so others in the community can participate.

mikermcneil commented 2 years ago

@noahtalerman One thing I really like about how you're thinking about risk scoring is basing it on patching timeframes / SLAs. Maybe this is "SLA"?

rymurph20 commented 2 years ago

I think we were having trouble creating some sort of distinction between Critical and High in the meeting. I would separate the two (briefly) like this...

Critical: Danger is imminent from remote attackers; drop everything and fix. (e.g. Log4Shell)

High: Vulnerability is trivial to exploit and should be prioritized but danger isn't necessarily imminent from remote attackers (e.g. LPE like Dirty Pipe, PolKit)

mikermcneil commented 2 years ago

@rymurph20 Fair to say by that definition: "Critical" == ≤24h, "High" == 1 week?

Also: @cjwalton

I get it about not having conflicting namespace with CVSS on vuln severity. I don't know if you are familiar with Traffic Light Protocol, but I wonder if there is something similar for security severity that can be referenced. I could research that.

That could work. We could even just call it "SLA" and have 4 convention-over-configuration levels, starting out based on feedback from you and anyone else who chimes in (that makes it easier to ship a working version more quickly). Then over time, we let people configure what those SLAs mean for them to support more use cases.

noahtalerman commented 2 years ago

I think from our perspective ideally we'd just give you metrics, thresholds and weights and the algorithm should spit out a risk score

@rymurph20, this makes a lot of sense.

It seems like the aggregate risk score is really good at bubbling up the "riskiest" software/hosts. This seems helpful for answering the question "What action can I take to make the biggest impact at reducing risk now?"

However, as discussed by Mike and Jason above, the risk score on its own doesn't seem to help answer the question "Which software/hosts can we wait to update until later this week or later this month?"

I could be very wrong about the above^

For now, in an attempt to address both of the above questions, we're taking the approach of grouping vulnerable software versions into "Year," "Month," "Week," and "Day" categories:

Then, Fleet is poised to present the above "Urgency" with additional context in the UI and API:

Critical: Danger is imminent from remote attackers; drop everything and fix. (e.g. Log4Shell)

High: Vulnerability is trivial to exploit and should be prioritized but danger isn't necessarily imminent from remote attackers (e.g. LPE like Dirty Pipe, PolKit)

@rymurph20, this distinction is super helpful. The "Day" urgency in the Figma wireframes I link to below is intended to account for the "drop everything and fix" scenario.

The "Week" urgency is intended to account for the "should be prioritized" scenario.

noahtalerman commented 2 years ago

@cjwalton and @rymurph20 when you get the chance, please take a look at the following Figma wireframes to see how the "Urgency" concept and additional context could be presented in Fleet: https://www.figma.com/file/hdALBDsrti77QuDNSzLdkx/?node-id=4764%3A179433

What are your thoughts on using the criteria defined in the above comment for "Urgency" and presenting additional context instead of an aggregate risk score?

Please feel free to add any feedback as comments in this issue :)

Please note that these wireframes are subject to change and further iteration.

noahtalerman commented 2 years ago

Goals that might be addressed in a later iteration:

zwass commented 2 years ago

I've been learning about EPSS and I think we should strongly consider using it for prioritization of vulnerabilities.

There's some helpful discussion of how to present EPSS scores in https://www.first.org/epss/articles/prob_percentile_bins.

cjwalton commented 2 years ago

@zwass - I was not familiar with EPSS until I read your comment, but that is exactly how I think about how vulnerability management should be addressed. Thanks for surfacing this and yes - this is a super forward-thinking way that Fleet should look at this.

noahtalerman commented 2 years ago

@chiiph I'm passing this issue's assignment to you. Can you, or another engineering team member, please check the feasibility of adding data like CVSS scores, known exploits, and EPSS scores to vulnerabilities in the GET /software API route?

This way, product can then take this research and determine how we'd like to display this data.

Links to sources for this data are included in the "Data" section in this issue's description.

cc @zwass

noahtalerman commented 2 years ago

Heads up, I'm adding this issue to the LEGACY #g-product board so that the product team is aware that this research is in progress.

juan-fdz-hawa commented 2 years ago

@noahtalerman I put together a small doc talking about CVSS and EPSS - TLDR:

Feel free to DM if you have any questions.

noahtalerman commented 2 years ago

@juan-fdz-hawa thank you for putting together that doc!

Even though EPSS scores make more sense and seem to be the future, CVSS scores still are the 'industry' standard, so we might want to use both.

We'd like Fleet to help the user determine what software is the most vulnerable.

This way, a Fleet user can patch the most vulnerable software first to achieve the goal of maintaining secure and compliant devices.

EPSS was created as a way to quantify the risk of a vulnerability so that it can be better prioritized

I think this means that determining a helpful way to surface the EPSS score will be more valuable than surfacing the CVSS score. @cjwalton and @rymurph20 what do you think about this?

For example, if we look at viruses in the USA, Ebola will probably have a high CVSS score, but a low EPSS score (because there are no cases in the USA)

This is an awesome analogy.

mikermcneil commented 2 years ago

Data point:

[image]

noahtalerman commented 2 years ago

Several takeaways following a conversation with a customer:

cc @cjwalton @rymurph20

noahtalerman commented 2 years ago

@zwass @chiiph @juan-fdz-hawa @michalnicp distilling the above feedback into a list of priorities here:

  1. Add EPSS scores to Fleet's vulnerability database
    • Unanswered question: Will all vulnerabilities (CVEs) have an EPSS score?
  2. Add CVSS scores to Fleet's vulnerability database

Can a member from the Platform team please file issues to track the above items? I'm happy to answer any questions or discuss the above before this happens.

A member of the interface team will be responsible for filing the issues that track UI+API changes to expose this data.

juan-fdz-hawa commented 2 years ago

  • Unanswered question: Will all vulnerabilities (CVEs) have an EPSS score?

@noahtalerman No, if I remember correctly the EPSS dataset contains around 173k scores, and there are currently around 185k CVEs
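As a rough illustration of what those approximate counts imply, the sketch below computes the share of CVEs with an EPSS score and shows one way to represent a missing score. The `epss_for` helper and the dictionary shape are hypothetical, and the CVE IDs are placeholders.

```python
from typing import Dict, Optional

# Approximate counts quoted in the comment above.
EPSS_SCORED = 173_000
TOTAL_CVES = 185_000

coverage = EPSS_SCORED / TOTAL_CVES
print(f"EPSS coverage: {coverage:.1%}")  # prints "EPSS coverage: 93.5%"


def epss_for(cve: str, scores: Dict[str, float]) -> Optional[float]:
    """Return the EPSS probability for a CVE, or None when it has no score."""
    return scores.get(cve)
```

Returning `None` rather than `0.0` keeps "no EPSS score" distinct from "very unlikely to be exploited", which matters if the Software table is ordered by this value.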

noahtalerman commented 2 years ago

Thanks! Do I read the buckets this way?

  • Year: (low priority)
    • At least one vulnerability (CVE) with low severity (CVSS score)

"The user can wait for at least a year until updating?"...

Correct.

This was a first stab at seeing if the Fleet product can enhance a common “service-level agreement (SLA)” practice we saw users/customers applying to vuln management.

Example of this practice: An organization wants to know generally how successful it was at updating/patching vulnerable software over the course of the year. Sometimes folks call this “time to remediation.” Often, this time to remediation differs according to characteristics of the vuln (how severe or impactful this vuln is).

The thinking is, eventually, Fleet buckets vulnerable software into something like “day,” “week,” “month,” and “year” priorities so that Fleet is able to tell you what the average time to remediation is for all vulnerable software in each of those buckets.

So, Fleet helps answer: were all vulnerable software items bucketed under “week” actually remediated in a week? If not, how close was the organization to accomplishing this?
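A minimal sketch of that time-to-remediation calculation, assuming each remediated item is recorded as a `(bucket, detected_on, remediated_on)` tuple. The record shape, the `SLA_DAYS` windows (30 days for "month", 365 for "year"), and the function name are illustrative assumptions, not anything Fleet has defined.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical SLA windows, in days, for each bucket.
SLA_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}


def remediation_report(records):
    """Summarize time to remediation per bucket.

    records: iterable of (bucket, detected_on, remediated_on) tuples,
    covering only items that have already been remediated.
    """
    days_by_bucket = defaultdict(list)
    for bucket, detected_on, remediated_on in records:
        days_by_bucket[bucket].append((remediated_on - detected_on).days)

    report = {}
    for bucket, days in days_by_bucket.items():
        limit = SLA_DAYS[bucket]
        report[bucket] = {
            "average_days": mean(days),
            "within_sla": sum(d <= limit for d in days),
            "total": len(days),
        }
    return report
```

For two "week" items detected on 2022-04-01 and remediated on 2022-04-05 and 2022-04-11, the report shows an average of 7 days with 1 of 2 items inside the one-week window. Items that are still unremediated would need separate handling.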

noahtalerman commented 2 years ago

Moving the following research out of the issue's description:

Notes

There seem to be two approaches to achieving the above: software/vulnerability-first and device-first.

In Q2 2022, Fleet will focus on improvements that address the software/vulnerability-first approach.

Software/vulnerability-first

Organizations with the resources for a robust process of managing vulnerabilities seem to have the following goals. These are typically large organizations with tens to hundreds of thousands of devices.

As a Fleet user, I want to...

Device-first

Organizations that don't yet have the resources for a robust process of managing vulnerabilities seem to have the following goals. These are typically small to medium-sized organizations.

As a Fleet user, I want to...

Data

CVSS scores + known exploits

One way to determine priority of updating installed software is by combining CVSS scores (available in NVD) and known exploit data.

Buckets

EPSS score

Another way to determine priority of updating installed software is by using EPSS scores.

Buckets

zhumo commented 2 years ago

@noahtalerman Can you add a mention to the vulnerability processing page in the docs that we now have CVSS and EPSS scores, and link to them?