Time variance of an EPSS score for a CVE

Crashedmind commented 6 months ago

Description, Use Case and User Stories There is signal in the temporal/dynamic aspects of EPSS (in addition to the threshold approach as outlined in the guide.) So we want to provide a user-centric guide on how to use this signal - and the associated use cases e.g. what's fast incoming? e.g. threat hunting, threat intelligence,...

Definition of Ready

The people who will lead this effort are identified and interested and committed.
There's a rough plan agreed.

Acceptance Criteria

User scenarios are defined for this EPSS signal
Users who represent the user scenarios are identified and provide user feedback on the chapter
The chapter will be presented at the EPSS SIG for socialization and feedback.

Additional context Several articles have been published e.g.

maorkuriel commented 6 months ago

I am suggesting calculating the time variance of an EPSS score for a CVE:

Initial EPSS Score (EPSS_initial): The EPSS score was assigned to the CVE when first assessed.
Updated EPSS Score (EPSS_updated): The EPSS score assigned to the CVE after subsequent assessments or changes in the vulnerability landscape.
Time Variance (TV): The change in the EPSS score over time.

TV=EPSS_updated−EPSS_initial

Positive values indicate an increase in the EPSS score, suggesting that the vulnerability's exploitability or severity has heightened over time. Negative values indicate a decrease in the EPSS score, indicating improvements in mitigation measures or a decrease in the perceived risk associated with the vulnerability.

FasIte commented 4 months ago

Rudy here. Gotcha:

slow for the daily flowing, "raw" eps signal (the sludge)
- fast for whatever cues can be extracted that predict up-and-comers that would hit hard

On the horizon, possibly maybe, the existence of exploitability lifecycles in the EPSS signal.

labyrinthinesecurity commented 4 months ago

Christophe here! I would like to share some initial rough findings, based on my talks with Rudy and a big data extraction (1 year of EPSS).

Here's the idea: try to leverage time variance to anticipate EPSS scores moving upwards as early as possible, to get some lead time. We start from the set of all CVEs which enjoy a signal change at some time in their history (call it t1), and then 5 days later (call it t1+5). Call t0 is the first time the CVE was tracked by EPSS.

Now, with the knowledge of t0,t1 and t1+5 we want to predict if the CVE will reach a certain EPSS score at t1+30. We are not looking for a specific numerical EPSS score, but for a 'color' category.

What's interesting here, is that if can can make a prediction on the color, then we will have a lead time of up to 25 days (the time between t1+5 and t1+30).

We train XGBoost to predict a category for any CVE based on t0, t1 and t1+5. The categories are: green, yellow, orange and red. 'red' means that the CVE will reach an EPSS score of at least 0.07 at t1+30, given t1+5 greater than 0.055 'orange' means it will reach 0.07 at t1+30 given t1+5 lower than 0.055 'yellow' means it the score at t1+30 will be below 0.07 but above 0.01 'green' means the score will be below 0.01

Why these numbers? Because for now these seem to make good predictions.

Here are the first initial results based on a sample of about 1 year of CVEs timeseries:

Green count: 40 (sampled from 16623 green CVEs with at least a t1 and a t1+5) Red count: 43 Yellow count: 40 Orange count: 40

Accuracy: 0.6363636363636364 Precision: 0.637190082644628 Recall: 0.6363636363636364 F1-score: 0.6335169880624426

Confusion Matrix: green yellow orange red green 4 0 2 1 yellow 1 3 1 1 orange 1 0 7 1 red 1 2 1 7

You will see that we can pick some non-obvious predictions for orange and red: Samples well classified as orange: [0, 9, 15, 19, 23, 25, 28] 0 0.0006,0.0037,5.0 9 0.00043,0.00917,1.0 15 0.0006,0.0037,5.0 19 0.0006,0.0037,5.0 23 0.0006,0.0037,5.0 25 0.00043,0.00091,1.0 28 0.0006,0.0037,5.0 Samples well classified as red: [7, 13, 17, 18, 21, 24, 31] 7 0.01803,0.62318,4.0 13 0.00144,0.08866,9.0 17 0.00043,0.00144,1.0 18 0.00205,0.07544,11.0 21 0.00229,0.86724,7.0 24 0.00237,0.784,5.0 31 0.00237,0.12129,2.0

Notes: 1) The F1 score is not fantastic but the correlation matrix for the red and orange categories, the two most interesting ones, doesn't look too bad.

2) There are very, very few signal changes (sparse info as Rudy calls it) even over a one year observation period... this prevents good predictions.

Crashedmind commented 4 months ago

thanks @labyrinthinesecurity / Christophe!

some initial thoughts / references:

since we are looking at a time series and predictive modeling, should we use such models over XGBoost e.g.
- https://github.com/FIRSTdotorg/Vuln4Cast has code to do this and associated paper for a related use case. Worth a read for background.
- https://www.amazon.com/Python-Finance-Cookbook-effective-financial/dp/1803243198/ is a book I've read / used for a different context - but it is a really good resource to get going with time-series models
Is volatility a useful metric or thing to measure?
- Mentioning this as the thought came to my mind as I'm familiar with this (again from a financial context).
- again this is covered in the book mentioned above.

The backdrop to my thoughts is that Financial analysis is a mature science (that I've some experience with), and so the models and indicators used there may be relevant - but either way a good place to start to see what's done there.

Also, see https://www.first.org/events/colloquia/cardiff2023/. Something useful/relevant may have come out of the FIRST Vulnerability Forecasting Technical Colloquium

labyrinthinesecurity commented 4 months ago

Thanks for the link to Vuln4Cast paper, I will read it, it's on my' iPad now.

Beyond ML prediction models, I think we have two solid options: 1/ volatility as you said, is well worth exploring at this stage, even if we dont have that many datapoints. I will explore it next week in parallel to my XGboost experiments when I get back to work. 2/ something that could help Rudy: modeling bracket transitions in the crown as a hidden markov model, it's quite easy to implement with a numpy n*n matrix, where n is the number of brackets.

Thoughts?

Le mar. 7 mai 2024 à 20:00, Crashedmind @.***> a écrit :

thanks @labyrinthinesecurity https://github.com/labyrinthinesecurity / Christophe!

some initial thoughts / references:

since we are looking at a time series and predictive modeling, should we use such models over XGBoost e.g.

https://github.com/FIRSTdotorg/Vuln4Cast has code to do this and associated paper for a related use case. Worth a read for background.

https://www.amazon.com/Python-Finance-Cookbook-effective-financial/dp/1803243198/ is a book I've read / used for a different context - but it is a really good resource to get going with time-series models

Is volatility a useful metric or thing to measure?

Mentioning this as the thought came to my mind as I'm familiar with this (again from a financial context).

again this is covered in the book mentioned above.

The backdrop to my thoughts is that Financial analysis is a mature science (that I've some experience with), and so the models and indicators used there may be relevant - but either way a good place to start to see what's done there.

— Reply to this email directly, view it on GitHub https://github.com/RiskBasedPrioritization/RiskBasedPrioritization.github.io/issues/6#issuecomment-2099005998, or unsubscribe https://github.com/notifications/unsubscribe-auth/AWOSWDRYFLG4N4JFX63BOGTZBEJEDAVCNFSM6AAAAABDKVLM76VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOJZGAYDKOJZHA . You are receiving this because you were mentioned.Message ID: <RiskBasedPrioritization/RiskBasedPrioritization.github. @.***>

Crashedmind commented 4 months ago

for 2/ sure - HMM is worth playing with - simple to get going and learning.

Here's a tutorial with code for HMM with 3 hidden states corresponding to changes of low, medium and high volatility https://medium.com/@natsunoyuki/hidden-markov-models-with-python-c026f778dfa7 (again financial context 😄 )

FasIte commented 4 months ago

The stock approach honestly I don't see (in no way it means it's incorrect): it's noisy, high-frequence where epss is quite slow, and low-frequency, almost regular; plus, it's normalized. Volatility captures the fundamental difference:

makes a lot of sense for a noisy, "frenetic" signal that changes from one day to another (downto every second it seems). Based on standard deviation, volatility has lots of datapoints to leverage and have informational value, and turn into a predictor.
where epss is concerned, 17 out of 200k vulns have more than 10 changes over a 6m window, i.e. 10+ data points on which to compute std and volatility.

Come to think of it, "volatility" looks like a rich man's tool (pun intended); EPSS is super dry comparatively.

FasIte commented 4 months ago

Still going at it descriptively, with a simplified "crown", reducing the signal in 3 phases, defined as whether or not there is more chance for the signal to Move or to Stay. Thresholds appear at .17/.83 for crossing the 50% mark.

The crown (light)

Second, the speed (Garrity asking about it, he's not here, right?): average amount of change (dEps) for positive changes (green), negative ones (red), and their balance (sum; in blue). Theoretically, you'd expect something sinusoidal (a perfect pendulum; in black): balance would be positive in the lower range of eps (0-50%) and negative in its higher (50-100%). It's almost the case:

green is indeed higher in the lower range (extended to 65% still)
red tends to be higher in the higher range, but is all broken
and actually greater than green at low regimes too (30%, 40%)

speeds

Meaning:

it's becoming more difficult to climb up than to go down above 30%.
acceleration remains constant until 70% (the green plateau starting at 10).
deceleration is a bit crazier but lasts downto 25%, starting at 85%.

Note it does not take into account the probability of moving up or down. And not so new actually, it was already present in EPSS Part 3: less negative changes than positives, but they pack more punch.

FasIte commented 4 months ago

Starting to think about going predictive, using the above-mentioned thresholds:

collecting datapoints until reaching .17 (alarm-raising)
testing how good it predicts reaching .83 (validation)

with a feature vector of <#changes, sequence of changes, sequence of intervals>, something like that

labyrinthinesecurity commented 4 months ago

I like the approach! Reaching 0.83 in more than one step is going to be super rare however, and I think a single step does’t bring a lot of’actionable insight(?)

I second you on volatility, it's going to be hard with so few data points in each series. will try something anyway

Le mer. 8 mai 2024 à 15:03, FasIte @.***> a écrit :

Starting to think about going predictive, using the above-mentioned thresholds:

collecting datapoints until reaching .17 (alarm-raising)

testing how good you predict it reaching .83 with a feature vector of <#changes, sequence of changes, sequence of intervals>, something like that

— Reply to this email directly, view it on GitHub https://github.com/RiskBasedPrioritization/RiskBasedPrioritization.github.io/issues/6#issuecomment-2100530178, or unsubscribe https://github.com/notifications/unsubscribe-auth/AWOSWDSMJ6JSYY3TCSLXH5LZBIPDBAVCNFSM6AAAAABDKVLM76VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMBQGUZTAMJXHA . You are receiving this because you were mentioned.Message ID: <RiskBasedPrioritization/RiskBasedPrioritization.github. @.***>

FasIte commented 3 months ago

Sure they will be rare those that reach .83... but then the rest does not ;) It's also meaningful

The idea is to classify a CVE reaching .83 (tau2) based on data collected at onset until reaching .17 (tau1). Some will, some won't, is there any information in onset data that enables us to predict it? How accurate is the classifier? 10% - bad? 50% - meh? 80% - nice?

Then it becomes an optimization approach on the two parameters to minimize the classifying error, using grid search typically: how soon (tau1, reactivity) can we start predicting it will reach a given value (tau2, criticality?), i.e. with what confidence? Varying tau1 (.1, .17, .2, etc...) and tau2 (>= tau1, .5, .8, .9,...), and measuring the classifying error will tell us how much information the onset contains and how good of an operational mechanism we have here, to tell users when to track and consider a CVE as a strong hitter, w/o additional security intel, just by observing/computing the raw EPSS signal...

Makes sense?

labyrinthinesecurity commented 3 months ago

I have some interesting news to share: after many disappointing models training (XGBoost, Deep learning), I have found a very simple heuristic which seems to give us some lead time:

I have noticed that, 15 days after an EPSS score raised for the first time (if it raises at all!), two situations may occur:

if this score is greater than 0.05, then the long-term EPSS score will be high (greater than 0.5) with significant probability (greater than 25%), many of which are not obvious (the long term score is much higher than 0.5)
if not (this score is lower than 0.05), then the long-term EPSS score will nearly always be low with very high probability (greater than 99.9%)

By long term, I mean in two month time.

The good news is that the CVEs which are prone to rise are not that numerous, they can be reviewed manually. They show up every 3 days or so.

This needs to be further refined, but it's the most useful "signal" I've found in temporal variance so far.

labyrinthinesecurity commented 3 months ago

It makes sense, but I fear that you will have only a couple of samples in the upper categories even if you scan a whole year or epss scores. synthetic data wont be of much help also.

Le lun. 13 mai 2024 à 10:01, FasIte @.***> a écrit :

Sure they will be rare those that reach .83... but then the rest does not ;) It's also meaningful

The idea is to classify a CVE reaching .83 based on data collected at onset until reaching .17. Some will, some won't, is there any information in onset data that enables us to predict it? How accurate is the classifier? 10% - bad? 50% - meh? 80%

nice?

Then it becomes an optimization approach on the two parameters to minimize the classifying error, using grid search typically: how soon (tau1, reactivity) can we start predicting it will reach a given value (tau2, _criticality), i.e. with what confidence? Varying tau1 (.1, .17, .2, etc...) and tau2 (>= tau1, .5, .8, .9,...), and measuring the classifying error will tell us how much information the onset contains and how good of an operational mechanism we have here, to tell users when to track and consider a CVE as a strong hitter, w/o additional security intel, just by observing/computing the raw EPSS signal...

Makes sense?

— Reply to this email directly, view it on GitHub https://github.com/RiskBasedPrioritization/RiskBasedPrioritization.github.io/issues/6#issuecomment-2106902613, or unsubscribe https://github.com/notifications/unsubscribe-auth/AWOSWDTILIKDBGBQ53QSRA3ZCBXNXAVCNFSM6AAAAABDKVLM76VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMBWHEYDENRRGM . You are receiving this because you were mentioned.Message ID: <RiskBasedPrioritization/RiskBasedPrioritization.github. @.***>

FasIte commented 3 months ago

En effet Christophe... Got the data up to today to gather one-year worth of samples but nope: the NVD trainwreck hits here too and breaks data continuity. Considering going backwards in time from [edit: March 7th], 2023 and the v3 release.

June 1st to Dec 31st: 14k CVE with their onset in the period, only 68 with 10+ changes, 10 of which reach .83 eventually. Super skinny.

Instead of classifying "reaching .83 at whatever point", I'm gonna look at "how accurately to predict the next value based on current". Gonna be quite tied up in the upcoming days though

labyrinthinesecurity commented 3 months ago

I completely agree! for historial records we are heavily constrained by two hard limits: EPSS v3 from 8th March 2023 and the NVD affair… this doesnt help the collection of significant amount of samples for prediction. As far Im concerned, Im trying to predict epss+60 days based on epss+0 days and as few other datapoints as possible, sampling from this narrow historical records window. Im not overly optimistic…

FasIte commented 3 months ago

I am a dwarf and I'm digging a hole..

Crashedmind commented 3 months ago

I was thinking to play with N-BEATS https://forecastegy.com/posts/multiple-time-series-forecasting-nbeats-python/#what-is-n-beats for financial context, but I could also apply it here for EPSS for fun to see what happens...

What's the best way to get the same dataset you're using?

FasIte commented 3 months ago

I'll need to setup my git correctly for me to PR. In the meantime here is the dataset (7th of March to 31st of December):

cveid
day (day 1 is 7th of March)
step (step 1 is onset, then step 2, etc.)
pre: epss before the change
post: epss after the change
diff: difference between pre and post

Reasonably sanitized/validated (no fullproof check, but verification of known trends and behaviors for specific CVEs - the most interesting ones).

Could use a column for amount of days between two changes, but that can be inferred

changes.0703-3112.csv.gz

labyrinthinesecurity commented 3 months ago

https://www.kaggle.com/datasets/labyrinthinesecurity/epss-timeseries

labyrinthinesecurity commented 3 months ago

(UPDATED)

What do you think of this "EPSS forecaster", guys? 1/ Asked to consider the 47300 CVEs between 2023-03-11 and 2024-05-09 from my Kaggle dataset 2/ Filter out all "uninteresting CVEs" where the score at time epss+16 is obviously high (i.e, greater than 0.66) or obviously low (lower than 0.1) => Only 67 CVEs remain 3/ Predict the score 14 days later (at epss+30) 4/ Calculate the 95% percentile interval of this score 5/ Keep only the CVEs where the upper bound of this interval is greater to or equal to 0.9

Outcome: 100% of true positive are consistently detected (that's 3 in 47300, a needle in a haystack!) 0 false negative (out of 47300 CVEs) 8 false positives (that's about one every two months, considering the extended period)

True positive = an epss+30 score which is greater than 0.65 and successfully forecast False positive = an epss+30 score which is lower than 0.65 and wrongly reported as higher than 0.66 False negative = an epss+30 score which is greater than 0.65 but missed by the forecaster

Forecaster result over extend period:

CVE-ID        Day16   Day30   95% confidence interval
CVE-2023-1698 0.59075 0.59767 (0.1870, 0.9712)
CVE-2023-33246 0.58077 0.58077 (0.1907, 0.9715)
CVE-2023-34960 0.62253 0.81793 (0.4064, 0.9424)  **TRUE POSITIVE**
CVE-2022-39986 0.60136 0.60136 (0.4801, 0.9474)
CVE-2023-36761 0.57125 0.57125 (0.1849, 0.9679)
CVE-2023-36028 0.45995 0.47939 (0.0153, 0.9193)
CVE-2023-49103 0.51754 0.51754 (0.2766, 0.9344)
CVE-2023-6553 0.45921 0.90901 (0.0166, 0.9388)  **TRUE POSITIVE**
CVE-2023-48795 0.43479 0.65657 (0.0121, 0.9232)  **TRUE POSITIVE**
CVE-2024-21644 0.44017 0.44017 (0.0142, 0.9419)
CVE-2024-23897 0.6151 0.41536 (0.4272, 0.9356)

FasIte commented 3 months ago

The angle is interesting, the 100% Recall is great, the 37.5% Precision :/

This is with the linear regression, still?

labyrinthinesecurity commented 3 months ago

Thanks for the feedback Rudy! For the precision, I have strong doubts that we will ever reach a satisfactory result ;)

No, I have given up with linear regression (even more false positives).

Here is the recipe:

Build a 2D matrix of probabilities (a Hidden Markov Model to capture daily changes by scanning all CVE changes over the period)
For each CVE to predict, run 1000 Monte Carlo simulations of 14 day evolutions using the matrix and calculate the 95% percentile interval for the 1000 runs

labyrinthinesecurity commented 3 months ago

I have improved the forecaster a little bit. The lead time, which used to be 14 days (from epss+16 to epss+30) is now 22 days (from epss+8 to epss+30). Also, I have tried to use it to make earlier predictions without much success so far.

The Precision is still not very good, but it's much better than my best XGboost models.

FasIte commented 3 months ago

Getting at it with ChatGPT4o. Feeling like a dwarf with a powerdrill...

labyrinthinesecurity commented 3 months ago

Hello guys, I've made a private github repository for my (preview) model and sent you an invite if you want to take a look/contribute. Any comment welcome!

FasIte commented 2 months ago

Sorry for the silence, have been underground but here reporting:

the powerdrill turned into an actual blueprint (managed to run some clustering on ChatGPT infra before having "sorry, technical issues, here is a piece of code" messages...)
- doubled back on the dataset to extend it: could only barely do so, as Jan-2024 forward was meh (in relation with NVD issues I guess)
- ran a quickie on "score right before change" from increasing awareness of the signal abruptness, incongruent with epss-only prediction (where @labyrinthinesecurity approach makes sense, introducing additional datapoints)
- went back to the drawing board for signal analysis, characterizing the behavior at macro scale, and identifying a possible usecase for TI analysts: look at CVEs passing pinging at [0.55 - 0.75] as the most worthy of your attention for up-and-comers ([0.15 - 0.35] as a close second) https://www.linkedin.com/pulse/epss-entropy-continental-rudy-guyonneau-phd-aitgf/

FasIte commented 2 months ago

About the quickie: "eps before crossing the X threshold" (figure below for X=0.9) From my perspective - EPSS only, signal is too abrupt for accurate modeling:

25% of the time, it springs beyond 0.9 from below 0.1;
50% of the time, from below 0.4;
- the trend is observed for all thresholds below (except at 0.7, where it is actually worse: 100% of those crossing 0.7 do so from below 0.4).

This, considered alongside the relative scarcity of datapoints, and I'm like: gonna dig a hole somewhere else.

FasIte commented 2 months ago

Off on a 3-week break after tomorrow by the way

Crashedmind commented 2 months ago

Hello guys, I've made a private github repository for my (preview) model and sent you an invite if you want to take a look/contribute. Any comment welcome!

Hi Christophe- can you resend the invite - just cathing up on this now as I was away... thanks! Chris

labyrinthinesecurity commented 2 months ago

Hi Christophe- can you resend the invite - just cathing up on this now as I was away... thanks! Chris

The private repo was obsolete so I deleted it => I am in the process of making a new repository with the latest improvements discussed in my newsletter on linkedin. Will be off for the next two weeks or so, I will work on that new repo when I'm back :)

Crashedmind commented 1 month ago

@FasIte @labyrinthinesecurity looking back at the original ticket description, we should make the great analysis you've both done into "a user-centric guide on how to use this signal" - and do the items listed in the Acceptance Criteria.

Let me know if you agree or not, or in general how you think we can make the learnings useful and useable to users.

FasIte commented 3 weeks ago

@Crashedmind sending you a User Scenario by email (CyberDefender: Identify Upcoming Vulnerabilties). Let me know if that would fit (or not).

Comment: the Narrative is constrained by the current model used to produce the EPSS score. It might be, with forthcoming releases (v4 in mind), that the attention range moves around, if not simply disappear. But the method used and described in the LI articles would allow to verify and redefine them, even increasing the resolution of eps values, if sample size permits.

RiskBasedPrioritization / RiskBasedPrioritization.github.io

Time variance of an EPSS score for a CVE #6