Closed ziadhany closed 2 weeks ago
it would be really great to see a review of this PR soon!
@ziadhany this generally looks good to me, can we have some logs for the pipeline and some UI and API screenshots so we can merge it.
@TG1999 @keshav-space Thank you for reviewing this PR. Here are the pipeline logs:
/home/ziad-hany/PycharmProjects/vulnerablecode/venv/bin/python /home/ziad-hany/PycharmProjects/vulnerablecode/manage.py improve localhost:8000 --all
Improving data using compute_package_risk
INFO 2024-10-28 13:10:07.074 Pipeline [ComputePackageRiskPipeline] starting
INFO 2024-10-28 13:10:07.074 Step [add_risk_package] starting
INFO 2024-10-28 13:10:07.755 Calculating risk for 84,396 affected package records
INFO 2024-10-28 13:14:57.360 Progress: 10% (8440/84396) ETA: 2601 seconds (43.4 minutes)
INFO 2024-10-28 13:17:30.519 Progress: 20% (16880/84396) ETA: 1768 seconds (29.5 minutes)
INFO 2024-10-28 13:19:28.477 Progress: 30% (25319/84396) ETA: 1307 seconds (21.8 minutes)
INFO 2024-10-28 13:22:06.047 Progress: 40% (33759/84396) ETA: 1076 seconds (17.9 minutes)
INFO 2024-10-28 13:25:15.062 Progress: 50% (42198/84396) ETA: 907 seconds (15.1 minutes)
INFO 2024-10-28 13:30:57.706 Progress: 60% (50638/84396) ETA: 833 seconds (13.9 minutes)
INFO 2024-10-28 13:36:04.109 Progress: 70% (59078/84396) ETA: 667 seconds (11.1 minutes)
INFO 2024-10-28 13:39:06.350 Progress: 80% (67517/84396) ETA: 434 seconds (7.2 minutes)
INFO 2024-10-28 13:42:57.214 Progress: 90% (75957/84396) ETA: 219 seconds (3.6 minutes)
INFO 2024-10-28 13:48:50.555 Progress: 100% (84396/84396)
INFO 2024-10-28 13:48:50.694 Successfully added risk score for 84,396 package
INFO 2024-10-28 13:48:50.721 Step [add_risk_package] completed in 2324 seconds (38.7 minutes)
INFO 2024-10-28 13:48:50.721 Pipeline completed in 2324 seconds (38.7 minutes)
Process finished with exit code 0
@ziadhany @TG1999 Could we make a push on this one as this is a key addition for the VEX support in DejaCode. Also, make sure it is available everywhere in the relevant API endpoint as we need to capture this data.
@tdruez @keshav-space, I've completed the requested changes and I hope we can move forward and merge this PR.
Additionally, please review the weight_config.json
file to ensure an accurate view of the risk. The default weight is set to one
and may need adjustment to reflect typical risk values if we plan to use it in Dejacode.
note that @keshav-space is going to create a list of unique sources for the severity score, which we will use to expand the entries in the weight-config.json file. Please put suggested weights in the file. We probably also need a default.
list of unique sources for the severity score,
@ziadhany
👇 Below is the list of unique source for severity score in our public VCIO. From https://nvd.nist.gov
we have 339,765 severity scores, from https://api.first.org
we have 261,109 severity scores and so on. Now for each of these different source we need assign weight.
Below is the list of unique source for severity score in our public VCIO. From
https://nvd.nist.gov
we have 339,765 severity scores, fromhttps://api.first.org
we have 261,109 severity scores and so on. Now for each of these different source we need assign weight.
From the list in https://github.com/aboutcode-org/vulnerablecode/pull/1593#issuecomment-2457749729 I think it is difficult to assign a weight based on a heuristic, but let's try!
the weight could be derived for now from:
the number of entries: more likely means this is more authoritative. We should have a few "bands": 10 vs. 50 score entries for a source is not significant and 200K vs. 1 score entries for a source does not mean that the more prolific source is 200K times better that the one providing only one score.
the default should be 5/100 for now.
above 1000: 9/10
between 1000 and 500: 8/10
between 500 and 100: 7/10
between 100 and 30: 6/10
anything else, and unknown, default to: 5/10
(Note: but at the same time, less entries may means more specific, may be more accurate, so this is really a first shot, mostly guessing and random)
@pombredanne You're right; it does seem challenging to assign a weight based on a heuristic. We can start with this approach as a workaround, but even if we invest more time assigning it manually, it may still introduce bias. This is just the first shot, though.
And please let me know if anyone has any other ideas, and feel free to edit the weight configuration list as needed so I can use it in the code.
And please let me know if anyone has any other ideas, and feel free to edit the weight configuration list as needed so I can use it in the code.
@ziadhany this looks good to start.
@ziadhany please update the changelog. Also, IMO we should use weight_config.yml
instead of JSON file for configuration, like we do in our other projects.
Re: "IMO we should use weight_config.yml instead of JSON file for configuration, like we do in our other projects." I am fine with that alternative.
Re: "IMO we should use weight_config.yml instead of JSON file for configuration, like we do in our other projects." I am fine with that alternative.
But why having a config file? IMHO this should just be some Python data in a .py file. This is not meant to be updated by anyone but us for now.
I think that someone running their own installation of VulnerableCode should be able to adjust the weight assignments based on their own experience. The idea of a config file has been in the design document for quite a while now and we need to move things forward to get this done.
ok, let's go with .py if that is the simplest
We should keep local configuration in the plan but custom configuration would add unnecessary complexity at this early stage.
Thanks @ziadhany, merging this now!
@ziadhany or @keshav-space could you please post some screenshots here that show the display of the new values in VCIO? Thanks! Also please confirm that all 3 values will be available in the V2 API.
@DennisClark We store the value of risk_score
in a column and currently display it in the API. However, Exploitability
and Weighted Severity
are still missing. We already have the logic to calculate these metrics (Exploitability, Weighted Severity, and Risk), and we can easily implement it. However, we need to decide whether to store exploitability
and weighted_severity
in the database or calculate them dynamically with each API call. If we choose to store them, we must also determine whether to store them at the Vulnerability
level or the Package
level.
@ziadhany I think your should store the values of exploitability
and weighted_severity
at the same level as risk_score.
@ziadhany @DennisClark
If we choose to store them, we must also determine whether to store them at the Vulnerability level or the Package level.
Looking at https://github.com/aboutcode-org/vulnerablecode/pull/1593/files
I see 4 values being computed:
get_weighted_severity
get_exploitability_level
compute_vulnerability_risk
compute_package_risk
We need to clarify if those values are computed/defined for a Vulnerability entry of for a Package (combination of Vulnerabilities).
This matters a lot in the presentation of the data, depending on the context, one may view it through 1 Package, or through 1 vulnerability that affects many Packages.
At the moment, only the Package.risk_score
is stored and displayed (Package model and API).
I would suggest that the Vulnerability.risk_score
should also be stored and available in the API so the value can be used when looking at the list of Vulnerabilities.
Now, for weighted_severity
and exploitability_level
, looking at the code or at the design document, it is not clear if those values are qualifying a Package, a Vulnerability, or both.
@DennisClark @tdruez
Yes, we can store a vulnerability risk. However, weighted_severity
and exploitability_level
are specific to each vulnerability and are computed using the exploits
, references
, and severities
associated with it.
If we plan to store the risk_score for each vulnerability while also storing and displaying the weighted_severity
and exploitability_level
, it will lead to data duplication.
I suggest storing only the weighted_severity
and exploitability_level
for each vulnerability. This way, calculating the vulnerability risk
and package risk
becomes a straightforward mathematical operation.
However, if we don't mind some data duplication in exchange for better performance, we could store the following:
exploitability_level
, weighted_severity
, and vulnerability risk
in the vulnerability model
package risk
in the package model (already exists).
Please let me know exactly what needs to be done so we can quickly create a pull request for it.
If we plan to store the risk_score for each vulnerability while also storing and displaying the weighted_severity and exploitability_level, it will lead to data duplication.
@ziadhany could you clarify the "data duplication", maybe with an example? I'm not sure to get what would be duplicated here.
@tdruez Let's say we have two vulnerabilities, both affecting the package.
We are going to calculate the weighted_severity and exploitability_level forVulnerability 1
. We get values of 2 and 3, respectively.
We did the same for Vulnerability 2
and get 4 , 2.
The risk for Vulnerability 1: ( 2 3 ) = 6
The risk for Vulnerability 2: ( 4 2 ) = 8
The package risk is the maximum risk of the vulnerabilities affecting this package, which is 8
.
If we store the weighted_severity and exploitability_level, there’s no need to store the vulnerability risk (as it would be redundant data). Similarly, there’s no need to store the package risk since it can be easily calculated as the maximum of the vulnerability risks.
If we plan to store the risk_score for each vulnerability while also storing and displaying the weighted_severity and exploitability_level, it will lead to data duplication.
I still don't see the duplication storing the risk_score for each vulnerability.
The risk for Vulnerability 1: ( 2 * 3 ) = 6
and The risk for Vulnerability 2: ( 4 * 2 ) = 8
in your example: both have a different risk_score
that is useful to order and filter a Vulnerability listing outside a Package context.
If we store the weighted_severity and exploitability_level, there’s no need to store the vulnerability risk (as it would be redundant data).
Could you clarify this? Why is it redundant?
@tdruez I mean, if we create two columns for weighted_severity
and exploitability_level
, there’s no need to have a vulnerability_risk
column because it is the product of the two columns.
and there might be no need for a package risk column since it represents the maximum of the vulnerability risk
value that affect this package .
But, as you mentioned, it can be useful for ordering and filtering a vulnerability listing, and it can lead to better performance.
@ziadhany Alright, we should start by adding weighted_severity
and exploitability_level
as fields of the Vulnerability model in any case., right?
@ziadhany Alright, we should start by adding
weighted_severity
andexploitability_level
as fields of the Vulnerability model in any case., right?
Alright, I'll create a quick PR for this.
@ziadhany I think that the suggestions from @tdruez make more sense than my previous remarks about putting all 3 fields at the same level. Thanks for all the analysis.
issue: #1543