malexmave closed this issue 2 years ago
@malexmave did you configure the de-duplication for the Generic parser to DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL?
Like for example:
https://github.com/DefectDojo/django-DefectDojo/blob/a2f75d4df79b770af8c02a07a324529e2aa958c3/dojo/settings/settings.dist.py#L1113
Since you are filling the unique_id_from_tool attribute, that could be a better fit for your use case. It would fix the de-duplication in many use cases.
I did not change anything about the configuration, it happens if I pull the current main branch of the repo and set everything up unchanged. Checking the settings.dist.py, it looks like there is no explicit strategy defined for the semgrep parser, so it seems to use whatever the default strategy is (which appears to be problematic).
I do not quite understand your question: are you asking whether I made changes that might explain why it is broken, or whether the change you are proposing fixes the issue? If it is the latter, I can give it a try tomorrow.
If you are using this file for your tests https://gist.github.com/malexmave/a2f1b8d266e8ce6dadb7c31859be16d9 you can try to configure the de-duplication for the Generic parser to DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL.
So something like this, added to dojo/settings/settings.dist.py or (better) to your local settings file local_settings.py:
DEDUPLICATION_ALGORITHM_PER_PARSER['Generic Findings Import'] = DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL
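In a local_settings.py override, the same line might look like this (a sketch — it assumes local_settings.py is evaluated after settings.dist.py, so that the DEDUPLICATION_ALGORITHM_PER_PARSER dict and the DEDUPE_* constants are already defined):

```python
# local_settings.py — hypothetical override sketch.
# Assumes this file is loaded after settings.dist.py, so both names below
# already exist in scope.
DEDUPLICATION_ALGORITHM_PER_PARSER['Generic Findings Import'] = DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL
```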
There are no default deduplication settings for the Semgrep parser. It's an easy fix, which I will submit later. But we have to discuss whether we should change the defaults.
I can confirm that adding DEDUPLICATION_ALGORITHM_PER_PARSER['Generic Findings Import'] = DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL to settings.dist.py fixes the import issue for the second example file using the generic importer, and that #5317 fixes the issue with the semgrep import. I am not sure the first solution is one we can easily recommend to all users of secureCodeBox, though, since it may affect other imports differently (for example, if no unique ID from tool is set in them). We will have to think about how to address that part.
@malexmave we will discuss that with the other maintainers. Maybe the Generic parser needs a special de-duplication algorithm that takes into account the special case of a provided unique_id_from_tool attribute.
@valentijnscholten @StefanFl what do you think about this idea of having the de-duplication algorithm act differently depending on whether a value for the unique_id_from_tool attribute is provided?
I don't think we have to do anything special. It should produce good results for the generic parser if we set DEDUPLICATION_ALGORITHM_PER_PARSER to DEDUPE_ALGO_HASH_CODE. The standard hash_code algorithm uses ['title', 'cwe', 'line', 'file_path', 'description']. This works in your example because the descriptions are different, and it will produce better results in general than what we have now.
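To illustrate why those fields help: a field-based hash gives distinct codes to two findings from the same rule as soon as any configured field differs. This is a simplified illustration of the idea, not DefectDojo's actual hash_code implementation:

```python
import hashlib

# Fields quoted above for the standard hash_code algorithm.
HASHCODE_FIELDS = ['title', 'cwe', 'line', 'file_path', 'description']

def hash_code(finding):
    # Simplified: concatenate the configured field values and hash them.
    combined = '|'.join(str(finding.get(f, '')) for f in HASHCODE_FIELDS)
    return hashlib.sha256(combined.encode('utf-8')).hexdigest()

# Two hypothetical findings from the same rule, matching in different places.
a = {'title': 'Rule X', 'cwe': 798, 'line': 10,
     'file_path': 'a.py', 'description': 'match in a.py'}
b = {'title': 'Rule X', 'cwe': 798, 'line': 3,
     'file_path': 'b.py', 'description': 'match in b.py'}

print(hash_code(a) != hash_code(b))  # True — distinct findings, distinct codes
```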
Is this always happening, or only when deduplication is disabled? The reimport process is currently 100% based on the hash codes of the findings. Maybe, if the user has disabled deduplication, we shouldn't rely on the hash code and should instead do a field-by-field comparison? That would mean extra code and would be slower. Maybe choosing a better default would resolve the issue as well?
@valentijnscholten As far as I understand the code, the re-import is not based on hash codes. Line https://github.com/DefectDojo/django-DefectDojo/blob/a2f75d4df79b770af8c02a07a324529e2aa958c3/dojo/importers/reimporter/reimporter.py#L69 calls match_new_finding_to_existing_finding, which uses the value of DEDUPLICATION_ALGORITHM_PER_PARSER. If that is not set, it uses the legacy algorithm, which finds findings by title, test and severity.
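To show why that legacy fallback loses findings, here is a simplified, hypothetical sketch of the matching step (the function, field names and dispatch logic are illustrative assumptions, not DefectDojo's real match_new_finding_to_existing_finding):

```python
# Hypothetical sketch — NOT DefectDojo's actual implementation.
def match_new_finding(new, existing, algorithm=None):
    if algorithm == 'hash_code':
        return [f for f in existing if f['hash_code'] == new['hash_code']]
    # Legacy fallback: match on coarse fields only, so two distinct findings
    # from the same rule (same title and severity) collapse into one match.
    return [f for f in existing
            if f['title'] == new['title'] and f['severity'] == new['severity']]

existing = [
    {'title': 'Rule X', 'severity': 'High', 'hash_code': 'h1'},
    {'title': 'Rule X', 'severity': 'High', 'hash_code': 'h2'},
]
new = {'title': 'Rule X', 'severity': 'High', 'hash_code': 'h2'}

print(len(match_new_finding(new, existing)))               # legacy: 2 matches
print(len(match_new_finding(new, existing, 'hash_code')))  # hash_code: 1 match
```

With the legacy fields, both existing findings match the incoming one, so distinct findings from the same rule get merged; the hash_code algorithm keeps them apart.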
That's why I want to set the deduplication algorithm to hash_code for the generic parser for the moment and discuss in our next meeting what to do with the legacy deduplication algorithm.
Ah yes, forget my comment. Should have drunk a coffee first....
I am also facing this issue for the "Detect-secrets Scan" test type. Adding 'Detect-secrets Scan': DEDUPE_ALGO_HASH_CODE to settings.dist.py did not help.
Hi @StefanFl @valentijnscholten, I am also facing this issue for "Detect-secrets Scan". I tried adding the possible settings below in settings.dist.py, but some findings are still missing. It works fine with import-scan directly instead of reimport on an empty test case.
'Detect-secrets Scan': DEDUPE_ALGO_HASH_CODE
'Detect-secrets Scan': DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL_OR_HASH_CODE
'Detect-secrets Scan': DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL
In the detect-secrets parser, I can see the following code for calculating the dedupe key. The file, line and hashed_secret attribute values are different for each finding in my report.
dupe_key = hashlib.sha256((type + file + str(line) + hashed_secret).encode('utf-8')).hexdigest()
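For reference, a runnable version of that formula, fed with the two findings from the results file posted further down this thread (type renamed to type_ only to avoid shadowing the Python builtin):

```python
import hashlib

def dupe_key(type_, file, line, hashed_secret):
    # Same formula as the parser line quoted above.
    return hashlib.sha256(
        (type_ + file + str(line) + hashed_secret).encode('utf-8')
    ).hexdigest()

# The two findings from the report: different file and hashed_secret values.
k1 = dupe_key('Secret Keyword', 'file1.txt', 1,
              '45a9584a5a6980909b5bf031b11b9f92c48c56cb')
k2 = dupe_key('Secret Keyword', 'file2.txt', 1,
              '0040f45fc44a306157299955f97a7437a7956162')

print(k1 != k2)  # True — the parser assigns the two findings distinct keys
```

So the parser-internal de-duplication is not what collapses the two findings; the keys are distinct.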
Thanks.
Hi @BhaskiBuzz, it works with 'Detect-secrets Scan': DEDUPE_ALGO_HASH_CODE. I have tested it and wrote a PR for it.
Hi @StefanFl, I am using v2.2.1 (release mode). It is not working for me with the above setting added to settings.dist.py. Do I need to upgrade the version?
I am pretty sure it will work in 2.2.1 as well. Can you post your setting with a few lines before and after?
Hi @StefanFl, below is the setting:
'SpotBugs Scan': DEDUPE_ALGO_HASH_CODE,
'JFrog Xray Unified Scan': DEDUPE_ALGO_HASH_CODE,
'Scout Suite Scan': DEDUPE_ALGO_HASH_CODE,
'AWS Security Hub Scan': DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL,
'Meterian Scan': DEDUPE_ALGO_HASH_CODE,
'Github Vulnerability Scan': DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL,
'Detect-secrets Scan': DEDUPE_ALGO_HASH_CODE
}
DUPE_DELETE_MAX_PER_RUN = env('DD_DUPE_DELETE_MAX_PER_RUN')
DISABLE_FINDING_MERGE = env('DD_DISABLE_FINDING_MERGE')
TRACK_IMPORT_HISTORY = env('DD_TRACK_IMPORT_HISTORY')
=> Below is the content of the results file I am trying to upload. "import-scan" creates 2 findings, but "reimport-scan" on an empty test case creates only 1 finding.
{ "version": "1.1.0", "plugins_used": [ { "name": "ArtifactoryDetector" }, { "name": "AWSKeyDetector" }, { "name": "AzureStorageKeyDetector" }, { "name": "Base64HighEntropyString", "limit": 4.5 }, { "name": "BasicAuthDetector" }, { "name": "CloudantDetector" }, { "name": "GitHubTokenDetector" }, { "name": "HexHighEntropyString", "limit": 3.0 }, { "name": "IbmCloudIamDetector" }, { "name": "IbmCosHmacDetector" }, { "name": "JwtTokenDetector" }, { "name": "KeywordDetector", "keyword_exclude": "" }, { "name": "MailchimpDetector" }, { "name": "NpmDetector" }, { "name": "PrivateKeyDetector" }, { "name": "SendGridDetector" }, { "name": "SlackDetector" }, { "name": "SoftlayerDetector" }, { "name": "SquareOAuthDetector" }, { "name": "StripeDetector" }, { "name": "TwilioKeyDetector" } ], "filters_used": [ { "path": "detect_secrets.filters.allowlist.is_line_allowlisted" }, { "path": "detect_secrets.filters.heuristic.is_indirect_reference" }, { "path": "detect_secrets.filters.heuristic.is_likely_id_string" }, { "path": "detect_secrets.filters.heuristic.is_lock_file" }, { "path": "detect_secrets.filters.heuristic.is_not_alphanumeric_string" }, { "path": "detect_secrets.filters.heuristic.is_potential_uuid" }, { "path": "detect_secrets.filters.heuristic.is_prefixed_with_dollar_sign" }, { "path": "detect_secrets.filters.heuristic.is_sequential_string" }, { "path": "detect_secrets.filters.heuristic.is_swagger_file" }, { "path": "detect_secrets.filters.heuristic.is_templated_secret" } ], "results": { "file1.txt": [ { "type": "Secret Keyword", "filename": "file1.txt", "hashed_secret": "45a9584a5a6980909b5bf031b11b9f92c48c56cb", "is_verified": false, "line_number": 1, "description" : "secret=abcd" } ], "file2.txt": [ { "type": "Secret Keyword", "filename": "file2.txt", "hashed_secret": "0040f45fc44a306157299955f97a7437a7956162", "is_verified": false, "line_number": 1, "description" : "secret=xyz" } ] }, "generated_at": "2021-11-23T07:36:45Z" }
Your file worked for me; I get a re-import with 2 findings from your scan. Maybe we have changed something about deduplication in between, and upgrading to the most recent version might help.
Hi @StefanFl, I can confirm that it's working fine for me after upgrading to the latest version. Thanks.
Great
@BhaskiBuzz it seems upgrading has fixed your issue. Closing it.
Description
I have a semgrep results JSON with 9 findings. Depending on the way I import it into DefectDojo, not all findings are imported correctly.
Steps to Reproduce
All instructions assume a fresh standard dockerized install without any changes to the settings.
Method 1, which will give all findings:
Method 2, which will give incomplete results:
Note that the results JSON contains 9 findings based on 5 rules. It seems that some sort of deduplication logic takes the matched rule into account, but not the place where the rule matched; thus, matches of the same rule in different places in the code are not recognized as distinct. However, the missing results are not even marked as "duplicate findings" or anything like that; they are simply discarded.
Expected behavior
I would expect both methods to give the same results.
Deployment method
Environment information
Additional context
This problem is not exclusive to the semgrep scan type. We found it because the secureCodeBox DefectDojo integration uses the second workflow: it creates product, engagement and test, and then runs a re-import instead of a regular import, because this lets it set a description on the test (which does not seem to be possible otherwise). For secureCodeBox scan types that are not supported by DefectDojo, it uses the generic parser. When running a git-repo-scanner job and importing the results, we get the same behavior: the findings file contains four repositories, which are correctly recognized on a full import (Method 1), but only one is recognized on the reimport (Method 2).