malexmave closed this issue 2 years ago
@malexmave did you configure the de-duplication for the Generic parser to DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL?
Like for example:
https://github.com/DefectDojo/django-DefectDojo/blob/a2f75d4df79b770af8c02a07a324529e2aa958c3/dojo/settings/settings.dist.py#L1113
Since you are filling the unique_id_from_tool attribute, that could be a better fit for your use case. It would fix the de-duplication in many use cases.
I did not change anything about the configuration, it happens if I pull the current main branch of the repo and set everything up unchanged. Checking the settings.dist.py, it looks like there is no explicit strategy defined for the semgrep parser, so it seems to use whatever the default strategy is (which appears to be problematic).
I do not quite understand your question: are you asking whether I made changes that might explain why it is broken, or whether the change you are proposing fixes the issue? If it is the latter, I can give it a try tomorrow.
If you are using this file for your tests https://gist.github.com/malexmave/a2f1b8d266e8ce6dadb7c31859be16d9 you can try to configure the de-duplication for the Generic parser to DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL.
So something like this, added to dojo/settings/settings.dist.py or (better) to your local settings file local_settings.py:
DEDUPLICATION_ALGORITHM_PER_PARSER['Generic Findings Import'] = DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL
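In a local_settings.py override, the same line might look like this (a sketch — it assumes local_settings.py is evaluated after settings.dist.py, so that the DEDUPLICATION_ALGORITHM_PER_PARSER dict and the DEDUPE_* constants are already defined):

```python
# local_settings.py — hypothetical override sketch.
# Assumes this file is loaded after settings.dist.py, so both names below
# already exist in scope.
DEDUPLICATION_ALGORITHM_PER_PARSER['Generic Findings Import'] = DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL
```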
There are no default deduplication settings for the Semgrep parser. It's an easy fix, which I will submit later. But we have to discuss whether we should change the defaults.
I can confirm that adding DEDUPLICATION_ALGORITHM_PER_PARSER['Generic Findings Import'] = DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL to settings.dist.py fixes the import issue for the second example file using the generic importer, and that #5317 fixes the issue with the semgrep import. I am not sure the first solution is one we can easily recommend to all users of secureCodeBox, though, since it may affect other imports differently (for example, if no unique ID from tool is set in them). We will have to think about how to address that part.
@malexmave we will discuss that with the other maintainers. Maybe the Generic parser needs a special de-duplication algorithm that takes into account the special case of a provided unique_id_from_tool attribute.
@valentijnscholten @StefanFl what do you think about this idea of having the de-duplication algorithm act differently depending on whether a value for the unique_id_from_tool attribute is provided?
I don't think we have to do anything special. It should produce good results for the generic parser if we set DEDUPLICATION_ALGORITHM_PER_PARSER to DEDUPE_ALGO_HASH_CODE. The standard hash_code algorithm uses ['title', 'cwe', 'line', 'file_path', 'description']. This works in your example because the descriptions are different, and it will produce better results in general than what we have now.
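To illustrate why those fields help: a field-based hash gives distinct codes to two findings from the same rule as soon as any configured field differs. This is a simplified illustration of the idea, not DefectDojo's actual hash_code implementation:

```python
import hashlib

# Fields quoted above for the standard hash_code algorithm.
HASHCODE_FIELDS = ['title', 'cwe', 'line', 'file_path', 'description']

def hash_code(finding):
    # Simplified: concatenate the configured field values and hash them.
    combined = '|'.join(str(finding.get(f, '')) for f in HASHCODE_FIELDS)
    return hashlib.sha256(combined.encode('utf-8')).hexdigest()

# Two hypothetical findings from the same rule, matching in different places.
a = {'title': 'Rule X', 'cwe': 798, 'line': 10,
     'file_path': 'a.py', 'description': 'match in a.py'}
b = {'title': 'Rule X', 'cwe': 798, 'line': 3,
     'file_path': 'b.py', 'description': 'match in b.py'}

print(hash_code(a) != hash_code(b))  # True — distinct findings, distinct codes
```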
Is this always happening, or only when deduplication is disabled? The reimport process is currently 100% based on the hash codes of the findings. Maybe, if the user has disabled deduplication, we shouldn't rely on the hash code and should instead do a field-by-field comparison? That would mean extra code and would be slower. Maybe choosing a better default would resolve the issue as well?
@valentijnscholten As far as I understand the code, the re-import is not based on hash codes. Line https://github.com/DefectDojo/django-DefectDojo/blob/a2f75d4df79b770af8c02a07a324529e2aa958c3/dojo/importers/reimporter/reimporter.py#L69 calls match_new_finding_to_existing_finding, which uses the value of DEDUPLICATION_ALGORITHM_PER_PARSER. If that is not set, it uses the legacy algorithm, which finds findings by title, test and severity.
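To show why that legacy fallback loses findings, here is a simplified, hypothetical sketch of the matching step (the function, field names and dispatch logic are illustrative assumptions, not DefectDojo's real match_new_finding_to_existing_finding):

```python
# Hypothetical sketch — NOT DefectDojo's actual implementation.
def match_new_finding(new, existing, algorithm=None):
    if algorithm == 'hash_code':
        return [f for f in existing if f['hash_code'] == new['hash_code']]
    # Legacy fallback: match on coarse fields only, so two distinct findings
    # from the same rule (same title and severity) collapse into one match.
    return [f for f in existing
            if f['title'] == new['title'] and f['severity'] == new['severity']]

existing = [
    {'title': 'Rule X', 'severity': 'High', 'hash_code': 'h1'},
    {'title': 'Rule X', 'severity': 'High', 'hash_code': 'h2'},
]
new = {'title': 'Rule X', 'severity': 'High', 'hash_code': 'h2'}

print(len(match_new_finding(new, existing)))               # legacy: 2 matches
print(len(match_new_finding(new, existing, 'hash_code')))  # hash_code: 1 match
```

With the legacy fields, both existing findings match the incoming one, so distinct findings from the same rule get merged; the hash_code algorithm keeps them apart.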
That's why I want to set the deduplication algorithm to hash_code for the generic parser for the moment and discuss in our next meeting what to do with the legacy deduplication algorithm.
Ah yes, forget my comment. Should have drunk a coffee first....
I am also facing this issue for the "Detect-secrets Scan" test type. Adding 'Detect-secrets Scan': DEDUPE_ALGO_HASH_CODE to settings.dist.py did not help.
Hi @StefanFl @valentijnscholten, I am also facing this issue for "Detect-secrets Scan". I tried adding the possible settings below in settings.dist.py, but some findings are still missing. It works fine with import-scan directly instead of reimport on an empty test case.
'Detect-secrets Scan': DEDUPE_ALGO_HASH_CODE
'Detect-secrets Scan': DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL_OR_HASH_CODE
'Detect-secrets Scan': DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL
In the detect-secrets parser, I can see the following code for calculating the dedupe key. The file, line and hashed_secret attribute values are different for each finding in my report.
dupe_key = hashlib.sha256((type + file + str(line) + hashed_secret).encode('utf-8')).hexdigest()
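For reference, a runnable version of that formula, fed with the two findings from the results file posted further down this thread (type renamed to type_ only to avoid shadowing the Python builtin):

```python
import hashlib

def dupe_key(type_, file, line, hashed_secret):
    # Same formula as the parser line quoted above.
    return hashlib.sha256(
        (type_ + file + str(line) + hashed_secret).encode('utf-8')
    ).hexdigest()

# The two findings from the report: different file and hashed_secret values.
k1 = dupe_key('Secret Keyword', 'file1.txt', 1,
              '45a9584a5a6980909b5bf031b11b9f92c48c56cb')
k2 = dupe_key('Secret Keyword', 'file2.txt', 1,
              '0040f45fc44a306157299955f97a7437a7956162')

print(k1 != k2)  # True — the parser assigns the two findings distinct keys
```

So the parser-internal de-duplication is not what collapses the two findings; the keys are distinct.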
Thanks.
Hi @BhaskiBuzz, it works with 'Detect-secrets Scan': DEDUPE_ALGO_HASH_CODE. I have tested it and wrote a PR for it.
Hi @StefanFl, I am using v2.2.1 (release mode). It is not working for me with the above setting added to settings.dist.py. Do I need to upgrade the version?
I am pretty sure it will work in 2.2.1 as well. Can you post your setting with a few lines before and after?
Hi @StefanFl, below is the setting:
'SpotBugs Scan': DEDUPE_ALGO_HASH_CODE,
'JFrog Xray Unified Scan': DEDUPE_ALGO_HASH_CODE,
'Scout Suite Scan': DEDUPE_ALGO_HASH_CODE,
'AWS Security Hub Scan': DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL,
'Meterian Scan': DEDUPE_ALGO_HASH_CODE,
'Github Vulnerability Scan': DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL,
'Detect-secrets Scan': DEDUPE_ALGO_HASH_CODE
}
DUPE_DELETE_MAX_PER_RUN = env('DD_DUPE_DELETE_MAX_PER_RUN')
DISABLE_FINDING_MERGE = env('DD_DISABLE_FINDING_MERGE')
TRACK_IMPORT_HISTORY = env('DD_TRACK_IMPORT_HISTORY')
=> Below is the content of the results file I am trying to upload. "import-scan" creates 2 findings, but "reimport-scan" on an empty test case creates only 1 finding.
{ "version": "1.1.0", "plugins_used": [ { "name": "ArtifactoryDetector" }, { "name": "AWSKeyDetector" }, { "name": "AzureStorageKeyDetector" }, { "name": "Base64HighEntropyString", "limit": 4.5 }, { "name": "BasicAuthDetector" }, { "name": "CloudantDetector" }, { "name": "GitHubTokenDetector" }, { "name": "HexHighEntropyString", "limit": 3.0 }, { "name": "IbmCloudIamDetector" }, { "name": "IbmCosHmacDetector" }, { "name": "JwtTokenDetector" }, { "name": "KeywordDetector", "keyword_exclude": "" }, { "name": "MailchimpDetector" }, { "name": "NpmDetector" }, { "name": "PrivateKeyDetector" }, { "name": "SendGridDetector" }, { "name": "SlackDetector" }, { "name": "SoftlayerDetector" }, { "name": "SquareOAuthDetector" }, { "name": "StripeDetector" }, { "name": "TwilioKeyDetector" } ], "filters_used": [ { "path": "detect_secrets.filters.allowlist.is_line_allowlisted" }, { "path": "detect_secrets.filters.heuristic.is_indirect_reference" }, { "path": "detect_secrets.filters.heuristic.is_likely_id_string" }, { "path": "detect_secrets.filters.heuristic.is_lock_file" }, { "path": "detect_secrets.filters.heuristic.is_not_alphanumeric_string" }, { "path": "detect_secrets.filters.heuristic.is_potential_uuid" }, { "path": "detect_secrets.filters.heuristic.is_prefixed_with_dollar_sign" }, { "path": "detect_secrets.filters.heuristic.is_sequential_string" }, { "path": "detect_secrets.filters.heuristic.is_swagger_file" }, { "path": "detect_secrets.filters.heuristic.is_templated_secret" } ], "results": { "file1.txt": [ { "type": "Secret Keyword", "filename": "file1.txt", "hashed_secret": "45a9584a5a6980909b5bf031b11b9f92c48c56cb", "is_verified": false, "line_number": 1, "description" : "secret=abcd" } ], "file2.txt": [ { "type": "Secret Keyword", "filename": "file2.txt", "hashed_secret": "0040f45fc44a306157299955f97a7437a7956162", "is_verified": false, "line_number": 1, "description" : "secret=xyz" } ] }, "generated_at": "2021-11-23T07:36:45Z" }
Your file worked for me; I get a re-import with 2 findings from your scan. Maybe we have changed something about deduplication in between, and upgrading to the most recent version might help.
Hi @StefanFl, I can confirm that it's working fine for me after upgrading to the latest version. Thanks.
Great
@BhaskiBuzz it seems upgrading has fixed your issue. Closing it.
Description
I have a semgrep results JSON with 9 findings. Depending on the way I import it into DefectDojo, not all findings are imported correctly.
Steps to Reproduce
All instructions assume a fresh standard dockerized install without any changes to the settings.
Method 1, which will give all findings:
Method 2, which will give incomplete results:
Note that the results JSON contains 9 findings based on 5 rules. It seems that some sort of deduplication logic takes the matched rule into account, but not the place where the rule matched; thus, matches of the same rule in different places in the code are not recognized as distinct. However, the missing results are not even marked as "duplicate findings" or anything like that; they are simply discarded.
Expected behavior
I would expect both methods to give the same results.
Deployment method
Environment information
Additional context
This problem is not exclusive to the semgrep scan type. We found it because the secureCodeBox DefectDojo integration uses the second workflow: it creates product, engagement and test, and then runs a re-import instead of a regular import, because this lets it set a description on the test (which does not seem to be possible otherwise). For secureCodeBox scan types that are not supported by DefectDojo, it uses the generic parser. When running a git-repo-scanner job and importing the results, we get the same behavior: the findings file contains four repositories, which are correctly recognized on a full import (Method 1), but only one is recognized on the reimport (Method 2).