fortify / FortifyVulnerabilityExporter

Export Fortify vulnerability data to GitHub, GitLab, SonarQube and more

Fortify SCA SARIF inaccuracy causing poor GitHub Code Scanning experience #72

Open simon-engledew opened 7 months ago

simon-engledew commented 7 months ago

👋 Hello! Not sure if this is the right place to raise this issue, but we've noticed that the way Fortify SCA is generating SARIF documents is causing a bad user experience with GitHub Code Scanning.

Code Scanning expects that rule metadata will be shared many times between different runs of a Code Scanning tool. A rule should represent a capability of the tool, not information about any specific finding. Information that is scan-specific should be included in the results message field instead (e.g. file paths, container checksums, etc.).
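For illustration, the split we'd expect looks roughly like this (a minimal, made-up SARIF fragment; the rule id, descriptions and paths are invented for the example):

```json
{
  "rules": [
    {
      "id": "example/sql-injection",
      "shortDescription": { "text": "SQL Injection" },
      "fullDescription": { "text": "Untrusted data is used to build a SQL query. This text is generic and identical for every finding of this type." },
      "help": { "text": "Use parameterized queries instead of string concatenation." }
    }
  ],
  "results": [
    {
      "ruleId": "example/sql-injection",
      "message": { "text": "Untrusted input from request parameter 'name' reaches the query built in src/login.js." },
      "locations": [ { "physicalLocation": { "artifactLocation": { "uri": "src/login.js" }, "region": { "startLine": 42 } } } ]
    },
    {
      "ruleId": "example/sql-injection",
      "message": { "text": "Untrusted input from request parameter 'id' reaches the query built in src/orders.js." },
      "locations": [ { "physicalLocation": { "artifactLocation": { "uri": "src/orders.js" }, "region": { "startLine": 17 } } } ]
    }
  ]
}
```

Here a single generic rule is referenced by every result, and anything scan-specific (parameter names, file paths) lives only in each result's message and location.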

Fortify SCA appears to be generating large numbers of rules, each one with unique alert-specific information in the help text. Possibly due to the configuration in these files?

rsenden commented 7 months ago

@simon-engledew, thanks for reporting this. I'm aware that the output isn't strictly following the expectation of what kind of data is supplied in the rules element and what kind of data is supplied in the results element.

The primary cause is not necessarily in FortifyVulnerabilityExporter itself, but in how Fortify SSC/FoD expose vulnerability data in the REST API used by this integration. When I originally developed this, generating a separate rule for every finding seemed like the best solution and didn't appear to have any significant impact on user experience.

Can you give some specific examples / screenshots as to how the current approach is negatively impacting user experience? Based on that, I can research whether there may be better ways to represent rules and results.

However, again, this is dependent on the data provided by the SSC/FoD REST API. I don't think the current APIs return generic, issue-agnostic Fortify category descriptions, so if we are to generate issue-agnostic rules, most likely each rule would contain only id and shortDescription fields, and no fullDescription, help or properties fields.
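In other words, each generated rule would likely be reduced to something as minimal as this (a sketch only, with a made-up id and text):

```json
{
  "id": "example-fortify-category-id",
  "shortDescription": { "text": "SQL Injection" }
}
```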

simon-engledew commented 7 months ago

Thanks for the explanation! That totally tracks with what I am seeing. 👍

For context, I work at GitHub, hi! 👋 😅 – we have various internal limits to stop Code Scanning tools from breaking UI elements / internal endpoints, and Fortify SCA is tripping them as it generates so many more rules than anything else 😬 😂. Things don't look too broken, but we are discarding rules to keep pages loading. 🙇

In the short term, if you could drop fullDescription/help/properties from rules and put some or all of that information in the results message field, that would avoid creating so many unique rules. ❤️

rsenden commented 7 months ago

Thanks for the additional info. So, if you're discarding rules, I guess users may not see the bottom block (tool/rule-id and description) from the screenshot below if we report many issues? At what point does GitHub start discarding rules?

If I reduce/de-duplicate the rules by removing issue-specific information, obviously rule IDs for existing issues will change. Will this have any impact on issues/rules reported by previous runs of this utility on an existing repository, other than having a different layout?

As in the example below, the text that's currently displayed in the rule block may be fairly long, so I don't think it would provide good user experience if we move all this information to the results message field, as that's shown inline with code snippets. Does the text block that shows the results message field also show a 'Show More' link if the text is too long, or will it always show the full text?

If there's no text to be shown in the rule block, will GitHub still show that block? I think it would look a bit weird to show an empty rule block.

[screenshot]

simon-engledew commented 7 months ago

So, if you're discarding rules, I guess users may not see the bottom block (tool/rule-id and description) from the screenshot below if we report many issues?

Funnily enough that's not actually a problem! 😂 It's more the ancillary things like API calls and rule selector drop-downs - we also have some internal indexing that's returning truncated results.

At what point does GitHub start discarding rules?

IIRC we have an internal limit of 1,000. Fortify SCA is on about 300,600. 😂

rule IDs for existing issues will change. Will this have any impact on issues/rules reported by previous runs of this utility on an existing repository

The rule ID does need to line up with the previous analysis in order for the alert to be classified as the same, yeah. 😞

If there's no text to be shown in the rule block, will GitHub still show that block? I think it would look a bit weird to show an empty rule block.

It will show the block with the text "No rule help". Not ideal. 😞


The main problem is the large number of unique SARIF identifiers. 😱 Ultimately it is our responsibility to make this work regardless of what SARIF we receive but at some point we may have to start cleaning up data or redacting rules. 🤔

It sounds like there isn't much you can do given the output the tool gives you?

rsenden commented 7 months ago

@simon-engledew, thanks again for the info.

Given that number of 300,600, I guess this means that you're storing rules across repositories rather than per-repository?

I was expecting that both rules and results elements would be processed per repository, and thus the large number of rules would potentially only be an issue for repositories with many issues (even though GitHub has a limit of 1000 issues per SARIF file). If you collect rules across repositories, I can see why that might lead to issues if we generate a new rule for every issue ;)

I've been looking in more detail at the data that's being returned by our product REST endpoints:

Ideally I'd like to keep SARIF output for both integrations the same though.

Even if we were able to generate issue-agnostic SARIF rules, there's also a lot of issue-specific information that we're currently showing. I'm not sure whether it's viable to move all this information to the results message field, given that this is shown inline with the code on GitHub. Do you have any suggestions as to how to best handle this?

Side note, I'm migrating this functionality to another multi-purpose CLI tool and eventually FortifyVulnerabilityExporter will be deprecated. So, if I were to implement any changes based on this issue, they will probably go into the other tool:

simon-engledew commented 7 months ago

I guess this means that you're storing rules across repositories rather than per-repository?

Yeah, it's a lot of data so we have to deduplicate the storage of rule text.

Do you have any suggestions as to how to best handle this?

No, sorry; I think fundamentally SARIF is quite tailored to the way CodeQL wants to send results. 😞 Your best bet is to take a look at how CodeQL structures its rules and messages and try to mimic that as much as possible. 😬

Here is an example of CodeQL's rule help for an integer overflow issue: https://github.com/github/codeql/blob/a9bab18804e28159c81f6ab7b083df53b58f367f/go/ql/src/Security/CWE-681/IncorrectIntegerConversionQuery.qhelp#L5

That ends up with this nice (fairly general) help document that is true for all occurrences of an alert result:

[screenshot]

and a fairly short result message:

[screenshot]

Side note, I'm migrating this functionality to another multi-purpose CLI tool and eventually FortifyVulnerabilityExporter will be deprecated. So, if I were to implement any changes based on this issue, they will probably go into the other tool:

That's helpful context, thanks. ❤

rsenden commented 7 months ago

Thanks again for the info. In theory, we should be able to generate issue-agnostic rule descriptions, similar to what's shown here, for example.

As mentioned, the problem is that the products that FortifyVulnerabilityExporter integrates with don't expose these issue-agnostic descriptions (not at all for SSC, somewhat difficult to extract for FoD). I've already raised this concern with the respective product managers, but even if they agree to implement features to support this use case, it will probably take a considerable amount of time before this becomes available.

As an alternative, I'll investigate whether it's possible to get the issue-agnostic rule descriptions from another source like VulnCat (or the data behind VulnCat), but currently VulnCat doesn't offer any API for this purpose.

Until then, I only see two options: leave things as they are, or use empty rule descriptions (at least for the SSC integration). The latter, however, would have a significant impact on user experience.

I'll keep you posted on any progress.

Edit: side question, how does GitHub handle changes to rule descriptions over time? Does GitHub potentially store multiple descriptions for the same rule id (showing the description that matches the contents of the SARIF file for a particular repository), or will it only store the first/last submitted rule description?

simon-engledew commented 6 months ago

Does GitHub potentially store multiple descriptions for the same rule id (showing the description that matches the contents of the SARIF file for a particular repository), or will it only store the first/last submitted rule description?

We store the changes over time. 👍

rsenden commented 6 months ago

Hi @simon-engledew, I'm still researching options to improve our SARIF output, and have one additional question for now.

Our issue-specific rule descriptions currently include an issue-specific link back to our portal, allowing users to view additional issue details for an individual issue. If we were to generate issue-agnostic rules (as they're meant to be :wink:), where should this link go? Will GitHub properly render any links in results[].message.text, or is there some other SARIF property where we can store the issue-specific links in such a way that GitHub presents them as a clickable link to the user?

Thanks, Ruud

simon-engledew commented 6 months ago

Will GitHub properly render any links in results[].message.text

I think you might need to use results[].message.markdown, but yeah! If you include a markdown link then it will be clickable:

[screenshot]
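For what it's worth, a result message can carry both fields, along these lines (the values are purely illustrative, and whether GitHub prefers text or markdown is exactly the open question here):

```json
"message": {
  "text": "SQL Injection found; see https://my-portal.example.com/issues/12345 for details.",
  "markdown": "SQL Injection found; [view issue details](https://my-portal.example.com/issues/12345)."
}
```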

rsenden commented 6 months ago

@simon-engledew, thanks! According to https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/sarif-support-for-code-scanning#result-object, only results[].message.text is supported; I guess this documentation may need to be updated, then, to state that results[].message.markdown is also supported :wink:

simon-engledew commented 6 months ago

I think the SARIF spec is right so it's quite likely I've either a) made a mistake or b) we support it as a happy accident. 😅

Looking through a big old folder of example SARIF files I don't see any that use results[].message.markdown so I think it's the former.

That said, I definitely do see the link being rendered when I send some markdown so maybe both things are true. 😂

rsenden commented 6 months ago

@simon-engledew Not sure I understand, do you mean that results[].message.text may contain Markdown links? Can you share an example SARIF file/snippet that includes some links that get properly rendered on GitHub? Saves me from trial and error 😉

simon-engledew commented 6 months ago

Sure!

I've modified a CodeQL SARIF file and added a link into `results[].message.text`.

```json
{
  "$schema" : "https://raw.githubusercontent.com/oasis-tcs/sarif-spec/master/Schemata/sarif-schema-2.1.0.json",
  "version" : "2.1.0",
  "runs" : [ {
    "tool" : {
      "driver" : {
        "name" : "CodeQL",
        "organization" : "GitHub",
        "semanticVersion" : "2.0.0",
        "rules" : [ {
          "id" : "js/unused-local-variable",
          "name" : "js/unused-local-variable",
          "shortDescription" : {
            "text" : "Unused variable, import, function or class"
          },
          "fullDescription" : {
            "text" : "Unused variables, imports, functions or classes may be a symptom of a bug and should be examined carefully."
          },
          "defaultConfiguration" : {
            "level" : "note"
          },
          "properties" : {
            "tags" : [ "maintainability" ],
            "kind" : "problem",
            "precision" : "very-high",
            "name" : "Unused variable, import, function or class",
            "description" : "Unused variables, imports, functions or classes may be a symptom of a bug\n and should be examined carefully.",
            "id" : "js/unused-local-variable",
            "problem.severity" : "recommendation"
          }
        } ]
      }
    },
    "results" : [ {
      "ruleId" : "js/unused-local-variable",
      "ruleIndex" : 0,
      "message" : {
        "text" : "Unused variable [foo](https://github.com/)."
      },
      "locations" : [ {
        "physicalLocation" : {
          "artifactLocation" : {
            "uri" : "main.js",
            "uriBaseId" : "%SRCROOT%",
            "index" : 0
          },
          "region" : {
            "startLine" : 2,
            "startColumn" : 7,
            "endColumn" : 10
          }
        }
      } ],
      "partialFingerprints" : {
        "primaryLocationLineHash" : "39fa2ee980eb94b0:1",
        "primaryLocationStartColumnFingerprint" : "4"
      }
    } ],
    "newlineSequences" : [ "\r\n", "\n", "\u2028", "\u2029" ],
    "columnKind" : "utf16CodeUnits",
    "properties" : {
      "semmle.formatSpecifier" : "sarif-latest"
    }
  } ]
}
```

Specifically:

"message" : {
        "text" : "Unused variable [foo](https://github.com/)."
      },

I've uploaded it to a test repository and I can use the link:

[screenshot]

rsenden commented 6 months ago

Great, thanks! According to the SARIF specification, results[].message.text should only contain plain text, not Markdown, but if this works, it's fine for me :smile:

rsenden commented 6 months ago

Hi @simon-engledew, can you please review the SARIF files in this repository (and associated GitHub Code Scanning alerts if you have access) to verify that these meet GitHub expectations?

Note that different Fortify rules may generate the same types of results, so there may be some duplication in rule descriptions. For example, the fod.sarif file in the aforementioned repository lists rule IDs 10951BF4-F239-4F1D-8ADB-B12DE560960F and CEA38018-69EE-45F9-8CD4-3FE97EC8F628, which both have the same description.
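Schematically, the duplication looks something like this (the description text below is a placeholder, not the actual Fortify category text):

```json
"rules": [
  { "id": "10951BF4-F239-4F1D-8ADB-B12DE560960F", "shortDescription": { "text": "Example category" }, "fullDescription": { "text": "Identical generic description shared by both rule IDs." } },
  { "id": "CEA38018-69EE-45F9-8CD4-3FE97EC8F628", "shortDescription": { "text": "Example category" }, "fullDescription": { "text": "Identical generic description shared by both rule IDs." } }
]
```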

As mentioned before, I won't be updating SARIF output for FortifyVulnerabilityExporter as we plan on deprecating this tool (and a proper fix would be difficult to implement given the current tool architecture). We'll advise customers to migrate to fcli once we release similar export functionality. For reference, the SARIF files referenced above were generated using an fcli development version, based on these yaml files:

Side question 1: GitHub previously allowed importing a maximum of 1000 results (and would throw an error if a SARIF file contained more than 1000 results), but it looks like more results are now supported: https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/sarif-support-for-code-scanning#validating-your-sarif-file. Can you please confirm?

Side question 2: You mentioned that rule descriptions get deduplicated across all repositories on GitHub. Based on some quick tests, I assume you deduplicate based on description hashes, and not based on other SARIF properties like rule id, correct?

Or, in other words, using a SARIF file like this one, it wouldn't be possible to 'hijack' rule descriptions (with the same rule ID and tool properties) in other repositories, correct?

[screenshot]

simon-engledew commented 4 months ago

Sorry for taking so long to get back to you! 🙇 😓

can you please review the SARIF files in this repository (and associated GitHub Code Scanning alerts if you have access) to verify that these meet GitHub expectations?

I can already see that you only use 45 rules for ~800 alerts so that looks like a big improvement 😂

I guess the question is, do you generate the same rule between multiple uploads? It looks like there have been 77 uploads (2 of which have alerts) and 45 unique rules across 818 alerts, but there is no overlap at all between the two analyses. I would expect that both analyses should share the same rules if they are describing the same alerts. (This might just be because it's test data though.)

Looks like your 'more information' link points to localhost? I figure you already know that but I thought I would mention it. 😁

Side question 1: GitHub previously allowed importing a maximum of 1000 results (and would throw an error if a SARIF file contained more than 1000 results), but it looks like more results are now supported

Yup! Looks like the default limit is currently 5000.

You mentioned that rule descriptions get deduplicated across all repositories on GitHub. Based on some quick tests, I assume you deduplicate based on description hashes, and not based on other SARIF properties like rule id, correct?

it wouldn't be possible to 'hijack' rule descriptions (with same rule id and tool properties) in other repositories, correct?

Yes 👍