It should be relatively straightforward to create a tool that can determine many of the code constructs that are causing problems for a specific tool. Using the expected-results full-details file and the YAML file from a generated test suite, plus the actual results for a particular tool (from BenchmarkScore), do something like:
Generate a list of every code snippet used to generate that test suite (straight from the YAML file?).
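A minimal sketch of that first step using SnakeYAML. The layout here is a guess, not the suite's confirmed schema: it assumes the YAML groups snippets under top-level `sources`, `dataflows`, and `sinks` keys, each entry carrying a `name` field.

```java
import java.io.FileReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.yaml.snakeyaml.Yaml;

public class SnippetLister {

    /** Lists every snippet named in the suite's YAML file. */
    public static List<String> listSnippets(String yamlPath) throws Exception {
        List<String> snippets = new ArrayList<>();
        try (Reader in = new FileReader(yamlPath)) {
            Map<String, Object> suite = new Yaml().load(in);
            // "sources"/"dataflows"/"sinks" and "name" are assumed key names,
            // not the generated suite's confirmed schema.
            for (String kind : List.of("sources", "dataflows", "sinks")) {
                for (Object entry : (List<?>) suite.getOrDefault(kind, List.of())) {
                    snippets.add(kind + ":" + ((Map<?, ?>) entry).get("name"));
                }
            }
        }
        return snippets;
    }
}
```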
Create a bidirectional data structure like so (a sketch covering this structure and the passes below follows the checklist):
[ ] Create a data structure for each code snippet that has links to every test case that uses it.
[ ] Create a data structure for each test case that has links to every code snippet used in it.
[ ] Pass 1: Go through each True Positive detected by the tool and mark each code snippet used in it as 'correctly understood' (i.e., both snippets, or all 3 when the test case includes a dataflow snippet).
[ ] Pass 2: Go through each test case and identify any where only 1 snippet is left unmarked as 'understood', and generate lists of the sources, dataflows, and sinks to focus on.
[ ] Sanity Check: See if there are any test cases where every snippet is marked 'understood', but the tool still reports a False Positive.
[ ] Once this is working, update the tool to automatically calculate this for every actual-results file in the /scorecard directory (i.e., do this for ALL tools).
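Putting the checklist together, here is a rough, self-contained sketch of the data structures and both passes. The `TestCase` record and its fields are hypothetical stand-ins for whatever the expected-results full-details file and BenchmarkScore's actual results actually provide.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SnippetAnalyzer {

    // Hypothetical shape of one scored test case: the snippets it was built
    // from, whether it is truly vulnerable, and whether the tool flagged it.
    record TestCase(String name, List<String> snippets,
                    boolean isVulnerable, boolean toolReported) {}

    public static void analyze(List<TestCase> testCases) {
        // Bidirectional links: snippet -> test cases, and test case -> snippets.
        Map<String, Set<String>> casesBySnippet = new HashMap<>();
        Map<String, List<String>> snippetsByCase = new HashMap<>();
        for (TestCase tc : testCases) {
            snippetsByCase.put(tc.name(), tc.snippets());
            for (String s : tc.snippets()) {
                casesBySnippet.computeIfAbsent(s, k -> new HashSet<>()).add(tc.name());
            }
        }

        // Pass 1: every snippet in a True Positive the tool detected is
        // presumed 'correctly understood'.
        Set<String> understood = new HashSet<>();
        for (TestCase tc : testCases) {
            if (tc.isVulnerable() && tc.toolReported()) {
                understood.addAll(tc.snippets());
            }
        }

        // Pass 2: a test case with exactly one not-yet-understood snippet
        // points directly at a problem source, dataflow, or sink.
        Set<String> suspects = new HashSet<>();
        for (TestCase tc : testCases) {
            List<String> unknown = tc.snippets().stream()
                    .filter(s -> !understood.contains(s)).toList();
            if (unknown.size() == 1) {
                suspects.add(unknown.get(0));
            }
        }
        for (String s : suspects) {
            System.out.println("Focus on " + s + ", used by " + casesBySnippet.get(s));
        }

        // Sanity check: every snippet understood, yet the tool reports a
        // False Positive anyway -- the per-snippet model can't explain these.
        for (TestCase tc : testCases) {
            if (understood.containsAll(snippetsByCase.get(tc.name()))
                    && !tc.isVulnerable() && tc.toolReported()) {
                System.out.println("Unexplained False Positive: " + tc.name());
            }
        }
    }
}
```

Scaling this to ALL tools should then just be a matter of looping over each actual-results file found in /scorecard and re-running analyze() per tool.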
I was thinking this might require multiple analysis phases, but I think that's it?
Stage 2:
[ ] Do something similar to detect False Positive problem areas, but by analyzing the findings tools report as TPs that are actually FPs (see the sketch below).
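One way Stage 2 could mirror the passes above (again a guess at the approach, written as a companion method for the SnippetAnalyzer sketch and reusing its hypothetical `TestCase` record): seed the 'understood' set from True Negatives, i.e. safe test cases the tool correctly left alone, then look for False Positives explainable by a single unrecognized snippet.

```java
// Hypothetical Stage 2 companion to analyze() above (same TestCase record).
public static void analyzeFalsePositives(List<TestCase> testCases) {
    // Seed from True Negatives: safe cases the tool correctly left alone.
    Set<String> understoodSafe = new HashSet<>();
    for (TestCase tc : testCases) {
        if (!tc.isVulnerable() && !tc.toolReported()) {
            understoodSafe.addAll(tc.snippets());
        }
    }
    // A case reported as a TP that is actually an FP, with exactly one
    // unrecognized snippet, names a likely trigger construct.
    for (TestCase tc : testCases) {
        if (!tc.isVulnerable() && tc.toolReported()) {
            List<String> unknown = tc.snippets().stream()
                    .filter(s -> !understoodSafe.contains(s)).toList();
            if (unknown.size() == 1) {
                System.out.println("Likely FP trigger: " + unknown.get(0));
            }
        }
    }
}
```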