Review Process described in readme

jensdietrich commented 11 months ago

Review and discuss the process described in the readme that will be used to construct the oracle; also look at the schema.

mmabdpr commented 11 months ago

For each vulnerability record, we have >=1 patch commits and >=1 version ranges. How should we decide/verify the association between these two sets?

One possible mitigation is to keep only records that have either a) 1 commit and >1 version ranges (somehow that single commit resolved the vuln in multiple versions? not sure if/how that's possible), or b) >1 commits and 1 version range (we consider all changes).
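To make the filter concrete, here is a rough sketch. The `VulnRecord` shape and its field names are made up for illustration and are not the project's schema. I am also assuming the trivial 1 commit / 1 range case is kept, so the rule becomes "at least one of the two sets has exactly one element".

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VulnRecord:
    # Hypothetical record shape; field names are assumptions, not the real schema.
    id: str
    commits: List[str] = field(default_factory=list)
    ranges: List[Tuple[str, str]] = field(default_factory=list)  # (introduced, fixed)


def is_unambiguous(rec: VulnRecord) -> bool:
    """True if the commit/range association cannot be mixed up.

    Covers a) one commit with several ranges and b) several commits with
    one range; the 1/1 case passes as well (assumption).
    """
    n_commits, n_ranges = len(rec.commits), len(rec.ranges)
    if n_commits == 0 or n_ranges == 0:
        return False  # incomplete record, nothing to associate
    return n_commits == 1 or n_ranges == 1


def keep_unambiguous(records: List[VulnRecord]) -> List[VulnRecord]:
    """Drop records with several commits and several ranges."""
    return [r for r in records if is_unambiguous(r)]
```

Records with >1 commits and >1 ranges are the only ones dropped; those are the cases where we cannot tell which commit belongs to which range without further checks.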

jensdietrich commented 11 months ago

@mmabdpr I think we can trust the quality of the data sources here. Perhaps sample this for some records by looking into the commit history and cross-referencing it with a tag that corresponds to a released version.

jensdietrich commented 11 months ago

@behnazh-w does the process outlined in the readme make sense to you?

wtwhite commented 11 months ago

Hi @mmabdpr :) My thoughts below -- @jensdietrich please correct as necessary.

I think we ideally want to keep just a single "best representative" pair of versions (vulnerable, fixed) per CVE, where "best" for our purposes means the pair that has "minimum code difference" in some sense -- this avoids as much noise as possible.

To be a candidate, a version pair should be adjacent in the Maven Central Repo. There's a candidate for each range element in the GHSA that has a fixed entry: Take that as the fixed version, and for the vulnerable version, look up its immediate predecessor version (semver-wise) in the Maven Central Repo.
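A minimal sketch of the candidate construction, assuming we already have the list of versions published on Maven Central and the `fixed` versions from the GHSA ranges. The function names are hypothetical, and the ordering only handles plain dotted numeric versions such as `1.9.5`; real Maven version ordering (qualifiers like `-RC1`) is more involved and is not covered here.

```python
from __future__ import annotations

from typing import List, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Ordering key for plain dotted numeric versions (assumption: no qualifiers)."""
    return tuple(int(part) for part in version.split("."))


def candidate_pairs(released: List[str], fixed_versions: List[str]) -> List[Tuple[str, str]]:
    """Return one (vulnerable, fixed) pair per fixed version.

    The vulnerable version is the immediate predecessor of the fixed
    version among all released versions.
    """
    ordered = sorted(released, key=version_key)
    pairs = []
    for fixed in fixed_versions:
        if fixed not in ordered:
            continue  # fixed version was never published, so no candidate
        index = ordered.index(fixed)
        if index == 0:
            continue  # no predecessor, so no vulnerable version to pair with
        pairs.append((ordered[index - 1], fixed))
    return pairs
```

For example, with releases `1.9.4, 1.9.5, 1.10.0, 2.0.0, 2.0.1` and fixed versions `2.0.1` and `1.9.5`, this yields the pairs `(2.0.0, 2.0.1)` and `(1.9.4, 1.9.5)`.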

Then, from among the candidate pairs, I can think of a few ways we could estimate the minimum code difference in order to pick the best pair:

1. Just take the oldest or the newest arbitrarily (easy, might be good enough)
2. Take the pair minimising the difference in size between compiled jars
3. Take the pair minimising git diff fixed-tag vulnerable-tag | wc -l or similar (requires mapping Maven versions to git tags which is usually possible and we should probably do anyway; might be possible to do on GitHub directly)
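The selection step could look roughly like the sketch below, with the "code difference" estimate passed in as a cost function so the options above are interchangeable. All names here are hypothetical. `git_diff_cost` assumes a local clone and a version-to-tag mapping that we would have to build separately; it is untested and shown only to illustrate how the third option plugs in.

```python
from __future__ import annotations

import subprocess
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (vulnerable version, fixed version)


def pick_best_pair(pairs: List[Pair], cost: Callable[[Pair], float]) -> Pair:
    """Return the candidate pair with the smallest estimated code difference."""
    if not pairs:
        raise ValueError("no candidate pairs")
    return min(pairs, key=cost)  # ties resolve to the first pair in the list


def jar_size_cost(jar_sizes: Dict[str, int]) -> Callable[[Pair], float]:
    """Cost = absolute difference in jar size (bytes) between the two versions."""
    return lambda pair: abs(jar_sizes[pair[1]] - jar_sizes[pair[0]])


def git_diff_cost(repo_dir: str, tag_of: Dict[str, str]) -> Callable[[Pair], float]:
    """Cost = number of lines printed by `git diff fixed-tag vulnerable-tag`.

    `tag_of` maps Maven versions to git tags (assumed to be known already).
    """
    def cost(pair: Pair) -> float:
        vulnerable, fixed = pair
        result = subprocess.run(
            ["git", "-C", repo_dir, "diff", tag_of[fixed], tag_of[vulnerable]],
            capture_output=True, text=True, check=True,
        )
        return len(result.stdout.splitlines())
    return cost
```

Usage would be something like `pick_best_pair(pairs, jar_size_cost(sizes))`, where `sizes` holds the jar size for each version.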

Note this strategy could result in the "best" pair not being on the trunk/latest version lineage where the vuln was presumably fixed first, since those versions may contain many other features and bugfixes -- commits that backport the fix to older versions may be more likely to fix just that specific vuln.

behnazh-w commented 11 months ago

The process sounds reasonable overall except for precise mapping as discussed above.

3. requires mapping Maven versions to git tags which is usually possible and we should probably do anyway

The next release of our tool, Macaron, will have this feature. We are able to map artifacts to commits using tags in their corresponding repo. We also have a feature to automatically find the corresponding repository in the first place, e.g., if you pass the purl for the artifact as follows:

macaron -o output analyze -purl pkg:maven/org.opentest4j/opentest4j@1.2.0?type=jar --skip-deps

we map it to commit 75136304fab712895090c9c4678dc72ccbcb5e21

We will release version 0.6.0 in the next two weeks.
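For anyone who wants to experiment before that release, a naive version of tag matching can be sketched as below. This is not Macaron's implementation, only an illustration of the general idea; the function name and the list of tag naming patterns are assumptions, and real repositories use many other conventions.

```python
from __future__ import annotations

from typing import List, Optional


def find_tag_for_version(tags: List[str], artifact_id: str, version: str) -> Optional[str]:
    """Return the git tag that most plausibly names `version`, or None.

    Tries a few common naming patterns in order of preference
    (the pattern list is an assumption, not an exhaustive set).
    """
    patterns = [
        version,                      # 1.2.0
        f"v{version}",                # v1.2.0
        f"r{version}",                # r1.2.0
        f"{artifact_id}-{version}",   # opentest4j-1.2.0
        f"{artifact_id}-v{version}",  # opentest4j-v1.2.0
    ]
    by_lowercase = {tag.lower(): tag for tag in tags}
    for pattern in patterns:
        match = by_lowercase.get(pattern.lower())
        if match is not None:
            return match
    return None
```

The returned tag can then be resolved to a commit with `git rev-list -n 1 <tag>` in a local clone.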

behnazh-w commented 11 months ago

Have you tried PatchMatch?

binaryeq / jpatch

Review Process described in readme #2