andymeneely / chromium-history

Scripts and data related Chromium's history
11 stars 4 forks source link

Add missed_vuln column to dev_snapshots #243

Closed andymeneely closed 8 years ago

andymeneely commented 8 years ago

For a given developer in a given time period, set a flag for whether or not they missed a vulnerability in any of the files they committed to. The following must be true:

At minimum, you'll need to join the following tables: CodeReview, PatchSetFile, CVENum. Maybe more, but I think that's all you need.

You can try to do this in a Rails query. Or do it in straight SQL - whichever you feel more comfortable with.

Definitely want to come up with some verifies for this. Current dev data should be enough, but I think.

Maybe as another step later: expand "committed to" to "participants".

sso7159 commented 8 years ago

Working Query: Currently have a list of all filepaths associated with any cve in the cvenums table:

SELECT DISTINCT ON(patch_set_files.filepath, code_reviews.issue) code_reviews_cvenums.cvenum_id, code_reviews.issue, code_reviews.created, patch_sets.composite_patch_set_id, patch_set_files.filepath FROM code_reviews INNER JOIN code_reviews_cvenums ON code_reviews.issue = code_reviews_cvenums.code_review_id INNER JOIN patch_sets ON patch_sets.code_review_id = code_reviews.issue INNER JOIN patch_set_files ON patch_sets.composite_patch_set_id = patch_set_files.composite_patch_set_id ORDER BY code_reviews.issue ASC;

Question: I included code_reviews so I could scoop up the created date for that review. I noticed that the code review's created date, last modified date, and the year for the cve (in cve num identifier) are all pretty different (generally by a few years especially for created/modified). Which date should I use for comparison? (and why would be helpful for my understanding! :P)

Also, we did agree this investigation was going to be by filepath right? So I'll have to now look at each filepath's existence in all other reviews and flag developers who reviewed anything involving that filepath?

There are a lot of options for moving forward, so I wanted to pause here for your input before completing the query. :)

sso7159 commented 8 years ago

NEW query - First gets all the filepaths for any cve commit... then all the patch_sets those filepaths were included in (if the patch set is within a year of the cve's CR commit) and then gets all participants on those patch sets... [participants are counted once for each unique patch_set code review for each cve]

SELECT DISTINCT ON(code_reviews_cvenums.cvenum_id, code_reviews.issue, patch_sets.code_review_id, participants.dev_id) code_reviews_cvenums.cvenum_id, code_reviews.issue, commits.created_at, patch_sets.code_review_id, patch_sets.created, patch_sets.composite_patch_set_id, participants.dev_id, commit_filepaths.filepath FROM code_reviews INNER JOIN code_reviews_cvenums ON code_reviews.issue = code_reviews_cvenums.code_review_id INNER JOIN commits ON code_reviews.commit_hash = commits.commit_hash INNER JOIN commit_filepaths ON commits.commit_hash = commit_filepaths.commit_hash INNER JOIN patch_set_files ON patch_set_files.filepath = commit_filepaths.filepath INNER JOIN patch_sets ON patch_sets.composite_patch_set_id = patch_set_files.composite_patch_set_id AND patch_sets.code_review_id != code_reviews.issue AND patch_sets.created > (commits.created_at - interval '1 year') AND patch_sets.created < commits.created_at INNER JOIN participants ON participants.issue = patch_sets.code_review_id ORDER BY code_reviews_cvenums.cvenum_id, code_reviews.issue, patch_sets.code_review_id, participants.dev_id, patch_sets.composite_patch_set_id ASC;

**

andymeneely commented 8 years ago

Here's the same query with indentation and syntax highlighting so it's easier to talk about:

SELECT DISTINCT ON(code_reviews_cvenums.cvenum_id, code_reviews.issue, patch_sets.code_review_id, participants.dev_id)
      code_reviews_cvenums.cvenum_id,
      code_reviews.issue,
      commits.created_at,
      patch_sets.code_review_id,
      patch_sets.created,
      patch_sets.composite_patch_set_id,
      participants.dev_id,
      commit_filepaths.filepath
FROM code_reviews
     INNER JOIN code_reviews_cvenums ON code_reviews.issue = code_reviews_cvenums.code_review_id
     INNER JOIN commits ON code_reviews.commit_hash = commits.commit_hash
     INNER JOIN commit_filepaths ON commits.commit_hash = commit_filepaths.commit_hash
     INNER JOIN patch_set_files ON patch_set_files.filepath = commit_filepaths.filepath
     INNER JOIN patch_sets ON patch_sets.composite_patch_set_id = patch_set_files.composite_patch_set_id AND patch_sets.code_review_id != code_reviews.issue AND patch_sets.created > (commits.created_at - interval '1 year') AND patch_sets.created < commits.created_at
     INNER JOIN participants ON participants.issue = patch_sets.code_review_id
ORDER BY code_reviews_cvenums.cvenum_id,
         code_reviews.issue,
         patch_sets.code_review_id,
         participants.dev_id,
         patch_sets.composite_patch_set_id ASC;

I have one note, but it's big:

We don't need to join onto patch_set and patch_set_files. Those are considered temporary files, so they may or may not land in the final tree. Let's only focus on code that was committed to. So that means the dates should be on commits instead of patch_sets. BUT, you still want the original linkage to commits too because you want the fix commit AND the work commit. Thus, you'll want two inner joins on commits. To tell the two commits apart, you'll need to use AS to alias the table. For example INNER JOIN commits AS fix_commits ON...

andymeneely commented 8 years ago

Ok, I re-written your query (it was SO close!). I think I've got it now. There are some other key changes I made - in particular, joining on the filepath. I also added more AS clauses to be more readable:

SELECT DISTINCT ON(code_reviews_cvenums.cvenum_id, code_reviews.issue, missed_code_reviews.issue, participants.dev_id)
      code_reviews_cvenums.cvenum_id,
      code_reviews.issue,
      fix_commits.created_at,
      work_commits.created_at,
      participants.dev_id,
      work_commit_filepaths.filepath
FROM code_reviews
     INNER JOIN code_reviews_cvenums ON code_reviews.issue = code_reviews_cvenums.code_review_id
     INNER JOIN commits AS fix_commits ON code_reviews.commit_hash = fix_commits.commit_hash
     INNER JOIN commit_filepaths AS fix_commit_filepaths ON fix_commits.commit_hash = fix_commit_filepaths.commit_hash
     INNER JOIN commit_filepaths AS work_commit_filepaths ON work_commit_filepaths.filepath = fix_commit_filepaths.filepath
     INNER JOIN commits AS work_commits ON work_commits.commit_hash = work_commit_filepaths.commit_hash AND work_commits.created_at > (fix_commits.created_at - interval '1 year') AND work_commits.created_at < fix_commits.created_at
     INNER JOIN code_reviews AS missed_code_reviews ON missed_code_reviews.commit_hash = work_commits.commit_hash
     INNER JOIN participants ON participants.issue = missed_code_reviews.issue
ORDER BY code_reviews_cvenums.cvenum_id,
         participants.dev_id;

What do you think? Maybe we should do some manual test cases of some example developers to make sure we get this right.

sso7159 commented 8 years ago

Okay I'll make these changes.

sso7159 commented 8 years ago

changed so the naming convention was a little more consistent and added a few columns for logic checking. It seems pretty good!

SELECT DISTINCT ON(code_reviews_cvenums.cvenum_id, code_reviews.issue, missed_code_reviews.issue, participants.dev_id)
      code_reviews_cvenums.cvenum_id,
      code_reviews.issue,
      fix_commits.created_at,
      missed_code_reviews.issue,
      missed_commits.created_at,
      participants.dev_id,
      missed_commit_filepaths.filepath
FROM code_reviews
     INNER JOIN code_reviews_cvenums ON code_reviews.issue = code_reviews_cvenums.code_review_id
     INNER JOIN commits AS fix_commits ON code_reviews.commit_hash = fix_commits.commit_hash
     INNER JOIN commit_filepaths AS fix_commit_filepaths ON fix_commits.commit_hash = fix_commit_filepaths.commit_hash
     INNER JOIN commit_filepaths AS missed_commit_filepaths ON missed_commit_filepaths.filepath = fix_commit_filepaths.filepath
     INNER JOIN commits AS missed_commits ON missed_commits.commit_hash = missed_commit_filepaths.commit_hash AND missed_commits.created_at > (fix_commits.created_at - interval '1 year') AND missed_commits.created_at < fix_commits.created_at
     INNER JOIN code_reviews AS missed_code_reviews ON missed_code_reviews.commit_hash = missed_commits.commit_hash
     INNER JOIN participants ON participants.issue = missed_code_reviews.issue
ORDER BY code_reviews_cvenums.cvenum_id, code_reviews.issue, missed_code_reviews.issue, participants.dev_id;
sso7159 commented 8 years ago

Realized that we also need a time period WHERE in the query. Since we are querying for collaborations during some time period, I figure it makes sense for this query to look only at the code reviews that were created in the same time period we are looking at the collaborations for. ?? Does this make sense?

SELECT DISTINCT ON(code_reviews_cvenums.cvenum_id, code_reviews.issue, missed_code_reviews.issue, participants.dev_id)
      code_reviews_cvenums.cvenum_id,
      code_reviews.issue,
      fix_commits.created_at,
      missed_code_reviews.issue,
      missed_commits.created_at,
      participants.dev_id,
      missed_commit_filepaths.filepath
FROM code_reviews
     INNER JOIN code_reviews_cvenums ON code_reviews.issue = code_reviews_cvenums.code_review_id
     INNER JOIN commits AS fix_commits ON code_reviews.commit_hash = fix_commits.commit_hash
     INNER JOIN commit_filepaths AS fix_commit_filepaths ON fix_commits.commit_hash = fix_commit_filepaths.commit_hash
     INNER JOIN commit_filepaths AS missed_commit_filepaths ON missed_commit_filepaths.filepath = fix_commit_filepaths.filepath
     INNER JOIN commits AS missed_commits ON missed_commits.commit_hash = missed_commit_filepaths.commit_hash AND missed_commits.created_at > (fix_commits.created_at - interval '1 year') AND missed_commits.created_at < fix_commits.created_at
     INNER JOIN code_reviews AS missed_code_reviews ON missed_code_reviews.commit_hash = missed_commits.commit_hash
     INNER JOIN participants ON participants.issue = missed_code_reviews.issue
WHERE missed_code_review.created >= '" + early boundary + "' AND missed_code_review.created < '" + late boundary + "' +
ORDER BY code_reviews_cvenums.cvenum_id, code_reviews.issue, missed_code_reviews.issue, participants.dev_id;
sso7159 commented 8 years ago

More corrections

SELECT DISTINCT ON( missed_code_reviews.issue, participants.dev_id)
      code_reviews_cvenums.cvenum_id,
      code_reviews.issue,
      fix_commits.created_at,
      missed_code_reviews.issue,
      missed_commits.created_at,
      participants.dev_id,
      missed_commit_filepaths.filepath
FROM code_reviews
     INNER JOIN code_reviews_cvenums ON code_reviews.issue = code_reviews_cvenums.code_review_id
     INNER JOIN commits AS fix_commits ON code_reviews.commit_hash = fix_commits.commit_hash
     INNER JOIN commit_filepaths AS fix_commit_filepaths ON fix_commits.commit_hash = fix_commit_filepaths.commit_hash
     INNER JOIN commit_filepaths AS missed_commit_filepaths ON missed_commit_filepaths.filepath = fix_commit_filepaths.filepath
     INNER JOIN commits AS missed_commits ON missed_commits.commit_hash = missed_commit_filepaths.commit_hash AND missed_commits.created_at > (fix_commits.created_at - interval '1 year') AND missed_commits.created_at < fix_commits.created_at
     INNER JOIN code_reviews AS missed_code_reviews ON missed_code_reviews.commit_hash = missed_commits.commit_hash
     INNER JOIN participants ON participants.issue = missed_code_reviews.issue
WHERE  missed_code_reviews.created >= '" + early boundary + "' AND missed_code_reviews.created < '" + late boundary + "' ORDER BY missed_code_reviews.issue, participants.dev_id;
sso7159 commented 8 years ago

Everything has been cleaned up in developer_snapshots table, all columns have been verified manually, and we have all the vuln_misses, vuln_fixes metrics completed!