TQRG / security-patches-dataset

☠️ Ground-truth dataset for vulnerability prediction (includes known research datasets and data sources such as NVD, CVE Details, and OSV); tools to automatically update the data are provided.
https://arxiv.org/abs/2110.09635
MIT License

Question: Getting code changes #1

Closed: davidlee321 closed this issue 2 years ago

davidlee321 commented 3 years ago

Hello. I appreciate the work that went into creating this sizeable dataset. Are there scripts/code for getting commits' code changes? It looks like scripts/download.py contains most of the logic needed, and a little modification should do the trick.

Thanks @sofiaoreis

sofiaoreis commented 3 years ago

Hi @davidlee321,

Thanks for reaching out! scripts/download.py downloads the entire codebase for each version involved in an entry of db/patch (the vulnerable and fixed versions). If you're only looking for the diff between the two versions, it won't work as-is; you'd have to make some changes to that script.
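If it helps, one minimal way to diff two downloaded versions is `git diff --no-index`, which compares two directories outside any repository. This is only a sketch; the local paths are placeholders for wherever scripts/download.py stored the codebases, not paths the repo actually uses:

import subprocess

# Placeholder paths for the two downloaded codebases.
vuln_dir = 'downloads/some-entry/vulnerable'
fixed_dir = 'downloads/some-entry/fixed'

# `git diff --no-index` compares two directories outside a repository.
# It exits with status 1 when differences exist, so avoid check=True.
result = subprocess.run(
    ['git', 'diff', '--no-index', vuln_dir, fixed_dir],
    capture_output=True, text=True,
)
print(result.stdout)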

I plan to add that feature in the future, but I'm currently integrating other datasets and extracting the CVE Details data for 2020 and 2021. I will release more data very soon.

Feel free to contribute to the repo with a PR, if you want to. I will list you as a contributor.

davidlee321 commented 3 years ago

The GitHub API lets commits be read from GitHub repositories without cloning them. Example:

curl \
  -H "Accept: application/vnd.github.v3.diff" \
  https://api.github.com/repos/TQRG/security-patches-dataset/commits/6ee8095bb232d1e190bf453af09fc87cae558d02

returns the code changes this particular commit made.

I read that some common Git-related Python packages require the repo to be cloned before commits can be read, but with the GitHub API no clone is needed.
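For comparison, the clone-based route with GitPython looks roughly like this (a sketch; the local clone path is a placeholder):

import git  # GitPython

# Requires the repository to already be cloned locally.
repo = git.Repo('path/to/local/clone')
sha = '6ee8095bb232d1e190bf453af09fc87cae558d02'
# Equivalent to running `git show <sha>` inside the clone.
print(repo.git.show(sha))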

The Python requests equivalent is:

import requests

owner = "TQRG"
repo = "security-patches-dataset"
sha = "6ee8095bb232d1e190bf453af09fc87cae558d02"
query_url = f"https://api.github.com/repos/{owner}/{repo}/commits/{sha}"
# Requesting the diff media type returns the raw patch instead of JSON.
headers = {'Accept': 'application/vnd.github.v3.diff'}
r = requests.get(query_url, headers=headers)
print(r.text)
davidlee321 commented 3 years ago

My Python solution:

import requests
import time
import pandas as pd
import numpy as np

df = pd.read_csv('dataset/positive.csv')
df['diff'] = np.nan  # empty column to write code diffs to later

token = 'ghpas1244y6u23c4*****************'
rate_limit_per_sec = 5000 / 60 / 60  # GitHub API allows 5,000 requests/hour for basic (non-enterprise) users.
duration_per_request_minimum_s = (1.0 / rate_limit_per_sec) * 1.05  # additional 5% time leeway.
report_every = 100
errors = []

for i in range(len(df)):
    start = time.time()

    sha = df['sha'][i]
    project = df['project'][i]
    if 'github' not in project.lower():
        print(f'[WARN] {project} (row {i}) not on GitHub. Cannot use GitHub API. Skipped.')
        errors.append(f'row {i} not-github-project')
        continue  # skip row
    owner, repo = project.split('/')[-2:]
    query_url = f"https://api.github.com/repos/{owner}/{repo}/commits/{sha}"
    headers = {'Accept': 'application/vnd.github.v3.diff'}
    if token:
        headers['Authorization'] = f'token {token}'
    r = requests.get(query_url, headers=headers)
    if r.status_code == 200:
        df.loc[i, 'diff'] = r.text  # .loc avoids pandas' chained-assignment pitfall
    else:
        print(f'[WARN] GET request for {project} (row {i}) failed with status code {r.status_code}. Skipped.')
        errors.append(f'row {i} status-not-200')
        continue  # skip row

    # Sleep just long enough to stay under the rate limit.
    duration = time.time() - start
    if duration < duration_per_request_minimum_s:
        time.sleep(duration_per_request_minimum_s - duration)

    if i % report_every == 0:
        print(f'[INFO] Done {i+1} rows.')

if errors:
    print('[WARN] ERRORS DETECTED. SEE VARIABLE `errors` FOR DETAILS.')
else:
    print('[INFO] No errors.')

df

Output (in case it's helpful):

[INFO] Done 1 rows.
[INFO] Done 101 rows.
...
[INFO] Done 4701 rows.
[WARN] GET request for https://github.com/blynkkk/blynk-server (row 4741) failed with status code 404. Skipped.
...
[WARN] GET request for https://github.com/vintagedaddyo/MyBB_Plugin-ChangUonDyU-Advanced-Statistics (row 5774) failed with status code 404. Skipped.
[WARN] GET request for https://github.com/vintagedaddyo/MyBB_Plugin-adminnotes (row 5784) failed with status code 404. Skipped.
...
[WARN] GET request for https://github.com/josh/rack-ssl (row 6185) failed with status code 404. Skipped.
...
[WARN] GET request for https://github.com/diversen/gallery (row 6273) failed with status code 404. Skipped.
...
[WARN] GET request for https://github.com/phreebooks/PhreeBooksERP (row 6667) failed with status code 404. Skipped.
...
[WARN] GET request for https://github.com/javaserverfaces/mojarra (row 7141) failed with status code 422. Skipped.
[WARN] GET request for https://github.com/javaserverfaces/mojarra (row 7142) failed with status code 422. Skipped.
...
[INFO] Done 8001 rows.
[WARN] ERRORS DETECTED. SEE VARIABLE `errors` FOR DETAILS.
sofiaoreis commented 3 years ago

Hi @davidlee321,

Great! Many thanks! 🙏 This approach will definitely work for cases that were fixed by a single commit.

But I'm afraid it won't be enough for patches that involve more than one commit. In those cases, I believe we need to take the diff between the last fix commit and the vulnerable version, which is the parent of the first fix commit.

For instance, if we have a chain of commits A, B, C, and D, where B, C, and D are the commits used to patch the vulnerability, then the diff should be between A and D (A, the parent of the first fix B, is the vulnerable version). I'm currently working on this (scripts/get_code_changes.py), but I think I need to improve the data first.
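If it helps, the GitHub compare API can produce that cumulative diff directly, given the SHAs of A and D (a sketch; the SHA variables are placeholders):

import requests

owner, repo = 'TQRG', 'security-patches-dataset'  # placeholder repo
base_sha = 'SHA_OF_A'  # parent of the first fix commit (vulnerable version)
head_sha = 'SHA_OF_D'  # last fix commit

# GET /repos/{owner}/{repo}/compare/{base}...{head} with the diff media
# type returns the combined diff between the two commits.
url = f'https://api.github.com/repos/{owner}/{repo}/compare/{base_sha}...{head_sha}'
headers = {'Accept': 'application/vnd.github.v3.diff'}
r = requests.get(url, headers=headers)
print(r.text)

The parent SHA of the first fix commit is also available in the `parents` field of the commits endpoint when requesting the default JSON media type.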

Maybe I should update the data by performing the following changes:

What do you think?

davidlee321 commented 3 years ago

No problem. Thank you for building & curating this dataset.

Regarding the challenge of saving a chain of commits, I don't think I have an enlightened opinion. I agree with the ideas outlined; I think it's a good way forward for now. If users want the individual commits along the chain (I believe these are B, C, and D in your example), they can use, say, GitHub's API to grab them manually (as I did above).

After finding the proportion of samples with this chain-of-commits situation, contributors will have a better idea of whether to treat it as a special case or a common one. If it's common, contributors might consider providing these commits in the dataset. If that direction were taken, this dataset would, to my knowledge, be quite impressive in completeness & comprehensiveness relative to other publicly available security-related commit datasets.