b4yuan commented 11 months ago

Summary

This pull request introduces a CodeQL workflow to enhance the security analysis of this repository.

What is CodeQL

CodeQL is a static analysis tool that helps identify and mitigate security vulnerabilities. It is primarily intra-function but does provide some support for inter-function analysis. By integrating CodeQL into a GitHub Actions workflow, it can proactively identify and address potential issues before they become security threats.

For more information on CodeQL and how to interpret its results, refer to the GitHub documentation and the CodeQL documentation (https://codeql.github.com/ and https://codeql.github.com/docs/).

What this PR does

We added a new CodeQL workflow file (.github/workflows/codeql.yml) that

Runs on every pull request (functionality to run on every push to main branches is included as a comment for convenience).
Runs daily.
Excludes queries with a high false positive rate or low-severity findings.
Does not display results for git submodules, focusing only on our own codebase.

Validation

To validate the functionality of this workflow, we have run several test scans on the codebase and reviewed the results. The workflow successfully compiles the project, identifies issues, and provides actionable insights while reducing noise by excluding certain queries and third-party code.

Using the workflow results

If this pull request is merged, the CodeQL workflow will be automatically run on every push to the main branch and on every pull request to the main branch. To view the results of these code scans, follow these steps:

Under the repository name, click on the Security tab.
In the left sidebar, click Code scanning alerts.

Is this a good idea?

We are researchers at Purdue University in the USA. We are studying the potential benefits and costs of using CodeQL on open-source repositories of embedded software.

We wrote up a report of our findings so far. The TL;DR is “CodeQL outperforms the other freely-available static analysis tools, with fairly low false positive rates and lots of real defects”. You can read about the report here: https://arxiv.org/abs/2310.00205

Review of engineering hazards

License: see the license at https://github.com/github/codeql-cli-binaries/blob/main/LICENSE.md:

Here's what you may also do with the Software, but only with an Open Source Codebase and subject to the License Restrictions provisions below:

Perform analysis on the Open Source Codebase.

If the Open Source Codebase is hosted and maintained on GitHub.com, generate CodeQL databases for or during automated analysis, CI, or CD.

False positives: We find that around 20% of errors are false positives, but that these FPs are polarized and only a few rules contribute to most FPs. We find that the top rules contributing to FPs are: cpp/uninitialized-local, cpp/missing-check-scanf, cpp/suspicious-pointer-scaling, cpp/unbounded-write, cpp/constant-comparison, and cpp/inconsistent-null-check. Adding a filter to filter out certain rules that contribute to a high FP rate can be done simply in the workflow file.

b4yuan commented 11 months ago

@gek169 This pull request was designed for tinygl. If you're still hesitant: for documentation purposes, would you be willing to share why you would prefer to not incorporate CodeQL into your repository even though it has been set up for this repo and tested? E.g. false positive rate, maintenance, etc.

b4yuan commented 11 months ago

@gek169 Thanks for your comments.

Those comments are from the starter workflow in GitHub actions for CodeQL configuration: see https://github.com/actions/starter-workflows/blob/main/code-scanning/codeql.yml. If you went to the actions tab of this repository, created a new workflow, and selected CodeQL Analysis, this is the default workflow that Github provides you with. It has been modified to support a tinygl build script, a script that forces the workflow to fail if there are errors, and filters to remove rules or directories for scanning.

Your comments don't explicitly mention "why no CodeQL". Issue https://github.com/C-Chads/tinygl/issues/26 was found with CodeQL scanning on our fork of this repository, which was resolved and brought this repository value.

davisjam commented 11 months ago

@gek169 Thank you for your consideration of this PR. I want to reply to a few of your questions and comments. I'm Prof. James Davis of Purdue University and I am working on the aforementioned research project with Prof. Aravind Machiry of the same, as well as some undergraduate and graduate assistants.

First of all, I want to say that we are sorry if we have caused you any distress. Our intention here is only to improve open-source projects through PRs that add CodeQL checks.

Now, to answer your comments and concerns:

Automatically generated pull requests are not allowed...This code is auto-generated. If this is true, then it is not allowed in the repository....ROBOCOMMITS ARE NOT ALLOWED, AND IF YOU SAY YOU'RE NOT WRITING ROBOCOMMITS BUT WRITE COMMENTS THAT SAY YOU DID... well... that definitely doesn't help your case.

We were unaware of this policy in this repository. Is that policy documented somewhere that we missed? We saw a few commits labeled "Automatic commit" so it didn't seem like there was a blanket policy against automation.

Furthermore, I have seen your group make these automatic PRs to other repositories I have been watching. This behavior should be condemned by your university ethics committee.

At least in the USA, it is unusual for a university to have an ethics committee. Universities commonly have Institutional Review Boards (IRBs) which provide oversight on research whose purpose is to collect data from human subjects. We do have IRB oversight on aspects of this project, but "Offer improvements to the open-source community" is engineering work rather than human subjects data collection. If faculty make an ethical misjudgment, it is up to their peers (e.g. peer review) or an ad hoc university committee to censure them.

I am not sure why you feel it is unethical to create pull requests that add CodeQL workflows to projects to improve their cybersecurity. Is it the fact that we did so with automated assistance? (What about that seems problematic to you?) Many of these PRs have been merged by the engineering teams after they inspected them.

I find this incompatible with our licensing scheme as it is currently unknown what the copyright status of this commit is.

My understanding (perhaps mistaken) is that PRs that do not indicate copyright status are "donations" to the project to which they are offered. That was our intent. We don't care about copyright, we care about improving code quality.

I doubt this has been tested.

If it's any reassurance, we did test it.

I doubt the authenticity of what you're doing. You're acting like the code you've committed wasn't auto-generated but it contains comments that say it was.

Some of the code is generated -- and of course, if you compile it then the compiler will also be generating assembly instructions for your machine. I'm not sure what part of this interaction suggests to you that we are "acting like it wasn't auto-generated". I don't see that in any of the comments from @b4yuan.

Perhaps you generated it using some tool and are trying to pass it off as your own work? Perhaps you've figured out some sort of exploit with github to steal account information?

This repository is an open-source project. We opened a PR to it with an open-source contribution. If you do not trust us from the PR message, you are welcome to examine the code of the PR.

I think you people are trying to run some sort of scam or something. If not that, you're doing some sort of really bad human experimentation and that's not allowed either. Go mess with the linux kernel people like UofM did and see if you can get them to put backdoors in.

We are not trying to run a scam. We are aware of no backdoors in the PR. We have opened many PRs at the conclusion of this research project. Our findings are linked from the PR text -- the paper is on arXiv.

davisjam commented 11 months ago

@gek169

I would recommend making it explicitly clear that your commits are NOT automatically generated but rather TOOL ASSISTED (human-reviewed and tested). This was not clear to me before, and in fact, it seemed that the exact opposite was being communicated.

Thank you for this suggestion. We are still learning how to do this and I appreciate the recommendation.

I hope that you are providing your graduate student workforce (Who have to deal with angry bridge-troll maintainers such as myself) with ample recognition?

We try!

My second guess, then. Alright, I can be your guinea pig. I can help you with your paper by using the CodeQL workflow on TinyGL. I assume you're aiming to get it peer reviewed after you've collected data? I will submit my findings to you here.

The work is submitted in its present state (arXiv). As a research result, we hope "CodeQL does pretty well and doesn't require much tuning per-repo and is finding bugs in repos that use Coverity" is sufficiently interesting to our peers :-). But we figured that after writing the paper we should actually try to put CodeQL into the hands of the engineers whose repositories we analyzed -- hence this PR campaign.

I hope you have a nice weekend.

C-Chads / tinygl

Add CodeQL Workflow for Code Security Analysis #29

Summary

What is CodeQL

What this PR does

Validation

Using the workflow results

Is this a good idea?

Review of engineering hazards