GSA / code-gov-web

DEPRECATED 🛑- Federal Source Code policy implementation.

Required content: How best to measure source code (5.1) #25

Open mattbailey0 opened 8 years ago

mattbailey0 commented 8 years ago

Section 5.1 of the policy states that:

Agencies should calculate the percentage of source code released using a consistent measure—such as real or estimated lines of code, number of self-contained modules, or cost—that meets the intended objectives of this requirement. Additional information regarding how best to measure source code will be provided on Code.gov

jcastle-zz commented 8 years ago

Or even public versus private repos. Getting to 100% is interesting, but it's a bit of "you don't know what you don't know." We have the same issue with Open Data compliance.

One way we address this for Open Data is that there should be at least one data set per major system. Yet our total data sets do not mirror the number of systems.

I imagine agencies could consider the same with code libraries. There has to be at least one code library for each system (and then some, considering all the mobile apps, websites, browser extensions, hackathon outcomes, etc.).

jbjonesjr commented 8 years ago

I've always been a fan of measuring this requirement by dollars 💵 . That's a metric the Federal Government is already very well suited to track and measure, and often ties easily back into other reporting and documentation structures.

You could imagine some situations where the system gets gamed: a small component is open sourced to claim credit for a large contract's value. However, if we believe the value of open source software will outweigh the hesitancy within the current process, the market itself should self-correct for this behavior over time. Measuring by dollars also gives you the ability to impose more stringent disclosure requirements on larger (monetary-wise) projects.
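
As a toy illustration of the dollars-based percentage (all figures below are made up):

```python
# Toy example: measure the release percentage by contract dollars.
# Both figures are made-up placeholders, not real agency numbers.
total_custom_code_spend = 10_000_000  # dollars obligated on custom software development
released_code_spend = 2_500_000       # dollars tied to contracts whose code has been released

print(f"released by dollars: {released_code_spend / total_custom_code_spend:.0%}")
# -> released by dollars: 25%
```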

Just my 2 cents.

mattbailey0 commented 8 years ago

You can find a draft here: https://github.com/presidential-innovation-fellows/code-gov-web/blob/master/_draft_content/how-to/measuring-source-code.md

Note that this intentionally doesn't address which code agencies should release at present - it is focused on how to measure source code, tricky enough in its own right.

I had thought about including subsections on each of the measures discussed to help people think them through in a bit more detail. Of use?

mattbailey0 commented 8 years ago

@jbjonesjr thoughts on the GitHub API and Git's ability to help agencies automate this process?

jcastle-zz commented 8 years ago

@mattbailey0 @jbjonesjr we are working on a Ruby gem / GitHub webhook that will do the following:

1.) Go through all files in a GitHub repo and scan the contents for sensitive (pre-selected) terms
2.) Open files in the repo (e.g., zips, dbs, etc.) and scan them for the terms addressed in item 1 above
3.) Create an issue in the repo explaining the issues to be addressed with the particular files
4.) Make a public repo private until the issues in items 1-3 above are addressed
5.) Provide a metadata file (will port to a JSON template... read: creates an OSS inventory based on the prescribed schema... hmm, sounds very useful for a 12/8 deadline)

Happy to share with all agencies once finished. Steps 1 and 3 above are done; now working on 2. Looping in @jfredrickson5, the awesome coder on GSA Digital Service.
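
If it helps picture it, here's a minimal sketch of how the scanning in items 1 and 2 could work. It's written in Python rather than Ruby for brevity, and the term list and function names are placeholders, not the actual gem's API:

```python
# Illustrative sketch only -- not the actual Ruby gem. SENSITIVE_TERMS and the
# helpers below are placeholders; the real tool would also open an issue and
# flip the repo to private via the GitHub API.
import os
import zipfile

SENSITIVE_TERMS = ["password", "secret_key", "internal use only"]  # pre-selected terms

def scan_text(text, label, findings):
    """Record (location, term) pairs for any sensitive term found in text."""
    for term in SENSITIVE_TERMS:
        if term.lower() in text.lower():
            findings.append((label, term))

def scan_repo(root):
    """Walk a locally cloned repo; scan plain files and the contents of zip archives."""
    findings = []
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")  # skip VCS internals
        for name in filenames:
            path = os.path.join(dirpath, name)
            if zipfile.is_zipfile(path):
                with zipfile.ZipFile(path) as zf:          # item 2: look inside archives
                    for member in zf.namelist():
                        scan_text(zf.read(member).decode("utf-8", "ignore"),
                                  f"{path}:{member}", findings)
            else:
                with open(path, encoding="utf-8", errors="ignore") as f:  # item 1
                    scan_text(f.read(), path, findings)
    return findings

if __name__ == "__main__":
    for location, term in scan_repo("."):
        print(f"flagged {location}: contains '{term}'")
```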

jcastle-zz commented 8 years ago

@mattbailey0 thoughts on measuring the 20% based on this md file:

1.) Measuring code is a funny thing - always debated in CS classes, IT shops, etc. Is it lines of code, number of files, repos, etc.? Looks like you have some good ideas here. Agencies should just be consistent. I think the harder part is finding the 100%.

2.) We are working on automating metadata. Good idea to consider automation for aggregation. Will consider it with @jfredrickson5.

3.) Would be curious to see some industry/research on this. I am sure there are some sources that could be added to your file for reference.

Note: Great start! I like the thoughts on this. I imagine agencies will be kicking this around for a while.

jbjonesjr commented 8 years ago

@jcastle

1.) Go through all files in a GH repo and scan the contents for sensitive (pre-selected) terms

I want to make sure you have seen CFPB's clouseau. It's not quite ready for webhooking, but adding that sort of functionality has been on my to-do list for a while now. If you & @jfredrickson5 are doing your work in the open somewhere, I'd be happy to help as much as possible as well.

Maybe some folks from CFPB would be interested in helping too? This would also make for a great asset running on cloud.gov once complete (just sayin').

bandrzej commented 8 years ago

It is common in several industries, when you measure code, to measure lines of code against another metric.

Frankly, you need a single method/tool to do the measurement. I highly recommend Sonar, which is open source under a GNU license, to perform the analysis. Things like this should be removed:

As @jbjonesjr said, follow the money. If you leave this open to interpretation, it will get abused.


Disclaimer: These opinions are my own - not my employer's.

jbjonesjr commented 8 years ago

@mattbailey0

@jbjonesjr thoughts on the GitHub API and Git's ability to help agencies automate this process?

The GitHub API provides a few different ways to look at this.

  1. The API currently provides the estimated size of the repository on disk. It's estimated because GitHub doesn't store the repository exactly the way a fresh clone does. Still, this data is available, and you could compare "released" file size to non-released file size.
    • NOTE: GitHub's newest API, based on GraphQL and currently in pre-release testing, does not include file size as one of its properties. There is no guarantee that it ever will, so caveat emptor.
  2. You can use the Contents/Archive API to pull down the repository/branch and run file-size checks, lines-of-code scanning, or some other metric in an automated fashion.
  3. You could use the API to list the repositories within an organization and then show which ones are public and which are private. This provides a simple calculation based on repo count (a minimal sketch follows after this list).
  4. As @jcastle mentioned above, a webhook, or as @bandrzej mentioned, an integration, could be added to a repository that runs these checks (file size, lines of code, or some other metric) automatically on each commit.
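
Here's a minimal sketch of the repo-count approach from item 3, using the GitHub REST API. The org name is just an example, and a real script would also handle rate limits:

```python
# Count public vs. private repos in an org via the GitHub REST API.
# Illustrative sketch: listing private repos requires a token with org access.
import os
import requests

ORG = "GSA"  # example organization
TOKEN = os.environ.get("GITHUB_TOKEN")

def count_repos(org):
    headers = {"Authorization": f"token {TOKEN}"} if TOKEN else {}
    public = private = 0
    page = 1
    while True:
        resp = requests.get(
            f"https://api.github.com/orgs/{org}/repos",
            headers=headers,
            params={"type": "all", "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        repos = resp.json()
        if not repos:
            break
        for repo in repos:
            if repo.get("private"):
                private += 1
            else:
                public += 1
        page += 1
    return public, private

if __name__ == "__main__":
    pub, priv = count_repos(ORG)
    total = pub + priv
    if total:
        print(f"{pub} of {total} repos are public ({pub / total:.0%})")
```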

I tend to advise against file-size calculations because compression differs across VCS systems (you can't really compare a Git repo to an SVN repo, for example), and the size is highly dependent on how, and in what order, files were added and actions were taken.

While I am typically not a fan of lines-of-code calculations, given the ability to strip out comments and the scale of this effort, it would probably be reasonable here. Each project has a disincentive to inflate its LOC count (it would make its own work harder to maintain and increase its workload). Also, after release, any falsification of the count or the work would be quickly exposed.
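
As a rough illustration of what "lines of code with comments stripped out" could mean in practice, here's a minimal sketch. The extension-to-comment-marker map is a placeholder and only handles whole-line comments; a real measurement would lean on an established counter such as cloc:

```python
# Count non-blank, non-comment lines per file extension in a cloned repo.
# Illustrative only: block comments and unlisted languages are not handled.
import os

COMMENT_MARKERS = {".py": "#", ".rb": "#", ".js": "//", ".java": "//", ".go": "//"}

def count_loc(root):
    totals = {}
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")  # skip VCS internals
        for name in filenames:
            ext = os.path.splitext(name)[1]
            marker = COMMENT_MARKERS.get(ext)
            if marker is None:
                continue  # not a recognized source file
            with open(os.path.join(dirpath, name), encoding="utf-8", errors="ignore") as f:
                loc = sum(1 for line in f
                          if line.strip() and not line.strip().startswith(marker))
            totals[ext] = totals.get(ext, 0) + loc
    return totals

if __name__ == "__main__":
    for ext, loc in sorted(count_loc(".").items()):
        print(f"{ext}: {loc} lines of code")
```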

Combine this sort of counting with automated tooling, including the ability to host and manage these tools easily within government infrastructure such as cloud.gov or a FedRAMP-compliant solution, and you have a VCS-agnostic, minimal-effort solution (assuming cloud.gov can meet the needs of agencies that cannot use or access cloud-hosted solutions; there are always on-premises options for those agencies).

mattbailey0 commented 8 years ago

@jcastle @jbjonesjr re: clouseau, see also @emanuelfeld's poirot, which is a complete refactor and has some great features.

jbjonesjr commented 8 years ago

@mattbailey0 while looking into Sunlight's repo migration tool (also interesting for this purpose), https://github.com/krues8dr/project-migration, I came across Poirot.

I should have assumed the great @emanuelfeld would have already been all over this. I'd love to find a coalition of the willing interested in combining forces to make a tool accessible from a GitHub webhook (for real-time checking) and as a GitHub Enterprise pre-receive hook.

Or maybe it's just a set of pre-defined SonarQube checks? I don't want to reinvent too much of the wheel, but Automate All the Things™ versus running a script on a locally cloned repo.

okamanda commented 7 years ago

@mattbailey0 can we close this?

DanielJDufour commented 6 years ago

Closing due to inactivity

jbjonesjr commented 6 years ago

Is there an answer/strategy to this yet?

DanielJDufour commented 6 years ago

@jbjonesjr , good question. I'll reopen.

@jcastle and the team have worked up an automated solution. It's not without limitations, but it will be a significant step forward for us. I'll let Joseph speak more to this when he is ready.

I have also sketched out a machine learning solution to measuring the hours it took to produce the source code in any arbitrary repository. However, developing and implementing the model will have to wait at least a month because of higher priorities.

Let me know if there's anything else I can do to help.

Have a good night.

DanielJDufour commented 6 years ago

@jbjonesjr , I'll also be making the ML model sketch public as soon as it's presentable. I would like to hear feedback and thoughts from the community as it's developed.

jbjonesjr commented 6 years ago

Very helpful, @DanielJDufour. This has been a hot topic since before the days of code.gov, so I think it would make a good public document sharing the current best thinking from the folks at code.gov.

DanielJDufour commented 6 years ago

@jbjonesjr , @jcastle created an issue that speaks to our current approach to estimating labor hours: https://github.com/GSA/code-gov-web/issues/416

We'll keep you and the community updated as things progress.

DanielJDufour commented 6 years ago

@jbjonesjr , I switched time-estimator to public, which you can view here: https://github.com/GSA/time-estimator

All are welcome to suggest features (aka factors or inputs) that we should look at when using an ML model to estimate the time it took to produce source code. We won't be able to get to it for a couple months because of competing priorities, but figuring out the factors as a community before building the ML model would be very helpful and speed up development (when it does happen).

time-estimator is really just about estimating time, but we're of course open to initiating, using, or joining in on other projects that measure code quality more generally. Fitting all the nuance of code quality into a scorer (e.g., a linear model with weights) will be difficult, but it would be more objective and would let people customize the weights where there's no consensus. I hope this makes sense - open to everyone's thoughts :-)
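
To make the "linear model with weights" idea concrete, here is a minimal sketch. The feature names, weights, and intercept are made-up placeholders rather than anything time-estimator actually implements, and the weights are exactly what agencies could customize where there's no consensus:

```python
# Illustrative weighted linear scorer for estimating labor hours from repo features.
# All feature names and weights below are hypothetical; real weights would be
# fit against repositories with known labor hours.
FEATURE_WEIGHTS = {
    "lines_of_code": 0.01,  # hours per line of code
    "num_commits": 0.5,     # hours per commit
    "num_files": 0.05,      # hours per file
}
INTERCEPT = 10.0  # baseline hours for any project

def estimate_hours(features, weights=FEATURE_WEIGHTS, intercept=INTERCEPT):
    """Weighted sum of repository features -> estimated labor hours."""
    return intercept + sum(weights.get(name, 0.0) * value
                           for name, value in features.items())

if __name__ == "__main__":
    repo_features = {"lines_of_code": 12000, "num_commits": 340, "num_files": 180}
    print(f"estimated labor hours: {estimate_hours(repo_features):.0f}")
```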