GSA / code-gov-web

DEPRECATED 🛑- Federal Source Code policy implementation.

Required content: How best to measure source code (5.1) #25

Open mattbailey0 opened 8 years ago

mattbailey0 commented 8 years ago

Section 5.1 of the policy states that:

Agencies should calculate the percentage of source code released using a consistent measure—such as real or estimated lines of code, number of self-contained modules, or cost—that meets the intended objectives of this requirement. Additional information regarding how best to measure source code will be provided on Code.gov

jcastle-zz commented 8 years ago

Or even public versus private repos. Getting to 100% is interesting, but it's a bit of "you don't know what you don't know." We have the same issue with Open Data compliance.

One way we address this for Open Data is that there should be at least one data set per major system. Yet our total data sets do not mirror the number of systems.

I imagine agencies could consider the same with code libraries. There has to be at least one code library for each system (and then some, considering all the mobile apps, websites, browser extensions, hackathon outcomes, etc.).

jbjonesjr commented 8 years ago

I've always been a fan of measuring this requirement by dollars 💵 . That's a metric the Federal Government is already very well suited to track and measure, and often ties easily back into other reporting and documentation structures.

You could imagine some situations where the system gets gamed: a small component is open sourced to claim credit for a large contract's value. However, if we believe the value of open source software will outweigh the hesitancy within the current process, the market itself should self-correct for this behavior over time. Measuring by dollars also gives you the ability to impose more stringent disclosure requirements on larger (monetary-wise) projects.
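
As a toy illustration of the dollars-based percentage (all figures below are made up):

```python
# Toy example: measure the release percentage by contract dollars.
# Both figures are made-up placeholders, not real agency numbers.
total_custom_code_spend = 10_000_000  # dollars obligated on custom software development
released_code_spend = 2_500_000       # dollars tied to contracts whose code has been released

print(f"released by dollars: {released_code_spend / total_custom_code_spend:.0%}")
# -> released by dollars: 25%
```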

Just my 2 cents.

mattbailey0 commented 8 years ago

You can find a draft here: https://github.com/presidential-innovation-fellows/code-gov-web/blob/master/_draft_content/how-to/measuring-source-code.md

Note that this intentionally doesn't address which code agencies should release at present - it is focused on how to measure source code, tricky enough in its own right.

I had thought about including subsections on each of the measures discussed to help people think them through in a bit more detail. Of use?

mattbailey0 commented 8 years ago

@jbjonesjr thoughts on the GitHub API and Git's ability to help agencies automate this process?

jcastle-zz commented 8 years ago

@mattbailey0 @jbjonesjr we are working on a Ruby gem / GitHub webhook that will do the following:

1.) Go through all files in a GitHub repo and scan the contents for sensitive (pre-selected) terms
2.) Open files in the repo (e.g., zips, dbs, etc.) and scan them for the terms addressed in item 1 above
3.) Create an issue in the repo explaining the issues to be addressed with the particular files
4.) Make a public repo private until the issues in items 1-3 above are addressed
5.) Provide a metadata file (will port to a JSON template... read: creates an OSS inventory based on the prescribed schema... hmm, sounds very useful for a 12/8 deadline)

Happy to share with all agencies once finished. Steps 1 and 3 above are done; now working on 2. Looping in @jfredrickson5, the awesome coder on GSA Digital Service.
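
If it helps picture it, here's a minimal sketch of how the scanning in items 1 and 2 could work. It's written in Python rather than Ruby for brevity, and the term list and function names are placeholders, not the actual gem's API:

```python
# Illustrative sketch only -- not the actual Ruby gem. SENSITIVE_TERMS and the
# helpers below are placeholders; the real tool would also open an issue and
# flip the repo to private via the GitHub API.
import os
import zipfile

SENSITIVE_TERMS = ["password", "secret_key", "internal use only"]  # pre-selected terms

def scan_text(text, label, findings):
    """Record (location, term) pairs for any sensitive term found in text."""
    for term in SENSITIVE_TERMS:
        if term.lower() in text.lower():
            findings.append((label, term))

def scan_repo(root):
    """Walk a locally cloned repo; scan plain files and the contents of zip archives."""
    findings = []
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")  # skip VCS internals
        for name in filenames:
            path = os.path.join(dirpath, name)
            if zipfile.is_zipfile(path):
                with zipfile.ZipFile(path) as zf:          # item 2: look inside archives
                    for member in zf.namelist():
                        scan_text(zf.read(member).decode("utf-8", "ignore"),
                                  f"{path}:{member}", findings)
            else:
                with open(path, encoding="utf-8", errors="ignore") as f:  # item 1
                    scan_text(f.read(), path, findings)
    return findings

if __name__ == "__main__":
    for location, term in scan_repo("."):
        print(f"flagged {location}: contains '{term}'")
```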

jcastle-zz commented 8 years ago

@mattbailey0 thoughts on measuring the 20% based on this md file:

1.) Measuring code is a funny thing - always debated in CS classes, IT shops, etc. Is it lines of code, number of files, repos, etc.? Looks like you have some good ideas here. Agencies should just be consistent. I think the harder part is finding the 100%.

2.) We are working on automating metadata. Good idea to consider automation for aggregation. Will consider it with @jfredrickson5.

3.) Would be curious to see some industry/research on this. I am sure there are some sources that could be added to your file for reference.

Note: Great start! I like the thoughts on this. I imagine agencies will be kicking this around for a while.

jbjonesjr commented 8 years ago

@jcastle

1.) Go through all files in a GH repo and scan the contents for sensitive (pre-selected) terms

I want to make sure you have seen CFPB's clouseau. It's not quite ready for webhooking, but adding that sort of functionality has been on my to-do list for a while now. If you & @jfredrickson5 are doing your work in the open somewhere, I'd be happy to help as much as possible as well.

Maybe some folks from CFPB would be interested in helping too? This would also make for a great asset running on cloud.gov once complete (just sayin').

bandrzej commented 8 years ago

It is common in several industries, when you measure code, to measure lines of code against another metric.

Frankly, you need a single method/tool to do the measurement. I highly recommend Sonar, which is open source under a GNU license, to perform the analysis. Things like this should be removed:

As @jbjonesjr said, follow the money. If you leave this open to interpretation, it will get abused.


Disclaimer: These opinions are my own - not my employer's.

jbjonesjr commented 8 years ago

@mattbailey0

@jbjonesjr thoughts on the GitHub API and Git's ability to help agencies automate this process?

The GitHub API provides a few different ways to look at this.

  1. The API currently provides the estimated size of the repository on disk. It's estimated because GitHub doesn't store the repository exactly the way a fresh clone does. Still, this data is available, and you could compare "released" file size to non-released file size.
    • NOTE: GitHub's newest API, based on GraphQL and currently in pre-release testing, does not include file size as one of its properties. There is no guarantee that it ever will, so caveat emptor.
  2. You can use the Contents/Archive API to pull down the repository/branch and run file-size checks, lines-of-code scanning, or some other metric in an automated fashion.
  3. You could use the API to list the repositories within an organization and then show which ones are public and which are private. This provides a simple calculation based on repo count (a minimal sketch follows after this list).
  4. As @jcastle mentioned above, a webhook, or as @bandrzej mentioned, an integration, could be added to a repository that runs these checks (file size, lines of code, or some other metric) automatically on each commit.
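
Here's a minimal sketch of the repo-count approach from item 3, using the GitHub REST API. The org name is just an example, and a real script would also handle rate limits:

```python
# Count public vs. private repos in an org via the GitHub REST API.
# Illustrative sketch: listing private repos requires a token with org access.
import os
import requests

ORG = "GSA"  # example organization
TOKEN = os.environ.get("GITHUB_TOKEN")

def count_repos(org):
    headers = {"Authorization": f"token {TOKEN}"} if TOKEN else {}
    public = private = 0
    page = 1
    while True:
        resp = requests.get(
            f"https://api.github.com/orgs/{org}/repos",
            headers=headers,
            params={"type": "all", "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        repos = resp.json()
        if not repos:
            break
        for repo in repos:
            if repo.get("private"):
                private += 1
            else:
                public += 1
        page += 1
    return public, private

if __name__ == "__main__":
    pub, priv = count_repos(ORG)
    total = pub + priv
    if total:
        print(f"{pub} of {total} repos are public ({pub / total:.0%})")
```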

I tend to advise against file-size calculations because compression differs across VCS systems (you can't really compare a Git repo to an SVN repo, for example), and the size is highly dependent on how, and in what order, files were added and actions were taken.

While I am typically not a fan of lines-of-code calculations, given the ability to strip out comments and the scale of this effort, it would probably be reasonable here. Each project has a disincentive to inflate its LOC count (it would make its own work harder to maintain and increase its workload). Also, after release, any falsification of the count or the work would be quickly exposed.
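
As a rough illustration of what "lines of code with comments stripped out" could mean in practice, here's a minimal sketch. The extension-to-comment-marker map is a placeholder and only handles whole-line comments; a real measurement would lean on an established counter such as cloc:

```python
# Count non-blank, non-comment lines per file extension in a cloned repo.
# Illustrative only: block comments and unlisted languages are not handled.
import os

COMMENT_MARKERS = {".py": "#", ".rb": "#", ".js": "//", ".java": "//", ".go": "//"}

def count_loc(root):
    totals = {}
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")  # skip VCS internals
        for name in filenames:
            ext = os.path.splitext(name)[1]
            marker = COMMENT_MARKERS.get(ext)
            if marker is None:
                continue  # not a recognized source file
            with open(os.path.join(dirpath, name), encoding="utf-8", errors="ignore") as f:
                loc = sum(1 for line in f
                          if line.strip() and not line.strip().startswith(marker))
            totals[ext] = totals.get(ext, 0) + loc
    return totals

if __name__ == "__main__":
    for ext, loc in sorted(count_loc(".").items()):
        print(f"{ext}: {loc} lines of code")
```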

Combine this sort of counting with automated tooling, including the ability to host and manage these tools easily within government infrastructure such as cloud.gov or a FedRAMP-compliant solution, and you have a VCS-agnostic, minimal-effort solution (assuming cloud.gov can meet the needs of agencies that cannot use or access cloud-hosted solutions; there are always on-premises options for those agencies).

mattbailey0 commented 8 years ago

@jcastle @jbjonesjr re: clouseau, see also @emanuelfeld's poirot, which is a complete refactor and has some great features.

jbjonesjr commented 8 years ago

@mattbailey0 while looking into Sunlight's repo migration tool (also interesting for this purpose), https://github.com/krues8dr/project-migration, I came across Poirot.

I should have assumed the great @emanuelfeld would have already been all over this. I'd love to find a coalition of the willing interested in combining forces to make a tool accessible from a GitHub webhook (for real-time checking) and as a GitHub Enterprise pre-receive hook.

Or maybe it's just a set of pre-defined SonarQube checks? I don't want to reinvent too much of the wheel, but Automate All the Things™ versus running a script on a locally cloned repo.

okamanda commented 7 years ago

@mattbailey0 can we close this?

DanielJDufour commented 6 years ago

Closing due to inactivity

jbjonesjr commented 6 years ago

Is there an answer/strategy to this yet?

DanielJDufour commented 6 years ago

@jbjonesjr , good question. I'll reopen.

@jcastle and the team have worked up an automated solution. It's not without limitations, but it will be a significant step forward for us. I'll let Joseph speak more to this when he is ready.

I have also sketched out a machine learning solution to measuring the hours it took to produce the source code in any arbitrary repository. However, developing and implementing the model will have to wait at least a month because of higher priorities.

Let me know if there's anything else I can do to help.

Have a good night.

DanielJDufour commented 6 years ago

@jbjonesjr , I'll also be making the ML model sketch public as soon as it's presentable. I would like to hear feedback and thoughts from the community as it's developed.

jbjonesjr commented 6 years ago

Very helpful, @DanielJDufour. This has been a hot topic since before the days of code.gov, so I think it would make a good public document sharing the current best thinking from the folks at code.gov.

DanielJDufour commented 6 years ago

@jbjonesjr , @jcastle created an issue that speaks to our current approach to estimating labor hours: https://github.com/GSA/code-gov-web/issues/416

We'll keep you and the community updated as things progress.

DanielJDufour commented 6 years ago

@jbjonesjr , I switched time-estimator to public, which you can view here: https://github.com/GSA/time-estimator

All are welcome to suggest features (aka factors or inputs) that we should look at when using an ML model to estimate the time it took to produce source code. We won't be able to get to it for a couple months because of competing priorities, but figuring out the factors as a community before building the ML model would be very helpful and speed up development (when it does happen).

time-estimator is really just about estimating time, but we're of course open to initiating, using, or joining in on other projects that measure code quality more generally. Fitting all the nuance of code quality into a scorer (e.g., a linear model with weights) will be difficult, but it would be more objective and would let people customize the weights where there's no consensus. I hope this makes sense - open to everyone's thoughts :-)
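
To make the "linear model with weights" idea concrete, here is a minimal sketch. The feature names, weights, and intercept are made-up placeholders rather than anything time-estimator actually implements, and the weights are exactly what agencies could customize where there's no consensus:

```python
# Illustrative weighted linear scorer for estimating labor hours from repo features.
# All feature names and weights below are hypothetical; real weights would be
# fit against repositories with known labor hours.
FEATURE_WEIGHTS = {
    "lines_of_code": 0.01,  # hours per line of code
    "num_commits": 0.5,     # hours per commit
    "num_files": 0.05,      # hours per file
}
INTERCEPT = 10.0  # baseline hours for any project

def estimate_hours(features, weights=FEATURE_WEIGHTS, intercept=INTERCEPT):
    """Weighted sum of repository features -> estimated labor hours."""
    return intercept + sum(weights.get(name, 0.0) * value
                           for name, value in features.items())

if __name__ == "__main__":
    repo_features = {"lines_of_code": 12000, "num_commits": 340, "num_files": 180}
    print(f"estimated labor hours: {estimate_hours(repo_features):.0f}")
```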