Comment on OMB Source Code Policy: 20% should be based on # of systems, not # of lines of code #187

wslack commented 7 years ago

Originally posted by @noahkunin; formatting not preserved:

(I’m Noah, the Infrastructure Director at 18F, an office in the U.S. General Services Administration (GSA) that provides in-house digital services consulting for the federal government. I’m commenting on behalf of 18F; we’re an open source team and happy to share our thoughts and experiences. This comment represents only the views of 18F, not necessarily those of the GSA or its Chief Information Officer.)

The denominator of the 20% requirement should be the number of systems (as defined in the agency’s system inventory) - not the # of lines of code under the agency’s purview.

Consider: if a low-performing contractor to an agency writes 10,000 lines of open source code to create a capability that a high-performing contractor would have needed only 1,000 lines of code to create, the agency in question is much further along towards hitting the 20 percent goal, but only because of that inefficiency.

This will result in agencies with low-performing and inefficient contractors meeting the goal while releasing fewer capabilities as open source, which does not seem in line with the goals of this policy.
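
To make the arithmetic concrete, here is a minimal sketch of how a lines-of-code percentage rewards the less efficient contractor. Only the 1,000 and 10,000 line figures come from the example above. The 40,000 lines of closed code and the function name `open_source_share` are assumptions chosen for illustration.

```python
def open_source_share(open_lines: int, closed_lines: int) -> float:
    """Fraction of an agency's total lines of code that is open source."""
    total = open_lines + closed_lines
    if total == 0:
        raise ValueError("agency has no code to measure")
    return open_lines / total


# Hypothetical figure: other custom code the agency keeps closed.
CLOSED_LINES = 40_000

# Same capability, delivered in 1,000 lines or in 10,000 lines.
efficient = open_source_share(1_000, CLOSED_LINES)
inefficient = open_source_share(10_000, CLOSED_LINES)

print(f"efficient contractor:   {efficient:.1%}")
print(f"inefficient contractor: {inefficient:.1%}")
```

Under these assumed numbers, the agency with the inefficient contractor reaches 20 percent by releasing a single capability, while the agency with the efficient contractor sits near 2.4 percent after releasing the same capability.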

The only way to correct for this is to measure released code on the system level. This will also positively impact our ability to evaluate reusability of the code. Simply releasing 5 percent of the code of System A, and 5 percent of the code of System B and so on will not lead to a valid analysis of the policy’s impact.

wslack commented 7 years ago

Subsequent comment by @noahkunin; formatting not preserved:

NIST 800-53 Rev 4 asks all agencies to implement an Information System Component Inventory as a priority one control for all information system impact levels.

As a result, all agencies should have a list of all the information systems that they use (and the components of those systems). Each system is also usually the "unit" on which you grant an Authority to Operate (ATO). Often, this system is also the unit government teams use to manage financial investments, cost centers, audits, listing high-value assets, etc.

This way of defining a system isn't perfect, but it's currently the closest thing we have to a standardized unit of software technology in government.

Under the recommendation above, for example, if an agency has ten information systems (made up of the standardized components in their inventory under CM-8), they would release the code of two of those systems. This would also significantly cut down agency overhead in complying with the policy: the number of lines of code under an agency's purview changes constantly as code gets updated, while the number of systems is much more stable. It is therefore easier to prioritize, at the system level, which codebases to release.
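
As a rough sketch of the system-based calculation, the target depends only on the size of the inventory. The function name `systems_to_release` and the choice to round up are assumptions for illustration. The policy text does not say how a fractional result should be handled.

```python
def systems_to_release(system_count: int, percent: int = 20) -> int:
    """Systems needed to reach `percent` of the inventory, rounded up.

    Rounding up is an assumption; integer math avoids float surprises.
    """
    if system_count < 0:
        raise ValueError("system_count cannot be negative")
    return (system_count * percent + 99) // 100


# The example from the comment: ten systems in the inventory.
print(systems_to_release(10))

# Hypothetical line counts for the same ten systems in two different months.
# The totals differ, but the system-based target does not move.
march_lines = [1200, 800, 15000, 430, 9000, 2500, 60, 7100, 3300, 990]
april_lines = [1350, 800, 15800, 430, 8700, 2900, 75, 7100, 3650, 990]
assert systems_to_release(len(march_lines)) == systems_to_release(len(april_lines))
```

With ten systems this gives two, matching the example above, and the answer only changes when a system is added to or retired from the inventory.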

( 🎩 tip to @brittag and @fureigh for assistance with this response)

wslack commented 7 years ago

Subsequent comment by @konklone; formatting not preserved:

Just to elaborate a bit on @NoahKunin's point: whether or not a formal system inventory is already kept or available in accordance with NIST's guidelines, basing 20% on lines of code is going to be highly difficult in practice.

In addition to the issues described above, the process of mechanically calculating "lines of code" is fraught.

Take 18F's website, which we develop in the open at https://github.com/18F/18f.gsa.gov. Some questions a lines-of-code calculator may need to take into consideration:

Do the Markdown files with blog posts count as lines of code?
Does every line of HTML in each template file count as code?
Is this still true even if some HTML lines are just copy, and some are more instructional?
What if there are extraneous files in the repository that aren't really part of the software per se, but part of its documentation or some other semi-related materials that maybe shouldn't be in the repo but are anyway?

Presuming that lines of code aren't feasible to count manually, these are decisions that automated tools would need to account for in order to do this calculation.
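
The questions above can be illustrated with a deliberately naive counter. This is a sketch under stated assumptions, not how CLOC or Linguist work: the file names and contents are invented, and `count_lines` is a hypothetical helper that decides what counts purely by file extension.

```python
from pathlib import PurePosixPath


def count_lines(files, code_suffixes, skip_blank=True):
    """Count lines in files whose extension is in `code_suffixes`.

    `files` maps a repository path to that file's text.
    """
    total = 0
    for path, text in files.items():
        if PurePosixPath(path).suffix not in code_suffixes:
            continue  # this file type does not count as code under this rule
        lines = text.splitlines()
        if skip_blank:
            lines = [line for line in lines if line.strip()]
        total += len(lines)
    return total


# Invented stand-in for a small website repository.
REPO = {
    "_posts/welcome.md": "# Welcome\n\nHello from the blog.\n",
    "_layouts/default.html": "<html>\n<body>\n{{ content }}\n</body>\n</html>\n",
    "assets/site.js": "function init() {\n  return true;\n}\n",
    "docs/meeting-notes.txt": "Notes\nMore notes\n",
}

strict = count_lines(REPO, {".js"})
with_templates = count_lines(REPO, {".js", ".html"})
everything = count_lines(REPO, {".js", ".html", ".md", ".txt"})

print(strict, with_templates, everything)
```

Even in this four-file example, the total is 3, 8, or 12 lines depending on which file types are treated as code, so the percentage an agency reports would depend heavily on configuration choices.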

There are tools out there that try to approximate lines of code, like CLOC and Linguist. CLOC is a very heavy and complicated tool with many options, and Linguist is optimized for calculating percentages of languages rather than lines of code (for example, it freely counts HTML, which OMB might not want to do). It's unlikely that any tool will meet OMB's needs without sophisticated configuration effort.

Regardless of the tool used -- the tool calculating lines of code must have direct disk access to full copies of all of an agency's source code, in order to search through all the files and count. That's likely infeasible for any individual agency to do at scale -- especially since it must necessarily include all the closed repositories and not just the open ones, in order to calculate a percentage -- and would be difficult and expensive for OMB or GSA to do centrally (if possible at all, since it would require authorized access to closed agency repositories).

Measuring 20% by lines of code is likely to create serious engineering challenges, and we think OMB would be better off going with a 20% metric that's based on the number of discrete projects, rather than the size of projects. That approach also has problems, but it's flawed in a way that's easier to understand and measure.

The simplest solution, from both an engineering and enforcement perspective, is dropping the 20% threshold in favor of open source by default, requested by 18F, DHS NCATS, and a number of other commenters inside and outside the federal government. But if a threshold is kept -- please keep it simple.

18F / tts-public-comments
