Comment on OMB Source Code Policy: [Request for discussion] How agencies should inventory their software #116

Originally posted by @konklone. Formatting not preserved here.

(I’m Eric, an engineer at 18F, an office in the U.S. General Services Administration (GSA) that provides in-house digital services consulting for the federal government. I’m commenting on behalf of 18F; we’re an open source team and happy to share our thoughts and experiences. This comment represents only the views of 18F, not necessarily those of the GSA or its Chief Information Officer.)

The Implementation section of the policy asks agencies to inventory their open- and closed-source software projects, so that OMB and the public can increase the discoverability of agency software. This seems very similar to what agencies do with their datasets, in support of M-13-13 and Project Open Data.

Many agencies don't have existing inventory processes in place, and agencies manage their enterprise data inventories in a variety of mostly manual ways. Our experience with these data inventories is that they are often out of date and incomplete.

Given that, we think cognitive simplicity and automation, for project owners and agency staff managing inventory data, will be key to getting complete and timely inventory data. In other words, we should make lives easier for publishers, even at the cost of inconveniencing consumers, so that consumers end up with better overall data.

There are a variety of ways you could accomplish this. We describe a couple of ways below, but would really love discussion on the best way to achieve this.

One way is to have agencies list the places that their software projects can be found, rather than a single list of all of their projects. These places would be expected to have a machine-readable way to list those projects -- they could be GitLab or GitHub accounts, or RSS or Atom feeds maintained directly by an agency, or an OMB-designed schema.

This resembles how sitemaps work today, where your initial sitemap may just be an index of links to other sitemaps.

One simple way to represent an index like this might be:

{
  "hosts": [{
    "url": "https://github.com/18F",
    "format": "github"
  },{
    "url": "https://code.gsa.gov/feed/",
    "format": "rss"
  },{
    "url": "https://code.cio.gov/repos.json",
    "format": "omb"
  }]
}

OMB or the public would then need to "walk" each of these places using format-specific adapters that use (in the above example) the GitHub API, an RSS parser, and an OMB-specific parser in turn. (In practice, the only way to inventory closed-source projects would be agency-hosted data, not via GitHub -- so closed GitHub repositories would need to be inventoried elsewhere.)
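A minimal sketch of that "walk" step follows. It is an illustration under stated assumptions, not a proposed implementation: the function names, the `name`/`url` record shape, and the layout of the "omb" JSON file are all hypothetical, since no OMB schema exists yet. Fetching is injected as a plain function so the sketch needs no network access; a real GitHub adapter would call the GitHub API and is left out here.

```python
import json
import xml.etree.ElementTree as ET


def parse_rss(xml_text):
    """Read projects from an RSS 2.0 feed: one <item> per project."""
    root = ET.fromstring(xml_text)
    return [
        {"name": item.findtext("title", default=""),
         "url": item.findtext("link", default="")}
        for item in root.iter("item")
    ]


def parse_omb(json_text):
    """Read projects from a hypothetical OMB-format file.

    Assumed layout: {"projects": [{"name": ..., "url": ...}, ...]}
    """
    return [{"name": p["name"], "url": p["url"]}
            for p in json.loads(json_text)["projects"]]


def make_adapters(fetch):
    """Build format-specific adapters around a fetch(url) -> text function."""
    return {
        "rss": lambda url: parse_rss(fetch(url)),
        "omb": lambda url: parse_omb(fetch(url)),
    }


def walk_index(index, adapters):
    """Visit every host in the index and collect the projects it lists."""
    projects = []
    for host in index["hosts"]:
        adapter = adapters.get(host["format"])
        if adapter is None:
            # Fail loudly: a silently skipped host means an incomplete inventory.
            raise ValueError("no adapter for format: " + host["format"])
        for project in adapter(host["url"]):
            # Record which host each project was discovered on.
            projects.append({**project, "source": host["url"]})
    return projects
```

Adding support for another platform then means adding one more entry to the adapter table, which is the part OMB could publish as open source tooling.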

This approach has some clear limitations. OMB and the public would be limited by the data fields available in the software systems used by agencies, and they would have to employ a more sophisticated system in order to discover every project across a variety of formats. There's also some additional complexity inherent in using a "two-tiered" inventory system, as compared to simply having agencies produce a single large list of repositories.

However, this would reduce the inventory burden that creating a new open source project places on agencies and their developers to essentially nothing, and would leave agency inventory maintainers responsible only for documenting closed source work. This is proportionate to the level of ease and fluidity that OMB should want agencies to have regarding open source code, and would be an acknowledgment by OMB that they don't want the inventory process to be a major burden to agencies. In addition, the burden of using format-specific adapters to walk different services could be mitigated if the tools OMB uses to walk agency inventories are made open source and straightforward for others to use.

Alternatively, OMB could ask agencies to provide a single simple JSON list of all open- and closed-source software projects, but provide tools (potentially simple in-browser tools) to help agencies make use of their existing GitHub/GitLab/RSS feeds to generate that single list. By comparison to the above, this approach would make OMB's life (and the public's life) easier when walking over agency inventories, but would add the burden to OMB of providing maintained tools that help walk different feeds of software projects. Agency inventory maintainers would have to do more work when updating their inventories, though not at any additional frequency.
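As a rough sketch of what such a helper tool might do, the snippet below merges project lists gathered from several sources (for example, a platform export plus hand-entered closed source projects) into one list. The field names (`name`, `url`, `open_source`, `contact`) and the top-level `projects` key are assumptions made for illustration only; the policy does not define a schema.

```python
import json


def build_inventory(sources):
    """Merge per-source project lists into one list, de-duplicated by URL."""
    merged = {}
    for projects in sources:
        for project in projects:
            entry = merged.setdefault(project["url"], {})
            for key, value in project.items():
                # Earlier sources win; later ones only fill in missing fields.
                entry.setdefault(key, value)
    # Sort by URL so regenerating the file produces a stable result.
    return {"projects": sorted(merged.values(), key=lambda p: p["url"])}


def to_json(inventory):
    """Serialize the inventory as the single JSON file an agency would publish."""
    return json.dumps(inventory, indent=2, sort_keys=True)
```

Because earlier sources take precedence, an agency could list its hand-maintained entries first and let generated feeds fill in only the gaps.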

Originally posted by @philipashlock; formatting not preserved here:

(I'm Phil Ashlock, the Chief Architect at Data.gov, which is also operated within the U.S. General Services Administration (GSA). As part of my role at Data.gov, I've worked with OMB and agencies to shape and implement the Open Data Policy and maintain Project Open Data (especially the metadata schema), on which this policy has been heavily modeled. This comment represents only my views, not necessarily those of the GSA or its Chief Information Officer.)

The approach described by Eric is a hybrid between an inventory and a list of existing inventories. While it makes it clear that agencies will need to implement an inventory mechanism to document their closed source projects, it suggests agencies won't need to worry about the inventory process for projects that can be automatically inventoried by systems like GitHub.

We should remember that systems like GitHub don't magically create the metadata we'd want to include in these inventories. This information does need to be updated and maintained just as manually as if it were entered into any other system. While one could argue that this hybrid or bifurcated process reduces burden on the agency, I think you could argue that the amount of work to enter and maintain the primary source of information is the same, but it prevents the agencies from actively engaging in the management of a complete inventory. This approach prevents agencies from creating a usable, comprehensive, and complete inventory of their own software projects, which would help them better plan internally, keep track of ongoing work, and avoid duplicated effort. Instead it asks OMB or the public to assemble a complete inventory for the agencies, as if they shouldn't be bothered to manage or make use of such an effort. This is not meant to be a compliance exercise and it's not just for the public benefit. At its core, this process should help agencies understand what they're doing in a more holistic way and make decisions accordingly.

This hybrid approach also asks OMB or the public to be responsible for adapting these separate inventory systems. This means we will build tools and process around these separate inventory systems rather than work toward a common standard that would actually make it easier to automate the flow of the information from the source of data entry into these cohesive agency-wide inventories. Ironically this means we're effectively promoting a strategy of vendor lock-in for a policy focused on just the opposite.

Both the open data policy and this open source policy require this inventory metadata to be entered at some point and the challenges with the manual data entry cited for the open data policy will be just the same with this policy any way we implement it - hybrid or not. With the open data policy there were some agencies that already had effective data management and inventory processes in place and the policy simply meant outputting those workflows with the metadata standard established by the policy. However, many agencies had no such process in place, so the policy forced them to establish that metadata entry for the very first time. It will be just the same for this policy. While we're fortunate that code management platforms are seeing wider use across government, there are surely numerous projects, both open and closed, that will either need to be migrated to such a platform or have this metadata entered manually for the first time to meet the goals of this policy.

I wholeheartedly agree that we should automate the management of this metadata as much as possible, but agencies will need to be responsible for ensuring this happens and we should work toward common standards to make the process as streamlined and brainless as possible from any source. Ultimately agencies will still need to actively think about what makes up their inventory. Otherwise, we're not asking them to take advantage of the practical benefits of open source and see what's already out there within their own organization or document anything well enough for others to do the same.

I highly recommend the alternative approach Eric suggested. Agencies should be producing complete comprehensive inventories and OMB should help ensure this process can be as streamlined as possible when taking advantage of existing platforms. We should also be engaging platforms like GitHub to ensure their API and the approach to automatically generate assets like README files can help feed this standardized metadata process as well. We took this approach with the Open Data Policy, working in the public with the broader community, and now almost all the major data inventory systems implement the same open standard. Much of the credit for that strategy and architecting Project Open Data in general goes to @benbalter :)

Originally posted by @jjediny; formatting not preserved here.

I'm John Jediny, Chief Data Engineer at Data.gov working with @philipashlock +1...

I'll add:

I agree that the simplest approach would be to start with a single README (using a YAML/MD format) posted on any publicly accessible website and/or git repo (GitHub/GitLab/Bitbucket) that can be periodically pinged via a URL/URI registered with a central catalog. If these files implement an established shared core schema, each could be used as a single entry in a collection of entries rendered by a static website, which could also be used to generate a consolidated JSON file (e.g. as a _collection or _data folder/file in Jekyll). As YAML and JSON are interoperable formats, you can compile many YAML files into one JSON file; conversely, you can break one JSON file down into many YAML files. Both formats are the basis for much of modern code configuration/automation and for dynamic or static APIs, because they are among the few formats that can contain a one-to-many relationship (or nested hierarchy) within a single flat file.

I suggest the Project Open Source team adopt an approach similar to the distributed generation model of Project Open Data, with centralized validation and cataloging. Here are some related efforts that highlight a similar approach we are attempting for registering new data:

A Jekyll-based data catalog that generates data.json, and a Jekyll page generator based on a JSON schema that can be extended and have pick-lists provided through an extending schema. It is important that any standard be vetted, tested, and ideally approached from both a metadata perspective (YAML/MD/JSON files using a common/shared standard) and a project management perspective (i.e. git repos).

https://github.com/project-open-data/datajson-editor

https://github.com/JJediny/json-editor

https://github.com/18F/about-yml-editor

This approach provides the most options and the highest level of interoperability:

Post the YAML/MD file on any public website, register it, and have it pinged. Add the YAML/MD file to a static site and compile it into a single JSON file that can be harvested, parsed, and merged. Use existing repos, CMSs, etc., mapping or extending their data models to conform to a common standard. However, this is all predicated on establishing said standard/spec/schema per #117.

Holding this open if people want to copy their original markdown so the formatting is better.

I updated mine to copy my original markdown from https://github.com/WhiteHouse/source-code-policy/issues/116. @philipashlock?

18F / tts-public-comments
