theresaanna commented 7 years ago

Part of the Federal Source Code Policy requires that federal agencies make available an inventory of metadata describing their custom software. We’re exploring ways for agencies to provide their inventories. We want to implement a solution that works well for agencies and we need your help to do that.

The Federal Source Code Policy describes code.gov as “the primary discoverability portal for custom-developed code intended both for Government-wide reuse and for release as OSS.” The inventory data that agencies provide will be made available through code.gov. The data we collect should make it possible for agencies to find projects relevant to their needs.

There are two primary areas we see where decisions need to be made: the data format and what data is collected.

Data Format

The two options we are considering are CSV and JSON. The assumed benefit to a CSV-based approach is that it is easier for agencies to create and maintain a CSV than JSON. With this approach, we might create a system for agencies to submit their inventory CSV. With a JSON-based approach, we might ask agencies to make the “inventory.json” available on their website and we would have a system to retrieve inventories as they change. One drawback to JSON is that it is more effort to maintain, takes specialized knowledge, and we may need to provide a tool to build the JSON. On the other hand, JSON is easy to work with programmatically and matches what Data.gov does, meaning many agencies have some familiarity with the process that inventory updating would entail.

The unanswered questions on data format are:

Which data format is the best fit: CSV or JSON?
Is it best to retrieve or ask agencies to submit their inventories?
Collected Data

In either data format, we need to determine what data we will collect. Below is a list of fields we are considering accepting.

Proposed required fields:

Project Name: The name of the project
Project Description: A description of what the software does
Point of Contact Email Address: Email address for the project point of contact
Proposed optional fields:
Version Control System: the VCS that the project uses
Repository URL: The URL of the upstream project repository, if applicable
Project URL: The URL of the project homepage
Project Tags: Tags that will help Code.gov users find projects based on their needs
Languages: Which programming languages are used in the software
Last Updated: Timestamp of when the project codebase was last updated
License: The type of license that the source code of the project is released under
Open Project Status: Whether or not this project is open source
Government-wide Reuse Project Status: Whether this project is designed for reuse across government
Exemption: Which exemption, if any, is being relied upon for keeping the source code private

For an idea of what the data might look like, we have an early draft of a schema with example content: (https://gist.github.com/theresaanna/a82bfb39b64362bca04e4644706b0ce4)

The questions that we are looking to answer here are:

Is this the right information to collect? Have we missed something?
What should we consider on the part of agencies when implementing this data schema?

Thanks for your feedback! It’s crucial for us in meeting our goal of providing a system and schema that are easy to use and meets agencies' needs.

theresaanna commented 7 years ago

I've asked a handful of developers here at 18F for some feedback on approach and schema. Here are some highlights:

Projects that have instances per-agency, like the eRegulations platform, will be duplicated in the directory.
We should provide a way for agencies to submit tarball or other packaged software URLs.
In the draft schema, we should make the names clearer and avoid abbreviations like "openPjct", "govwideReusePjct", and "closedPjct", opting to instead spell out "project".
"pjctTags" should be an array.
"license" may have multiple values
It could be helpful to have some searchable metadata as indicators of overall project health (# contributors, alpha/beta/prod status, etc.)
We may be able to remove "closedPjct" because the presence of "exemption" would indicate that it's a closed project.

mgifford commented 7 years ago

Would be interesting to be able to record if projects:

Meet with Section 508 (WCAG 2.0 AA) requirements
Support multi-lingual content (English/Spanish) and interface
Have undergone external security reviews
Are actively maintained or have critical mass
Include screenshots or other documentation

That being said, good to make it as easy as possible for folks to get started. We don't want to overwhelm folks.

rossdakin commented 7 years ago

I firmly believe that the proposed schema should include a standards-compliant 3.5mm headphone jack.

rossdakin commented 7 years ago

But seriously, some thoughts:

Data Format

Some thoughts on proposed data format standards for agency publication (code.gov consumption).

NOTE: a related but distinct feature of code.gov should be the publication of its aggregated inventory. There may be value in providing this inventory in many formats (expecting many varied consumers), whereas below I advocate for a single data format (expecting code.gov to be the sole consumer).

JSON

This should be the standard, IMO, for all the reasons that JSON has become popular: easily readable, ubiquitous, expressive (i.e. allows for collections (arrays) unlike CSV), and libraries exist in all major languages/platforms for JSON generation.

If only one format is supported, I suggest it be JSON.

XML

This wasn't mentioned, but is ubiquitous enough to warrant discussion. As I see it, JSON can do everything XML can do while being more readable and easier to construct and less complex to define (no WSDL, etc.).

If the schema were intended for broad consumption, I might suggest discussing XML support, but seeing as the schema is primarily intended for consumption exclusively by code.gov, I don't think the added complexity yields much additional benefit.

CSV

I don't see any benefits to supporting CSV, which lacks support for expressions like multi-dimensional collections (i.e. arrays) beyond the single dimension of the rows in a CSV table. One could hack around this constraint by supporting dynamic column headers (e.g. tag_1, tag_2, ... tag_n) or implicitly enumerated columns (e.g. tag, tag, tag -- similar to how some web frameworks handle array POST value). The same could be done for nested attributes (e.g. contractor_1_contacts_contact_2_phone) but that's incredibly inelegant.

One could argue that CSV is simpler to publish when maintaining an inventory by hand (e.g. by exporting an Excel spreadsheet). While this is true, I don't think that benefit outweighs the inherent limits of the format. It also seems that in the long-term, we would want agencies to programmatically generate their inventory file rather than hand-crafting it manually; not supporting CSV may nudge them in the desired direction.

YML

For the sake of completeness -- not mentioned above, but worth discussing. Same attributes as JSON but somewhat more human-readable and somewhat more fragile (white space dependency). I don't see a benefit to supporting YML in addition to JSON.

Collection Methodology

Is it best to retrieve or ask agencies to submit their inventories?

Pros and cons either way. A "pull" methodology seems to be the simplest (avoids "push" credential checking, account maintenance, etc.; also puts burden of initiation on code.gov centrally rather than on each agency individually).

One benefit of a "push" methodology would be more real-time reporting, though I'm not convinced that real-time reporting is very important in this project or outweighs the additional complexity.

CRUD

It's also worth talking about how specific actions should be taken and how certain situations should be interpreted.

For example, if a record suddenly stops being included in a agency's reported inventory, what does code.gov do? Delete it? Ignore the omission?

Should code.gov assign unique identifiers or require them as a part of inventory submission (to avoid duplication and enable "upserts")?

Which inventory actions should be idempotent?

Etc.

Collected Data

1,000% agree with @theresaanna on fully spelling out field names rather than using abbreviated/Hungarian-like naming.

Relationships / Reuse

If one agency does start using code from another agency, how is that represented in the code.gov data model?

ctubbsii commented 7 years ago

If data entry is provided, then the format CSV or JSON doesn't matter, because the view can be exposed either way. The format does matter for bulk-import of metadata, and for that, I'd prefer JSON.

I think it's best to ask agencies to submit their inventory to code.gov (this is where that bulk-import feature helps), rather than rely on them to publish on their own site and pull from there (not all government agency's have up-to-date and convenient sites, and if you provide the platform for receiving the information, it'll probably be easier and faster to get the data than requiring them to sustain their own platform for publishing). Some incentive should be provided to ensure project managers submit this data. Using the data to have a "featured projects" page, might be one way to incentivize timely submissions.

As for fields,

Project URL should be required. It's the single most important piece of information. Everything else can typically be discerned by visiting that URL. If it doesn't have a URL, I'm not sure why it would ever be listed here (unless the intent is to publish metadata about closed-source projects).
POC Email should not be required, because sometimes, the best way to contact an open source project is through the forums, not directly. Additionally, email as a mandatory method of communication is not really future-proof.
Last Updated is a confusing field. Does it refer to the last commit? The last time a user reported a bug? The last mailing list discussion? The last time the metadata was updated? If it's going to be there at all, it should refer to the last update of the metadata. It's not reasonable for projects to update this metadata field every time the project itself has activity, so it's pretty useless for that purpose. It would, however, be useful to see if the metadata is old or not. If it gets used this way, it should be required.
License should be required. This is a pretty important field.
Open Project Status is confusing. Is this metadata intended to index non-open source projects as well? Even if it is, this raises the question of "whose definition of open is being used?". This is also redundant with the License field, because the status is determined solely by the license.
Government-wide Reuse Project Status is also confusing. Why would this ever matter? An agency's intention that their release be reused across government has no bearing on whether or not it will be.
Exemption field may not be useful. I imagine that most things that would exempt it, would also exempt the metadata being requested. As an optional field, I guess it's fine, but it's probably better to simplify things and elminate fields, until a demonstrated value exists. Best to start small and grow bigger, than to start big, and just grow more complex.

niden commented 7 years ago

Adding to what @ctubbsii wrote:

Last Updated could be renamed to something like Updated and become an array that has more information in there such as LastCommitDate, LastMetadataUpdate, LastPullRequest etc.

The Languages should be an array not a comma separated field. It will be easier to index that way IMO

I see little value in supporting CSV or XML. As @rossdakin points out, not offering CSV will point people to the right direction :)

bondsbw commented 7 years ago

Government approval processes often become roadblocks and cause systems and data to become stale and unreliable for their purposes. I fear the same for this effort. As red tape is added, this data could become so dated that nobody finds it useful.

I suggest that Code.gov needs to get in front of this problem before the culture settles. Encourage agencies to push metadata updates as quickly and as often as possible while reducing red tape in these processes. Make the update process responsive by eliminating any approval processes aside from standard security and authorization measures.

I would hate to see all this effort reduced to the usual "I technically did my part" checkbox I find in too many government tasks.

jasonduley commented 7 years ago

Which data format is the best fit: CSV or JSON?

We prefer a JSON based serialization as we have tools inside of NASA to support both open data and internal code sharing that operate seamlessly with JSON. Also, it should be allowable for agencies to extend the base schema to append additional attributes and this is not possible or easily done with CSV.

Is it best to retrieve or ask agencies to submit their inventories?

We would like to mimic the scheme that data.gov uses and post the JSON file on a web server and have it harvested at some interval. Ideally, we'd like to have access to the harvest job admin screen in order to run it manually as well as a dev harvester for performing end of quarter batch loads

As I mentioned on our call today, since the majority of our code is behind the NASA firewall, it would reduce perceived risk and increase NASA's adoption of this policy if URLs are optional. Of course for open source repositories such as the ones NASA maintains here: www.github.com/nasa, we would include the URL fields as they are important in this context. I think title, description and POC are all important for code discovery and setting up potential collaborations between government parties

Schema comments I have, within the Projects array ... VCS should be typed ENUM (avoid confusion between SVN and Subversion for example) pjctTags should be typed an array of strings codeLanguage should be typed an array of strings and potentially ENUM to avoid terminology mismatches (node vs node.js) POCemail should be replaced with POC of type object similar to ... POC: { email: "jason.duley@nasa.gov", name: "Jason Duley" } boolean fields should be true/false

also, from a schema standard we should decide if attributes should be included with NULL values OR if those NULL valued attributed should be omitted.

ddelmoli commented 7 years ago

If considering a JSON format, it may be useful to follow / look at the npm package file format https://docs.npmjs.com/files/package.json

RobertRM commented 7 years ago

And Git Hooks would be a good way to submit this information while pushing to github for projects hosted on that platform.

http://githooks.com/

IanLee1521 commented 7 years ago

Personally, I would prefer not XML for the reason that it isn't as well supported by tools like Jekyll which may be used for the display / web visualization of the data.

Another thought, should the fields / the spelling of the fields be aligned with the type of information that can be grabbed from sources like the GitHub REST API?

This would allow, at least for open / GitHub repos, the ability to absorb all projects by only knowing the organization names. This is something that I am doing for the @LLNL organization to create a software portal, much like what Code.gov will become, at software.llnl.gov.

IanLee1521 commented 7 years ago

Oh, and I also agree with @jasonduley that the ability for agencies to push into the repository would greatly ease the integration of "inside the firewall" code hosting.

jbjonesjr commented 7 years ago

I want to take a second before responding myself to thank @rossdakin for his detailed post above. He did a great job laying out reasoning behind multiple formats and each delivery mechanism. Thank you for taking the time to share that and add to the conversation.

Now, some thoughts in no particular order....

It would be really nice if tags could be a fixed set instead of freeform. I'b be curious if StackOverflow published a list of it's top tags that could seed this project. While rejecting data for incomplete tags is not optimal, dealing with multiple tags that mean the same thing "Subversion, SVN, svn" can make discovery very difficult.
As @IanLee1521 mentioned, this metadata should be derived wherever possible (the GitHub API is great for this). Creating an API process to iterate each repository in the organization, collect the proper information, then "push" it to either code.gov or the agency website would be a pretty low lift (for systems already collecting that information at least).
For @jasonduley,

it would reduce perceived risk and increase NASA's adoption of this policy if URLs are optional.

Can you tell me more of how NASA treats internally-resolvable urls as a risk? I'd think as the govt works towards more inner-sourcing and reuse, that being able to go "to" the code will be a big help.

As we talk about push vs pull, keep in mind what happens when projects are abandoned? Who updates this information then (or if someone wants to register a long-abandoned project)? Might it be helpful to include "expected project period of performance" to provide a hint that a project might be OBE at a later point in time?
Regardless of ingest format, couldn't code.gov due the translation and offer data in many formats? Very Write Once, Read Many...
In terms of push vs pull, would you require the parent agencies website to publish the inventory? So all the DOE labs would be required to roll up to DOE? How about multi-agency organizations? I think it would simplify the data flow if push was decided as the standard. The only issue there is you lose some agency capability to keep on top of the inventory.

jasonduley commented 7 years ago

@jbjonesjr Today, mission-based CM systems that contain flight code, vehicle commands, ground software, etc. and other sensitive projects are not going to allow a firewall exception to government partners and will most likely share "released" source code by re-hosting to neutrally located CM systems outside the NASA internal firewall for government-wide sharing. For the inventory, the URL should be optional for internal source as they live behind the firewall.

IanLee1521 commented 7 years ago

@jasonduley -- Would providing the links, even if they are inaccessible be an issue? It seems like if it were possible to provide the where now, that would assist with identifying where new connections need to be established.

@jbjonesjr -- One other thought is that the number of sources for the metadata we (all) would be scraping is fairly limited... There are only so many tools for hosting code. GitHub.com obviously, but also: GitLab, Bitbucket.org, Bitbucket Server, SourceForge, etc. By deciding on a common format and building tools for scraping that data out of these sources, all of the agencies would be able to contribute collaboratively.

jasonduley commented 7 years ago

@IanLee1521 I think supplying URLs to NASA's internal and tightly secured code projects will cause issues for us. Please note this would be a subset of projects in the inventory and all already released open source would contain URLs.

IanLee1521 commented 7 years ago

Makes sense... For what it's worth, I suspect we would have similar issues @LLNL.

bbrotsos commented 7 years ago

Collected Data

Code.gov should reuse data element names and definitions from project open data metadata schema https://project-open-data.cio.gov/v1.1/schema/ where possible. These are based on W3C http://www.w3.org/TR/vocab-dcat/ and dublin core that has been around for many years. Alternatively, if GitHub, GitLab or other code repository has existing data elements and types, this project could use those fields. Code.gov could reuse the following fields from project open data:

title
description
keyword
contactPoint
publisher
license
landingPage
identifier

An example:

{
     "projects":{
          "title": "Important USDA Code Repository",
          "description": "Creates new automated farms",
          "landingPage": "usda.gov",
          "repositoryURL": "github.usda.gov/automated-farms"
          "softwareLanguage": ["ada", "perl", "cobol"]
          ...

There may be more fields to reuse. I also recommend adding fields which will be good for analytics of what agencies and investments are releasing their code:

bureauCode
programCode

By aligning to these fields names, there is also hope in developing common system for storing data sets, data assets and code repositories. For example, we could potentially create an extension for CKAN or DKAN to also store code repositories. You could also reuse existing documentation.

rough68fish commented 7 years ago

I think it would be a good idea to follow the process established by the data.gov effort as much as possible. Since most agencies have been working on setting that up they should be familiar with json and have processes for creating and maintaining the json data.

Also try not to invent a whole new schema and if possible try to reuse data.gov data descriptions where you are talking about the same thing.

theresaanna commented 7 years ago

That being said, good to make it as easy as possible for folks to get started. We don't want to overwhelm folks.

@mgifford I agree that we should make it easy for folks to get started, but you bring up some valuable data points we might collect. Thanks so much for your feedback. A question that remains for me is whether it's better to have an initial version of the schema that we add onto as agencies feel more comfortable or if it's better to be thorough up front.

thecapacity commented 7 years ago

@theresaanna I think you've got a lot of good material in the above discussion, and may have already seen this from some of my colleagues: https://18f.gsa.gov/2016/08/29/data-act-prototype-simplicty-is-key/

"... One of the earliest decisions our team grappled with centered on the data format we would receive from agencies. ... "

I wanted to augment some of the earlier comments that it definitely seems like an "and" and making one machine-readable format is a good way to validate another (e.g. CSV to validate a "more formal" JSON/XML/... spec).

theresaanna commented 7 years ago

@rossdakin Thank you so much for your thoughtful feedback! You've brought up some great food for thought. I am in agreement with you that a JSON, pull-based system makes the most sense. Some thoughts:

For example, if a record suddenly stops being included in a agency's reported inventory, what does code.gov do? Delete it? Ignore the omission?

My assumption is that code.gov always reflects the most recent version of agency inventories, meaning we'd delete the record. I don't know if this is a good assumption. Are there cases in which we'd want to hold onto old data? I imagine it'll be normal for software to drop out of inventories as it becomes replaced.

Should code.gov assign unique identifiers or require them as a part of inventory submission (to avoid duplication and enable "upserts")?

You bring up a great point. I think that for a first version, given the aggressive timeline the policy lays out, we won't be able to tackle this, however, I will add it to our backlog for addressing in the future. I cringe a little to say that, as this is something we'll want to think about sooner rather than later admittedly.

If one agency does start using code from another agency, how is that represented in the code.gov data model?

That is a fantastic question! I think we will need to have some discussion around how we might represent that - whether it's in the data model or a layer on top of it. Do you see any benefits to having it in the data model?

CynthiaParr-USDA commented 7 years ago

Because new code is often generated in association with research data, we are encouraging data submitters to the Ag Data Commons (https://data.nal.usda.gov) to also submit a pointer and metadata description for their software (which we hope is primarily managed in an open source code repository). Two points to make about this: 1) We have the same POD 1.1 metadata for the software (which we have augmented with a few fields -- see https://data.nal.usda.gov/description-fields-%E2%80%9Cedit-dataset%E2%80%9D-page) 2) We obtain DataCite DOIs for software tools, whether they are registered separately from their data or included as a resource in a data package.

I would encourage processes to align as closely as possible with the existing open data.gov processes. I have no problem with additional value-added metadata.

jecb commented 7 years ago

Apologies if this question has been asked, but has there been discussion around creating a JSON conversion tool similar to the DCOI Strategic Plan: https://datacenters.cio.gov/json-conversion-tool/?

okamanda commented 7 years ago

@jecb and others have brought up making this a tool or process to make generating the code inventories as easy as possible. I think the first step in doing so, is mapping schema fields to some of the web-based repo hosting tools (e.g., github,bitbucket), especially those that have APIs.

To that end, I've put together this table which shows what this might look like.

schema field	github field	bitbucket field
agencyAcronym	given	[given]
projects.vcs	[git]
projects.repoPath	[html_url] or [url]
projects.repoID	[id]
projects.projectURL	[homepage]
projects.projectName	[name] or [full_name]
projects.projectDescription	[description]
projectTags.tag	(?) process/analyze from [description], [name], and [language]
codeLanguage.language	[language]
Updated.LastCommitDate	[updated_at]
Updated.LastMetadataUpdate	[pushed_at] or [updated_at]
Updated.LastPullRequest	grab [updated_at] from [pulls_url]
POCemail	(?)
license	(?) grab/process/analyze from LICENSE.MD/README.MD, etc.
openproject	1
govwideReuseproject	0
closedproject	0
exemption	null

VisionPaul commented 7 years ago

Collected Data Adding the name of the system or platform may help for purchased environments that allow for custom solutions to be developed within. We use both Salesforce and ServiceNow - and many other agencies are using these platforms as well, and it would be great to search and post developed solution sets - especially since they probably already come with some level of A&A.

Maybe "softwareLanguage" as @bbrotsos has listed above would be appropriate usage for this example....

theresaanna commented 7 years ago

@ctubbsii Thank you so much for all of your feedback. You've brought up some great points that are so valuable in helping us think this through. I've replied to much of your comment inline:

Project URL should be required. It's the single most important piece of information. Everything else can typically be discerned by visiting that URL. If it doesn't have a URL, I'm not sure why it would ever be listed here (unless the intent is to publish metadata about closed-source projects).

So, we will be collecting data about presumably many closed-source projects, and so a public URL may not be available.

POC Email should not be required, because sometimes, the best way to contact an open source project is through the forums, not directly. Additionally, email as a mandatory method of communication is not really future-proof.

I agree that it's not very future-proof. My assumption was that agencies would need a way to get in contact with the project maintainers if this inventory were to be useful. However, I'm not sure that's a good assumption. I'm planning to remove it as a required field unless a good argument is made to the contrary.

Last Updated is a confusing field. Does it refer to the last commit? The last time a user reported a bug? The last mailing list discussion? The last time the metadata was updated? If it's going to be there at all, it should refer to the last update of the metadata. It's not reasonable for projects to update this metadata field every time the project itself has activity, so it's pretty useless for that purpose. It would, however, be useful to see if the metadata is old or not. If it gets used this way, it should be required.

Interesting. I had thought of this field as a signifier of the activity of a project, but this would be hard to maintain unless we were pulling project info right from Github or similar. I agree that it would be useful to see when the metadata was last changed. The more I think about this field, though, the less convinced I am that we need it. Until we have a tool to generate the inventory JSON, I imagine this field will fail to be updated with changes, making it unreliable.

License should be required. This is a pretty important field.

Agreed that it is important, but unfortunately not all software will have a license. Along the lines of the suggestion you made about incentives, perhaps there's a way to encourage folks to release code and help them decide which license is right.

Government-wide Reuse Project Status is also confusing. Why would this ever matter? An agency's intention that their release be reused across government has no bearing on whether or not it will be.

There are projects that are built as platforms or to be reused specifically. For example, the eRegulations project. https://eregs.github.io/. This field will allow users to look specifically for these types of projects.

Exemption field may not be useful. I imagine that most things that would exempt it, would also exempt the metadata being requested. As an optional field, I guess it's fine, but it's probably better to simplify things and elminate fields, until a demonstrated value exists. Best to start small and grow bigger, than to start big, and just grow more complex.

Definitely agreed on preferring to start small. @okamanda or @mattbailey0 - I realize I don't fully understand what it means for a project to be exempt. Is this exemption from the open source part of the policy? A related question: If you look at the original schema, it has exemption and closedPjct. Will a closed project always have an exemption? Put another way, can we remove closedPjct and rely on the existence of exemption to indicate that it's closed?

theresaanna commented 7 years ago

Last Updated could be renamed to something like Updated and become an array that has more information in there such as LastCommitDate, LastMetadataUpdate, LastPullRequest etc.

The Languages should be an array not a comma separated field. It will be easier to index that way IMO

@niden these are great suggestions, thank you. I will implement your languages field suggestion - I agree. In the interest of ease of use, I'm thinking we may want to drop the Last Updated field. Though, if we do implement it in the future, this object-based approach would make things clearer. I could see this field being more useful when we provide a JSON generator or can pull data from somewhere like Github.

@okamanda, I'm interested in your thoughts here. Do you see a need for Last Updated that I don't? I worry that it will fail to be updated and then become unreliable data if folks are updating it manually.

okamanda commented 7 years ago

Re: exemption/closedProject

@theresaanna @ctubbsii - Exemption/closedProject reflects the scenario when an agency cannot report the details about a particular code inventory and are relying one of the five exemptions provided for in the policy (e.g., national security risk).

In this scenario, agencies would have to remove or 'blackline' certain fields like projectURL, repoPath, and others (to be determined) prior to publishing/posting their code.json inventory. In addition, they'd have to state the exemption upon which they rely in the 'exemption' field. That way when the code.json inventory is published with the missing fields, the public will be able to see the reason those fields are missing in the exemption field.

I could see an argument that if the exemption field is null, then closedProject should be 'false'. But in any event, the exemption/closedProject fields are closely related.

okamanda commented 7 years ago

Re: Updated, LastUpdated, LastCommit, etc.

@theresaanna @niden I think the value of having some form of timestamp updated field is to show currency. Stale, unattended, outdated code is not particularly helpful to the open source community or the federal agencies. There's merit in giving developers a sense of how much attention a certain repository receives.

Difficulties in implementation notwithstanding, these fields or something like them will be important piece of data in evaluating whether or not to rely on a segment of code.

A good question for us to consider is whether there is an easy proxy that we can use. For repos on popular hosting platforms like Github or Bitbucket, we have a few fields from which to choose. For other private or more obscure repos, this becomes a bit more difficult to find that timestamp. Difficult, but not impossible given that timestamp changes to code is the major reason why we have version control software!

So I'd love to hear from folks here who've wrestled with this problem in some form or another. Are there creative ways to get timestamp info on:

date created
date of last commit
date of last push
date of last modification generally

?

ctubbsii commented 7 years ago

@theresaanna wrote:

...not all software will have a license...

I still think the License field should be required, even if the value is None, Not Available, Not Publicly Licensed, or similar. That way, nobody is left wondering.

afeld commented 7 years ago

I admittedly haven't read the whole thread, but wanted to drop a few links for existing schemas that might be worth looking at:

NoahKunin commented 7 years ago

Obviously not going to self-deal here, but I can represent the position from the Technology Transformation Service front-office that we'd like to go with About YML, or at least some kind of YML...

...beyond my own personal preferences, I simply think it will be more accessible to people in Product roles who want to keep this data up to date.

neilmartis commented 7 years ago

@NoahKunin @afeld @theresaanna I found this created by a former PIF Rob Baker not sure if this can be added to the list by Aidan https://github.com/rrbaker/maker.json

neilmartis commented 7 years ago

maker.json is a schema to promote standards in the information we share about DIY spaces around the world toward fostering further awareness and improving collaboration.

IanLee1521 commented 7 years ago

In this scenario, agencies would have to remove or 'blackline' certain fields like projectURL, repoPath, and others (to be determined) prior to publishing/posting their code.json inventory. -- @okamanda

Interesting... So the intention here is that the code will be acknowledged and named, with a note that it is exempt?

jasonduley commented 7 years ago

hello everyone, I had a few questions I received today during discussions with some NASA stakeholders:

Q1) Does the code inventory we post exclude software the agency has developed prior to Aug 8th and only include "new code" projects formulated after Aug 8th OR Is the list exhaustive in that all software must be accounted for?

I think I know the answer but wanted to document the question so I can pass it along to some folks within the agency (this question stemmed from the "is not retroactive" part in the document)

Q2) Should NASA or other agencies include code projects written via hackathons/challenges (e.g. spaceappschallenge), grants, proposals as part of the inventory?

ctubbsii commented 7 years ago

Expanding on @jasonduley 's question about hackathon, grants, proposals... I'm also curious what granularity this is going for.

If the goal is to capture metadata for all government-produced software, this clearly means some threshold above the script to search my shell history, this saved SQL predicate, vim plugin to format XML the way I like it, script to launch my favorite apps when I log in for the day, or example pseudo-code that accompanies a paper I wrote levels.

Developers write, think, speak, and dream in code, and not all of it attains the level of "this is a named government software project with metadata to inventory", even if it is a produced by a government agency and was given a name by its author.

Up to now, I had assumed that this effort was mainly about inventorying published open source projects. But, with the comments above about inventorying closed-source or unpublished software, this question of granularity really becomes important.

mikecharles commented 7 years ago

Q1) Does the code inventory we post exclude software the agency has developed prior to Aug 8th and only include "new code" projects formulated after Aug 8th OR Is the list exhaustive in that all software must be accounted for?

And if it only includes code after Aug 8th, I assume we can still add all of our older code if we want?

ckaran commented 7 years ago

As a suggestion, let's drop the updated field and replace it with a version field where the value is a string that obeys the Semantic Versioning guidelines. That will allow both people and automated systems to determine how important a change is, which time stamps don't allow.

jbjonesjr commented 7 years ago

I'd give a :+1: (despite implementation details) to @okamanda points about update date. Something that semantic versions or other fields don't tell without other information is how recently the code has been updated (basically a proxy for is the code still maintained).

While not all version controls systems don't provide this data to be derived as easily (a pity), even something as simple as the year of last maintenance would be an improvement and important signal to maintenance status.

jbjonesjr commented 7 years ago

Regarding formats of submitted data (json? csv? xml? yml? ), I would remind that we don't need to find one format to solve all problems and use cases.

Government/Code.gov ingest is a separate problem from discoverability by external users is a separate issue from user presentation is a separate issue from data generation. There are various tools (convertors? formatters? etc) to solve many of these issues.

I'd prioritize (Selfishly) whichever of these formats is most important for you, and let code.gov provide the facilities for other conversions (Pull Requests welcome!)

ckaran commented 7 years ago

@jbjonesjr I see the point that @okamanda was making about avoiding stale projects. However, I'm unaware of any version control system that doesn't store the date when changes were made, so the last update date can always be extracted from the VCS directly. Semantic versioning can help an automated system determine which patches can be applied, and which can't. Going from 1.10.9 to 2.0 means something in my system is going to break. But going from 1.10.9 to 1.10.10 could be handled automatically when my system downloads and applies patches.

jbjonesjr commented 7 years ago

@ckaran totally agree with the value of Semantic Versioning done right. One of my concerns with Semantic Versioning is that the end user (aka, the developers already writing this code) have to be relied on to use semantic versioning correctly. This may not always be a given (sometimes 2.0 is because a contract was recompeted, sometimes its because of breaking api changes, but sometimes it's because of marketing). While even government projects can't screw up the meaning of last commit/update.

The is key because the use case in my mind as a former govt developer and architect, is that I'd go to code.gov to see if there is already a widget that does PetShop aggregation before I build my own. When billions of dollars of government projects are provided with information on code.gov, i expect there to be many PetShop widgets available for my use, many likely from my home agency. Having an update date will help me figure out which widget to use without diving directly into each Repository to get specifics.

Not against Semantic Versioning, but don't think it can replace an update_date.

bondsbw commented 7 years ago

Semantic Versioning and update date have different purposes and serve different needs. I suggest having both.

ckaran commented 7 years ago

@jbjonesjr I agree with you that sometimes there are version bumps solely for marketing and other purposes; however, nothing prevents someone from performing a pointless update to a code base solely to cause the Last Updated field to get updated¹. That said, if we assume that people are generally honest and will not deliberately game the system, then @bondsbw is right that both have their uses. Semantic versioning will tell you how important the change is, while the Last Updated field gives you a clue about the vibrancy of the project. So, I guess I'm now voting for both fields.

[1] I'm assuming here that once a project's URL has been submitted to code.gov, then the servers can automatically look for any updated projects and update their databases accordingly. Computers are lousy at determining which changes are important ones, so this would be a trivial trick for an unscrupulous person to make it appear that their project is getting lots of updates.

rossdakin commented 7 years ago

One thought on the topology.

"Project" here seems synonymous with "repository" — I could see this being confusing when listing projects that have multiple repositories (e.g. a UI, an API, etc.).

Possible mitigations:

use "Repo description" etc. rather than "Project description"
allow multiple repositories per project
leave topology as is but add a field for "related projects"

bandrzej commented 7 years ago

Some feedback, from my personal opinion:

codeLanguage.language
Realize this should be a multiple value field, unless you are going to specify in the description the primary code language in use as a single value.
license
This should be required to point to a README.md or license document within the code repository. This clearly defines the secure rights obtained for the source code that this OMB memo is trying to solve.
openproject vs. closed project
What about combining this into one field, and its values are "Open" or "Closed"
govwideReuseproject
Why would we be listing projects that are not gov wide re-use? Use case?
exemption
Is the plan here to list source code that has exemptions to gov-wide release? How do you plan to deal with FIOAs?

bandrzej commented 7 years ago

+1 for YML per @NoahKunin

It is assumed a developer would do this task, but it is left up to the agency how it is accomplished. It would not surprise me some agencies task their Public Affairs or Security Offices to maintain since it is public facing.

bandrzej commented 7 years ago

Question:

How do you plan to track government contributions to existing public OSS projects that were not started by the government?

GSA / code-gov-web

[Request for Discussion] Software inventory metadata schema and inventory collection #41

Data Format

Collected Data

Proposed required fields:

Proposed optional fields:

Data Format

JSON

XML

CSV

YML

Collection Methodology

CRUD

Collected Data

Relationships / Reuse