Open theresaanna opened 8 years ago
Maybe it was intentional to get a fresh perspective, but it seems like the original discussions on this topic from the policy should be required reading here. See:
Seems like we may be re-hashing many of the same points. In fact, it seems like there are at least four different threads on the metadata schema topic and it's a bit confusing to follow along. Here are the threads I've identified in chronological order:
Where possible, I'd suggest trying to de-duplicate or consolidate these threads, or at least update the first post on each thread to distinguish them if each is meant to serve a distinct purpose.
Thanks @philipashlock. Especially with your helpful write up in place, let's consolidate the discussion here. I'm going to close out the other issues and point folks to this thread, which is the most active overall.
@mattbailey0 -- Would it make sense to start working all of these discussions into the draft guidance, rather than continuing to solely use the issue threads?
What's the status of this schema? There's documentation on code.gov that seems somewhat final, but there seem to be a number of important points that haven't been addressed, this issue is still open, and the code.gov site somewhat confusingly says that both the publication of the metadata schema and implementation of the schema by agencies are due December 9th (referring to Section 7.2).
Whether or not this is final or it's possible to make some minor updates, I would suggest creating some expectations or provisions for a revision within a year or so after there's sufficient experience and feedback from those who have implemented and consumed it. We did this with the Project Open Data Metadata schema and the 1.1 update not only allowed us to address issues that had come up, but to also fully align it with the international standard established by voluntary consensus bodies (DCAT). It's understandable that there was a short timeline to establish this schema, but we don't want to create the impression that this draft will be locked in for perpetuity.
One of the ways we addressed this with the Project Open Data schema was that, in the v1.1 update, we required implementors to explicitly state the version of the schema at the top of the file.
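To illustrate what a consumer gains from an explicit version declaration, here is a minimal Python sketch. The `version` key and the set of supported values are assumptions made for illustration (Project Open Data v1.1 itself declares its version through a `conformsTo` URI); nothing here is taken from a finalized code.json schema.

```python
# Illustrative values only; not taken from any published schema.
SUPPORTED_VERSIONS = {"1.0", "1.1"}


def schema_version(catalog: dict) -> str:
    """Return the declared schema version, or raise if it is missing or unknown."""
    # "version" is a hypothetical top-level key used for this sketch.
    version = catalog.get("version")
    if version is None:
        raise ValueError("catalog does not declare a schema version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported schema version: {version!r}")
    return version
```

A harvester could call `schema_version` first and then pick the matching validation rules, which is what makes a later revision of the schema manageable.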
While it may not seem like the development of this schema is part of a standards making process, it really is if agencies are required to follow it. OMB A-119 sets out basic requirements for the use of standards in government, specifically "this Circular directs agencies to use voluntary consensus standards in lieu of government-unique standards except where inconsistent with law or otherwise impractical." In other words, government should avoid creating government-specific standards unless it has a good reason to do so. Avoiding reinventing the wheel also meets the spirit of reuse set out in this policy. With that in mind, it would be good to review existing standards and document why or why not they are practical to use here.
A number of existing schemas and specifications have been raised in this discussion, including the Asset Description Metadata Schema for Software (ADMS.SW) used by federated national software catalogs across Europe - which integrates much of the DCAT vocabulary used for the Project Open Data data.json schema, the civic.json schema (with various flavors that have been used or proposed by the civic tech community in the U.S. including BetaNYC, Code for America, and DC Government), the Schema.org SoftwareSourceCode and SoftwareApplication schemas which appear to be implemented by a relatively small number of websites (10 and less than 50,000 respectively), and the NIST specification for Asset Identification, which I think is mostly used to describe software in an operational environment rather than as an autonomous asset ready for reuse.
The current schema appears to be largely based on the civic.json specification. The pros are that it has already been developed by the community and it's relatively simple. The cons are that it's not clear that it has been widely used, well documented, or even proposed consistently enough to enable interoperability.
The ADMS.SW specification seems like the most robust standard aligned with the needs of Code.gov. The pros are that it has been developed through formal voluntary consensus bodies, is thoroughly documented, aligns with the DCAT schema used for the open data policy, and is implemented in a federated way by European government bodies just as needed by U.S. federal agencies. The con is that it appears overly complex, with very dense documentation. You can see a full PDF copy of the ADMS.SW spec here (copied from here) and a presentation about it here.
The Schema.org schemas are fairly simple, well documented, and developed through a voluntary consensus process. One of the biggest pros is that these are supported by the major search engines which means that they should be indexed by search engines and that's the most likely way people will find software (not on code.gov). The con is that these are not yet well adopted, at least not SoftwareSourceCode, and the search engines do not yet appear to be doing anything special to index these. However, it's totally possible to implement one of the schemas mentioned above while also implementing a schema.org schema, but you'll want to be sure there's a good mapping between the two. We did this with the Project Open Data metadata schema, but it was fairly easy because the POD schema is merely an extension of DCAT and the schema.org Dataset schema was explicitly based on DCAT. None of the major search engines were doing anything special by indexing the schema.org Dataset schema when it was first implemented on Data.gov, but Google is now working on this more and expanding the Dataset schema for the way Google wants to index things like Science Datasets and I think we can expect something similar to happen with software.
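As a rough sketch of what publishing a schema.org representation alongside the inventory could look like, the Python function below converts one project entry into a `SoftwareSourceCode` JSON-LD document. The input keys (`name`, `description`, `repository`, `languages`, `license`, `tags`) follow the draft code.json property names discussed in this thread, and the pairing with schema.org properties is my own assumption rather than an agreed mapping.

```python
import json


def to_schema_org(project: dict) -> dict:
    """Build a schema.org SoftwareSourceCode document from one project entry."""
    doc = {
        "@context": "https://schema.org",
        "@type": "SoftwareSourceCode",
        "name": project["name"],
        "description": project.get("description"),
        "codeRepository": project.get("repository"),
        "programmingLanguage": project.get("languages"),
        "license": project.get("license"),
        "keywords": project.get("tags"),
    }
    # Drop properties the entry did not provide instead of emitting nulls.
    return {key: value for key, value in doc.items() if value is not None}


if __name__ == "__main__":
    example = {"name": "example-project", "repository": "https://example.gov/repo"}
    print(json.dumps(to_schema_org(example), indent=2))
```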
So while it seems like a fairly final decision to develop something new based on the civic.json schema, I think it's worth considering whether more could be done to leverage the work that's gone into ADMS.SW, to reuse the elements in DCAT already used by the open data policy, to align with a formal voluntary consensus standard, and to allow for interoperability with the federated European software catalog. That said, more should be done to provide a simplified profile of ADMS.SW and to better understand the pros and cons of ADMS.SW in practice. We did this with POD v1.1 and DCAT by working with W3C to make data.json a formal representation of DCAT with JSON-LD and I think we found a good compromise. When POD v1.0 was developed, it was mostly aligned with DCAT, but DCAT had not been finalized. POD v1.1 is now compatible with DCAT and a large portion of national data catalogs around the world use DCAT. The European Union uses DCAT as the basis for their federated Europe-wide data catalog.
And even where an existing specification isn't fully packaged to meet all the needs here, you can still assemble fields from existing vocabularies. This allows for field level interoperability and can ensure you reuse properties that are already well defined rather than coin new ones that are vague or inconsistent.
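One way to picture field-level reuse is a JSON-LD style context in which each short field name resolves to a property in an existing vocabulary. The sketch below is only an illustration: the choice of which field maps to which term is an assumption, although the vocabularies themselves (Dublin Core Terms, DOAP, schema.org) are real.

```python
# Namespace IRIs for the vocabularies being reused.
PREFIXES = {
    "dcterms": "http://purl.org/dc/terms/",
    "doap": "http://usefulinc.com/ns/doap#",
    "schema": "https://schema.org/",
}

# Hypothetical selection of fields, each borrowed from an existing vocabulary.
FIELD_TERMS = {
    "name": "doap:name",
    "homepage": "doap:homepage",
    "license": "dcterms:license",
    "identifier": "dcterms:identifier",
    "programmingLanguage": "schema:programmingLanguage",
}


def expand(term: str) -> str:
    """Expand a prefixed term such as 'doap:name' into a full IRI."""
    prefix, _, local = term.partition(":")
    return PREFIXES[prefix] + local


def build_context() -> dict:
    """Return a context mapping each short field name to a full property IRI."""
    return {field: expand(term) for field, term in FIELD_TERMS.items()}
```

Because every field resolves to a published property, two catalogs that use different packaging can still be compared field by field.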
In the meantime, here's some feedback on specific fields (some of this reiterates or emphasizes John's comments)
`agency` - There are no official or consistent acronyms for government agencies in the federal government. To ensure consistency, you'll have to use a unique identifier like we did with Project Open Data. We primarily used `bureauCode`, but GSA is also working on a more universal unique identifier system for agencies. Additionally, ideally this field would not be government specific. I would also suggest that this field be associated with each project entry rather than with the whole catalog as this will allow the metadata to be more easily mixed and aggregated across multiple sources without losing this important data.
`organization` - For Project Open Data we allowed folks to use the `publisher` field to optionally provide the context of where the office sits in the agency by indicating some level of hierarchy. I would also suggest that this field be associated with each project entry rather than with the whole catalog as this will allow the metadata to be more easily mixed and aggregated across multiple sources without losing this important data.
`openSourceProject` - This seems somewhat redundant with license. The policy defines open source as anything meeting the Open Source Definition and OSI has a list of licenses that meet that definition, so this field could just be derived from the license. Even if you feel the need to keep it here, I'd make it explicit that this means the code is licensed (or unlicensed) in a way that meets the OSD. It's also worth noting that OSI has not accepted CC0 as meeting the definition, but does recognize the public domain status of U.S. Government Works. This is a topic that should be discussed and debated further, but it might be worth considering whether it's better to use the usa.gov URL for U.S. public domain as defined by Project Open Data rather than assert international public domain with CC0 like we suggest for datasets. The difference in these use cases, as explained by OSI, has to do with patent rights which are relevant for software, but not data. Additionally, this field should use a boolean (`true` or `false`), not an integer, since the boolean datatype is intended specifically for this purpose and is more human readable.
`governmentWideReuseProject` - This should be renamed so it's less government-specific, e.g. `designedForReuse`, and it should use a boolean (`true` or `false`), not an integer, since the boolean datatype is intended specifically for this purpose and is more human readable.
`languages` - This should make it clear that it's referring to the code language rather than human language. In ADMS.SW, DCAT, and Schema.org, `language` is used to refer to the human language used by the asset, whereas schema.org uses a term like `programmingLanguage` on their `SoftwareSourceCode` schema to be clear they're referring to code not content. This should also be singular, not plural, regardless of whether the data type is singular or not.
`exemption` - I'd suggest making this more explicit, like `reuseExemption`, and using a more human readable controlled vocabulary for the exemption reasons rather than integers.
`identifier` - It's important to try to establish a globally unique identifier for each project since many other fields will change and it will be hard to track the entry without a unique identifier. Data.gov uses the `identifier` field to know when an entry has been added or removed rather than updated. This field should be globally unique using a URI to avoid collisions from different catalogs when aggregated from multiple sources. This should be a required field.
`provenance` or `source` - In the spirit of reuse, it'd be helpful to know whether this codebase was forked or otherwise derived from a separate upstream codebase. This could be the URI of the unique identifier or the URL of the project.
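The field suggestions above can be turned into simple automated checks. The Python sketch below is a minimal illustration, not a proposed validator: it flags integer flags where booleans are recommended and checks that `identifier` is present and at least has a URI scheme. The function name and the messages are hypothetical, and the scheme check is deliberately loose.

```python
from urllib.parse import urlparse

# Fields that the feedback above recommends expressing as true/false.
BOOLEAN_FIELDS = ("openSourceProject", "governmentWideReuseProject")


def field_problems(project: dict) -> list:
    """Return human readable problems found in a single project entry."""
    problems = []
    for field in BOOLEAN_FIELDS:
        # isinstance(1, bool) is False, so integer flags are reported.
        if field in project and not isinstance(project[field], bool):
            problems.append(f"{field} should be true or false, got {project[field]!r}")
    identifier = project.get("identifier")
    if not isinstance(identifier, str) or not identifier:
        problems.append("identifier is required")
    elif not urlparse(identifier).scheme:
        # Loose check: only confirms a scheme such as https: or urn: is present.
        problems.append("identifier should be a URI")
    return problems
```

For example, an entry with `"openSourceProject": 1` and a bare project name as its identifier would be reported twice, once per recommendation.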
I recommend JSON for many of the reasons others have stated. It has worked for Project Open Data data.json and we have built out the infrastructure to validate and harvest in this format. JSON-LD is also now the format recommended by Google for schema.org schemas and other structured data on webpages. Some have suggested YAML as an alternative because it's more human readable and easy for folks to edit, but this also means it's more likely to result in poor or inconsistent data quality for any data structure with even moderate complexity. With the initial implementation of the Project Open Data data.json schema, many folks attempted to maintain their JSON metadata by hand, and this resulted in the majority of the problems we encountered with regard to harvesting and interoperability. I would strongly suggest that we do not rely on a structured data format that is edited by hand, but agencies are free to allow for this upstream as long as they validate it when compiling their aggregate copy. It's worth noting that JSON is actually a subset of YAML, so agencies could allow either YAML or JSON from individual offices if they're using a YAML parser, but they'll still have to validate it against the final JSON schema requirements and provide a comprehensive JSON version.
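To make the "validate it when compiling the aggregate copy" step concrete, here is a minimal Python sketch using only the standard library. The top-level `projects` list and the required field names are assumptions for illustration, and the hand-written checks stand in for validation against a real JSON Schema.

```python
import json

# Hypothetical required fields; the real list would come from the final schema.
REQUIRED_PROJECT_FIELDS = ("name", "description", "license")


def load_inventory(text: str) -> dict:
    """Parse an aggregated inventory and reject it if basic requirements fail."""
    try:
        inventory = json.loads(text)
    except json.JSONDecodeError as err:
        # Hand-edited files often fail here (trailing commas, missing quotes).
        raise ValueError(f"not valid JSON: {err}") from err
    projects = inventory.get("projects") if isinstance(inventory, dict) else None
    if not isinstance(projects, list):
        raise ValueError('expected a top-level "projects" list')
    for index, project in enumerate(projects):
        if not isinstance(project, dict):
            raise ValueError(f"project {index} is not an object")
        missing = [name for name in REQUIRED_PROJECT_FIELDS if name not in project]
        if missing:
            raise ValueError(f"project {index} is missing {missing}")
    return inventory
```

Offices could still maintain YAML or JSON upstream; the point of the sketch is that nothing is published until the compiled JSON passes a check like this.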
I've attempted an initial mapping between code.json and ADMS.SW. Note that ADMS.SW follows the same conceptual model as DCAT used for Project Open Data data.json:
To ensure that the Data Catalog Vocabulary (DCAT), the Asset Description Metadata Schema (ADMS), and the Asset Description Metadata Schema for Software (ADMS.SW) are seeded on the same structure, the RADion vocabulary was created [RADion]. RADion is shorthand for Repository, Asset, and Distribution – the three structural elements that RADion abstracts from.
In ADMS.SW, the concepts Software Repository, Software Release and Software Package are defined as specialisations of the more general concepts Repository, Asset and Distribution specified by RADion
To clarify these relationships, I created a visual diagram similar to the Schema Object Model Diagram provided for the Project Open Data version of DCAT, but this diagram includes all the fields provided by ADMS.SW rather than being pared down to just the required, optional, and extended fields as is the case with the POD diagram.
The property mapping and descriptions here are based on the full ADMS.SW documentation PDF and the HTML version of the RDF schema. I would refer to those documents for full property definitions. Also note that some of the properties here are synonymous with those in DCAT even if they use a different property name or namespace.
A Software Repository is a system or service that provides facilities for storage and maintenance of descriptions of Software Projects, Software Releases and Software Packages, and functionality that allows users to search and access these descriptions. A Software Repository will typically contain descriptions of several Software Projects, Software Releases and related Software Packages.
An example of a Software Repository is the Apache Software Foundation Project Catalogue
ADMS.SW Property | ADMS.SW Label | Namespace:Property | code.json Property |
---|---|---|---|
accessURL | Access URL | adms:accessURL | |
created | Date of Creation | dcterms:created | |
modified | Date of Last Modification | dcterms:modified | |
description | Description | dcterms:description | |
label | Name | rdfs:label | |
supportedSchema | Supported Schema | adms:supportedSchema | |
hasPart | Includes | dcterms:hasPart | |
publisher | Publisher | dcterms:publisher | agency or organization |
spatial | Spatial Coverage | dcterms:spatial | |
themeTaxonomy | Theme Taxonomy | rad:themeTaxonomy | |
A Software Project is a time-delimited undertaking with the objective to produce one or more software releases, materialised as software packages. Some projects are long-running undertakings, and do not have a clear time-delimited nature or project organisation. In this case, the term ‘software project’ can be interpreted as the result of the work: a collection of related software releases that serve a common purpose.
An example of a Software Project is the Apache HTTP Server Project
ADMS.SW Property | ADMS.SW Label | Namespace:Property | code.json Property |
---|---|---|---|
description | Description | doap:description | project.description |
homepage | Homepage | doap:homepage | project.homepage |
keyword | Keyword | rad:keyword | project.tags |
name | Name | doap:name | project.name |
release | Release | doap:release | |
contributor | Contributor | schema:contributor | project.partners |
fundedBy | Funded By | admssw:fundedBy | project.partners |
forkOf | Fork Of | admssw:forkOf | |
developer | Developer | doap:developer | project.partners |
documenter | Documenter | doap:documenter | project.partners |
maintainer | Maintainer | doap:maintainer | project.contact |
helper | Helper | doap:helper | project.partners |
tester | Tester | doap:tester | project.partners |
translator | Translator | doap:translator | project.partners |
metrics | Metrics | admssw:metrics | |
theme | Theme | rad:theme | |
intendedAudience | Intended Audience | admssw:intendedAudience | project.governmentWideReuseProject |
locale | Locale | admssw:locale | |
userInterfaceType | User Interface Type | admssw:userInterfaceType | |
programmingLanguage | Programming Language | admssw:programmingLanguage | project.languages |
isPartOf | Repository Origin | dcterms:isPartOf | project.repository |
operatingSystem | Operating System | schema:operatingSystem | |
supportsFormat | Supports Format | admssw:supportsFormat | |
status | Status | admssw:status | project.status |
A Software Release is an abstract entity that reflects the intellectual content of the software at a particular point in time and represents those characteristics of the software that are independent of its physical embodiment. This abstract entity corresponds to the FRBR entity expression (the intellectual or artistic realization of a work). A release is typically associated with a version number.
An example of a Software Release is the Apache HTTP Server 2.2.22 (httpd) release.
ADMS.SW Property | ADMS.SW Label | Namespace:Property | code.json Property |
---|---|---|---|
alternative | Alternative Name | dcterms:alternative | |
created | Date of Creation | dcterms:created | |
modified | Date of Last Modification | dcterms:modified | project.updated.sourceCodeLastModified |
description | Description | dcterms:description | |
identifier | Identifier | admssw:identifier | |
keyword | Keyword | rad:keyword | |
metadataDate | Metadata Date | adms:metadataDate | project.updated.metadataLastUpdated |
name | Label | rdfs:label | |
revision | Version | doap:revision | |
releaseNotes | Version Notes | schema:releaseNotes | |
assessment | Assessment | admssw:assessment | |
contactPoint | Contact Point | adms:contactPoint | project.contact |
includedAsset | Included Asset | admssw:includedAsset | |
metrics | Metrics | admssw:metrics | |
language | Language | dcterms:language | |
logo | Logo | foaf:logo | |
describedBy | Main Documentation | wdrs:describedby | |
metadataLanguage | Metadata Language | adms:metadataLanguage | |
last | Current Version | xhv:last | |
next | Next Version | xhv:next | |
prev | Previous Version | xhv:prev | |
project | Project | admssw:project | |
publisher | Publisher | dcterms:publisher | |
relation | Related Asset | dcterms:relation | |
relatedWebPage | Related Web Page | adms:relatedWebPage | |
package | Package | admssw:package | |
isPartOf | Repository Origin | dcterms:isPartOf | project.repository |
spatial | spatial coverage | dcterms:spatial | |
status | Status | admssw:status | |
theme | Theme | rad:theme | |
usedBy | Used By | admssw:usedBy | |
A Software Package represents a particular physical embodiment of a Software Release, which is an example of the FRBR entity manifestation (the physical embodiment of an expression of a work). A Software Package is typically a downloadable computer file (but in principle it could also be a paper document) that implements the intellectual content of a Software Release. A particular Software Package is associated with one and only one Software Release, while all Packages of an Asset share the same intellectual content in different physical formats.
An example of a Software Package is httpd-2.2.22.tar.gz, which represents the Unix Source of the Apache HTTP Server 2.2.22 (httpd) software release.
Software often has at least two kinds of physical embodiments: a source code package and a binary package. Binary packages are sometimes compiled for different operating systems or are released under different licences, e.g. in case of dual licensing. Also scripting languages need some sort of packaging for installation systems used by end users.
ADMS.SW Property | ADMS.SW Label | Namespace:Property | code.json Property |
---|---|---|---|
created | Date of creation | dcterms:created | |
modified | Date of last modification | dcterms:modified | |
description | Description | dcterms:description | |
label | Name | rdfs:label | |
software_id | Software_id | swid:software_id | |
tagURL | Tag URL | admssw:tagURL | |
fileSize | File size | schema:fileSize | |
checksum | Checksum | spdx:checksum | |
format | Format | dcterms:format | |
license | License | dcterms:license | project.license |
downloadUrl | Download URL | schema:downloadUrl | project.downloadURL |
release | Release | admssw:release | |
publisher | Publisher | dcterms:publisher | |
status | Status | admssw:status | |
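The mapping in the tables above can also be expressed as a small lookup, which makes it easy to see how much of a code.json entry survives a translation to ADMS.SW terms. This Python sketch only includes rows where the tables give a code.json property, and where a property appears more than once (for example `project.contact`) it picks a single target; that choice, like the helper names, is an assumption for illustration.

```python
# code.json path -> namespaced ADMS.SW property, taken from the tables above.
CODE_JSON_TO_ADMSSW = {
    "project.name": "doap:name",
    "project.description": "doap:description",
    "project.homepage": "doap:homepage",
    "project.tags": "rad:keyword",
    "project.contact": "adms:contactPoint",
    "project.languages": "admssw:programmingLanguage",
    "project.repository": "dcterms:isPartOf",
    "project.status": "admssw:status",
    "project.license": "dcterms:license",
    "project.downloadURL": "schema:downloadUrl",
    "project.updated.sourceCodeLastModified": "dcterms:modified",
    "project.updated.metadataLastUpdated": "adms:metadataDate",
}


def lookup(project: dict, dotted_path: str):
    """Follow a path like 'project.updated.x' into a nested project dict."""
    value = project
    for key in dotted_path.split(".")[1:]:  # skip the leading "project"
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def to_admssw_terms(project: dict) -> dict:
    """Rename the mapped code.json properties to ADMS.SW property names."""
    renamed = {}
    for path, term in CODE_JSON_TO_ADMSSW.items():
        value = lookup(project, path)
        if value is not None:
            renamed[term] = value
    return renamed
```

Anything in the entry that has no row in the lookup is silently dropped, which is a quick way to spot code.json fields with no ADMS.SW counterpart.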
As @philipashlock noted on November 7, `CC0` is not considered Open Source by the OSI. Appendix A defines Open Source Software as:
Open Source Software (OSS): Software that can be accessed, used, modified, and shared by anyone. OSS is often distributed under licenses that comply with the definition of “Open Source” provided by the Open Source Initiative (https://opensource.org/osd) and/or that meet the definition of “Free Software” provided by the Free Software Foundation (https://www.gnu.org/philosophy/free-sw.html).
The first part suggests to me that anything under the `CC0` license is considered to be Open Source, but the second part suggests that it isn't. What is the official consensus? How should we mark a project's `openSourceProject` key in their `code.json` file if they are using the `CC0` license?
Part of my concern is how automated tools will handle the `openSourceProject` key for metrics purposes; if `CC0` is not considered to be Open Source, then quite a few agencies will not be able to meet the 20% requirement, even though they are putting their code out there for others to use.
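To show how much the answer matters for automated metrics, here is a small Python sketch that computes the share of open source projects under both interpretations. It is a simplification built on assumptions: it counts projects rather than measuring code, it assumes the license is recorded as the SPDX identifier `CC0-1.0`, and the function name is hypothetical.

```python
def open_source_share(projects: list, cc0_counts_as_open: bool) -> float:
    """Fraction of projects counted as open source under a given CC0 policy."""
    if not projects:
        return 0.0

    def is_open(project: dict) -> bool:
        # Assumption: CC0 projects are identified by their SPDX license id.
        if project.get("license") == "CC0-1.0":
            return cc0_counts_as_open
        # Accepts either the draft integer flag (1/0) or a boolean.
        return bool(project.get("openSourceProject"))

    return sum(1 for project in projects if is_open(project)) / len(projects)
```

With an inventory where the only released projects use CC0, the same data lands on either side of the threshold depending on a single policy decision, which is why an official answer is needed.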
@ckaran CC0 is addressed in OSI's FAQs. No decision was made by OSI whether it meets their definition of "Open Source". However, it would be useful to know what definition of "Open Source" to use when completing the `openSourceProject` field. Does `code.gov` offer a definition for the purpose of this field? Personally, I'd prefer deferring to OSI's definition, but if OSI can't reach a decision on CC0, then their definition is insufficient.
My personal recommendation is to avoid "traps" like CC0, where it's "open" with respect to copyright, but patent use rights are explicitly not conferred. MIT and BSD avoid the question entirely (not explicitly conferred), and GPL tends to impose restrictions on consumers that I don't think the government should be in the business of imposing, so I prefer ASL 2.0, myself, for government-released open source projects. (ASL 2.0 also provides a convention to use a NOTICE file for copyright notices, separate from the license, where it would be appropriate to add a brief text noting the license is not applicable domestically for the portions of code produced exclusively by government employees on behalf of the U.S. government.)
cc/ @benbalter if you have specific thoughts to share:
@ctubbsii You're right about the problems of patents, etc. with regard to CC0. The lab I work for has been working to avoid the problem by requiring all external contributors to sign a contributor license agreement (CLA) before their contributions will be included in any of the lab's projects (you can read the policy here). The lab's lawyers believe that will solve the issues directly related to patents and other IP rights.
Note that the policy was adopted by the lab on 19 Dec 2016, but it already has one issue; we can't currently post our CLA, nor can we accept CLAs at the current time as (by design) an executed CLA will contain what can be argued to be personally identifiable information (PII). The lawyers I've talked to tell me that means the lab must obey the Privacy Act, which requires some more work. So, if you read the policy and expect that we'll be able to start accepting contributions immediately, I'm sorry to say that we can't.
@ckaran Another great thing about ASL 2.0, it contains an embedded CLA, and defines "Contributor" and "Contributions". No need for a separate CLA :smile:
@ctubbsii Honestly, if we could, I would recommend the standard OSI-approved licenses, including the ASL 2.0 for exactly that reason. Unfortunately, most of the work produced by my lab doesn't have copyright attached, which means that copyright-based licenses may fail in court.
@ckaran Not sure what you mean by "doesn't have copyright attached". My guess is that you mean "public domain" or you simply mean that nobody is interested in asserting copyright. If it's the former, it probably only applies domestically. The creators still may own copyright internationally, so a license is still worth recommending. If you mean the latter, well, omission of a copyright notice does not disclaim copyrights.
If a work isn't covered by copyright (because it's public domain, for instance), an infringement claim would certainly fail in court... but I'm not sure why that matters. That only matters if the creators intend to enforce/assert their copyright claims in the face of a particular infringement, via a lawsuit. If you know the work is public domain in the jurisdiction where the violation occurred... simply don't pursue it with a lawsuit in that case... it's really as simple as that.
The license still communicates the limitations of the rights granted in jurisdictions where copyright is applicable (who cares if it's void in jurisdictions where it's not applicable?) and communicates a minimum set of rights guaranteed to everybody else. This instills confidence in the project's users, allowing them to use it according to the license conditions without fear of reprisal. Often (as in the case of ASL 2.0), it also explicitly conveys the rights the project expects contributors to grant, in order for the contributions to be accepted into the project (in other cases, this might be implicit). This is valuable to a project, even if some portions of the project are not subject to copyright protections (public domain).
@ctubbsii Sorry, I've been talking with our legal counsel for too long. Yes, I mean works that are in the public domain. I've talked with the appropriate people in the Justice Department to see if US Government works have copyright outside of the US. They told me that the US Government's position is that it does, but the lawyer I spoke with wasn't able to find any case law to back that up. What's more, it would have to be litigated in the courts of each nation individually, so there isn't a single 'right' answer.
As for why all this is important, it comes down to severability and warranty/liability. Assume that some Government work is licensed under the Apache License 2.0, which is a license that depends on copyright. Someone can sue the Government claiming that the clauses that depend on copyright are void, and (because there is no severability clause), so are all the other clauses. If a court agrees that the license as a whole is void simply because the US Government doesn't have copyright within the US, then that includes the clauses regarding warranty and liability, which means that the Government might be on the hook for damages in some manner[1]. Moreover, downstream users/projects may also have problems[1].
For works that have copyright and are contributed to the Government, I think that the Government would be OK with any of the standard OSI-approved licenses. However, work that is created by Government employees might be in the public domain, so then you have a weird mix of stuff that is protected by the license, and stuff that might not be[1]. Will this cause an issue? I don't know, but I'm not interested in finding out.
[1] I'm not a lawyer, this is not legal advice, and as far as I know, this has not yet been litigated in a court.
@ckaran Oh, I see. Perhaps code.gov should fork ASL 2.0 (which is permitted) and add a severability clause. (Note: I'm currently promoting a discussion on the Apache Mailing Lists about adding this in some future version of the license, perhaps 2.1).
@ctubbsii I've thought about forking it, but that could also start to fork Open Source (there will be questions about which licenses are compatible with other licenses, which could be problematic; @massonpj, is this a good assessment?)
@ctubbsii I've seen your discussions on the ASL lists; I think that is the best way to go. Not only could everyone (Government and private) use the same license, it would also mean that the license is OSI-approved, which the forked license may not be. The reason this is important is because some journals will only accept code that is under an OSI-approved license; JOSS is one of them. See the discussion here for some of the issues.
Basically, what I want are modifications to the standard Open Source licenses that ensure that works that don't have copyright attached have all the following:
[1] Public domain code by definition doesn't have copyright protections, but in a mixed work that has some copyrighted material and some public domain material, the copyrighted material should not be effectively reduced to being public domain; if that was what the authors had intended, then they would have put it in the public domain. That means that the license has to be inherently flexible enough to handle this case. IP protections mean that public domain work doesn't get hammered by patent headaches from contributions.
@ckaran
> is this a good assessment

Yes.
@ctubbsii, while anyone can create their own license, the OSI's License Review Process, "ensures that licenses and software labeled as 'open source' conforms to existing community norms and expectations." Simply creating a new license and labeling it an "open source software license" is not good.
@massonpj Obviously, any new license should be approved by both FSF and OSI. The biggest issue I think OSI would have is seeing it as "duplicative" if it's too similar.
Part of the Federal Source Code Policy requires that federal agencies make available an inventory of metadata describing their custom software. We’re exploring ways for agencies to provide their inventories. We want to implement a solution that works well for agencies and we need your help to do that.
The Federal Source Code Policy describes code.gov as “the primary discoverability portal for custom-developed code intended both for Government-wide reuse and for release as OSS.” The inventory data that agencies provide will be made available through code.gov. The data we collect should make it possible for agencies to find projects relevant to their needs.
There are two primary areas we see where decisions need to be made: the data format and what data is collected.
Data Format
The two options we are considering are CSV and JSON. The assumed benefit to a CSV-based approach is that it is easier for agencies to create and maintain a CSV than JSON. With this approach, we might create a system for agencies to submit their inventory CSV. With a JSON-based approach, we might ask agencies to make the “inventory.json” available on their website and we would have a system to retrieve inventories as they change. One drawback to JSON is that it is more effort to maintain, takes specialized knowledge, and we may need to provide a tool to build the JSON. On the other hand, JSON is easy to work with programmatically and matches what Data.gov does, meaning many agencies have some familiarity with the process that inventory updating would entail.
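The two options do not have to be mutually exclusive: a spreadsheet-friendly CSV could be converted into the JSON inventory by a small tool. The Python sketch below illustrates the idea with the standard library only. The column names, the semicolon-separated `tags` convention, and the top-level `projects` key are all assumptions made for this example.

```python
import csv
import io
import json


def csv_to_inventory(csv_text: str) -> str:
    """Convert a CSV inventory (one project per row) into a JSON document."""
    projects = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        project = dict(row)
        # Assumption: multi-valued tags are stored as "a; b; c" in one cell.
        raw_tags = row.get("tags") or ""
        project["tags"] = [tag.strip() for tag in raw_tags.split(";") if tag.strip()]
        projects.append(project)
    return json.dumps({"projects": projects}, indent=2)


if __name__ == "__main__":
    print(csv_to_inventory("name,tags\nexample-project,maps; api\n"))
```

A tool along these lines would let agencies keep maintaining a CSV while still publishing a machine-friendly JSON file for harvesting.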
The unanswered questions on data format are:
Collected Data
In either data format, we need to determine what data we will collect. Below is a list of fields we are considering accepting.
Proposed required fields:
Proposed optional fields:
For an idea of what the data might look like, we have an early draft of a schema with example content: (https://gist.github.com/theresaanna/a82bfb39b64362bca04e4644706b0ce4)
The questions that we are looking to answer here are:
Thanks for your feedback! It’s crucial for us in meeting our goal of providing a system and schema that are easy to use and meet agencies' needs.