fair-research / bdbag

Big Data Bag Utilities
https://fair-research.org
Apache License 2.0

FAIR Protocol Buffer? #17

Open krobasky opened 6 years ago

krobasky commented 6 years ago

I see this repo is under 'fair-research' - has anybody started on defining a FAIR protocol buffer?

mikedarcy commented 6 years ago

Apologies, but it is not clear to me what "FAIR protocol buffer" is supposed to mean in the context of the bdbag software. Would it be possible for you to provide some more detail or reference material?

krobasky commented 6 years ago

Hi Mike - perhaps my question is misplaced. It relates to the metadata requirements on a bdbag for enabling FAIRness, e.g., provenance, unique identifiers, keywords, licensing, that sort of thing. Thoughts?

ianfoster commented 6 years ago

The (BD)Bag specification describes a container: it is silent on many of the issues raised in the FAIR principles, like data licenses and vocabularies. However, the metadata directory provides a natural place to address those issues. We can, for example, include Research Object (RO) metadata: see https://github.com/fair-research/bdbag/blob/master/profiles/bdbag-ro-profile.json. (See https://n2t.net/minid:b9dt2t for an example of a BDBag that includes simple RO metadata.)

As Carl Kesselman noted in a recent email exchange, one could address the licensing issue, for example, by:

  1. Adding the actual license text as an asset in the BDBag and making it accessible either in the data directory or via fetch.txt
  2. Using the key/value metadata in the BDBag to associate a license URI or PID with the bag. We could easily extend the profile for BDBag to include this. Extending the key/value metadata is a standard part of the BagIt spec, so this is entirely acceptable.
  3. Specifying the license as additional research object metadata that you associate with an asset (i.e. file) along with the other file-specific attributes, such as the file type from OBI.

If such conventions are defined, we can integrate them into the BDBag tools.
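As a rough illustration of option 2 above, the sketch below adds a license tag to the text of a bag's bag-info.txt, which holds the key/value metadata as "Label: value" lines. The tag name `License-URI` and the helper `add_license_tag` are hypothetical; no such convention has been defined for BDBag. The sketch also assumes single-line tag values, and in a real bag the tag manifests would need to be regenerated after editing the file.

```python
# Sketch only: "License-URI" is a hypothetical tag name, not part of any
# published BDBag or BagIt profile.
LICENSE_TAG = "License-URI"


def add_license_tag(bag_info_text: str, license_uri: str) -> str:
    """Return bag-info.txt content with exactly one license tag line.

    Assumes every tag value fits on a single line (no folded values).
    """
    prefix = LICENSE_TAG.lower() + ":"
    # Drop any existing license line so the tag is not duplicated.
    kept = [
        line for line in bag_info_text.splitlines()
        if not line.lower().startswith(prefix)
    ]
    kept.append(f"{LICENSE_TAG}: {license_uri}")
    return "\n".join(kept) + "\n"


# Illustrative input; the values are made up for this example.
bag_info = "Bagging-Date: 2018-01-01\nBag-Software-Agent: bdbag\n"
updated = add_license_tag(bag_info, "http://www.apache.org/licenses/LICENSE-2.0")
print(updated)
```

A URI or PID keeps the tag short and machine-comparable; the full license text could still be shipped as a payload file as in option 1.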

krobasky commented 6 years ago

A student and I have been reviewing various community FAIR efforts and mapping them to requirements for a simple metadata model. We considered ambitious, rigorous efforts such as DATS and HCLS, and decided to start with a more rudimentary, well-scoped set of requirements that are computable but decoupled from implementation. For example, we took into account the convention you describe for licensing, and we also account for versioning of objects, APIs, and even IDs (for example, AAC53040 is the accession ID for the p53 protein sequence object, and its most recent version is AAC53040.1). What is the best format for sharing these conventions for your consideration and feedback? Would a protocol buffer be a proper format, or JSON, or...?
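To make the ID-versioning point concrete, here is a minimal sketch that splits an identifier of the form ACCESSION.VERSION, as in the AAC53040.1 example above. The names `VersionedId` and `parse_versioned_id` are hypothetical, and the rule that a trailing numeric suffix is the version is an assumption drawn from that one example, not from any agreed convention.

```python
from typing import NamedTuple, Optional


class VersionedId(NamedTuple):
    accession: str
    version: Optional[int]  # None when the ID carries no version suffix


def parse_versioned_id(identifier: str) -> VersionedId:
    """Split 'ACCESSION.VERSION' into its parts; the version is optional."""
    accession, sep, suffix = identifier.rpartition(".")
    # Treat the suffix as a version only if it is purely numeric.
    if accession and sep and suffix.isdigit():
        return VersionedId(accession, int(suffix))
    return VersionedId(identifier, None)


print(parse_versioned_id("AAC53040.1"))
print(parse_versioned_id("AAC53040"))
```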

stain commented 6 years ago

I agree that more needs to be done to expand the FAIR metadata needed.

Many of those requirements are covered by the underlying specs. For instance, Research Object Bundle manifests list basic provenance per resource. BDBags support RO manifests through the bdbag_ro.py module.

I will admit that license was not listed there. We can in theory use the dct:license property (from Dublin Core Terms) in metadata/manifest.json; that way you can assign a license per aggregated file. It is, however, not directly listed in the RO spec, so it would be a JSON-LD extension that would need to be added manually with bdbag_ro.py. For instance:

"aggregates": [
  {
    "uri": "../data/file.txt",
    "dct:license": {
      "uri": "http://www.apache.org/licenses/LICENSE-2.0",
      "name": "Apache License, Version 2.0"
    }
  }
]

But this should probably be fed upstream for inclusion in a general Research Object profile of FAIR metadata attributes.
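The per-file dct:license fragment above could be produced along these lines. This is a sketch using only the standard library: `with_license` is a hypothetical helper, not a function in bdbag_ro.py, and the explicit `dct` prefix mapping is added on the assumption that the base RO context may not already declare it.

```python
import json

# Assumption: declare the "dct" prefix explicitly in case the base RO
# context does not already map it to Dublin Core Terms.
DCT_CONTEXT = {"dct": "http://purl.org/dc/terms/"}


def with_license(manifest: dict, file_uri: str, license_uri: str, name: str) -> dict:
    """Return a copy of the manifest with dct:license set on one aggregate."""
    result = json.loads(json.dumps(manifest))  # deep copy via JSON round-trip
    context = result.get("@context", [])
    if not isinstance(context, list):
        context = [context]
    if DCT_CONTEXT not in context:
        context.append(DCT_CONTEXT)
    result["@context"] = context

    license_obj = {"uri": license_uri, "name": name}
    for entry in result.setdefault("aggregates", []):
        if entry.get("uri") == file_uri:
            entry["dct:license"] = license_obj
            break
    else:
        # File not aggregated yet: add a new entry carrying the license.
        result["aggregates"].append({"uri": file_uri, "dct:license": license_obj})
    return result


manifest = {
    "@context": ["https://w3id.org/bundle/context"],
    "aggregates": [{"uri": "../data/file.txt"}],
}
licensed = with_license(
    manifest,
    "../data/file.txt",
    "http://www.apache.org/licenses/LICENSE-2.0",
    "Apache License, Version 2.0",
)
print(json.dumps(licensed, indent=2))
```

The helper returns a modified copy rather than editing the manifest in place, so the original can still be compared or re-serialized unchanged.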

There is also schema.org/license, as used, for instance, by BioSchemas Dataset.