CONP-PCNO / conp-portal

extending DATS metadata to enable multiple new features #571

Open · emmetaobrien opened this issue 1 year ago

emmetaobrien commented 1 year ago

This proposal extends the CONP-specific DATS metadata to address a number of issues that have been raised recently.

Structure

The suggested model is to include a number of optional extra fields within the extraProperties section of the DATS file, which is where extensions to DATS are generally recommended to go.

Specific examples:

- extraProperties->expert_curated: a section for any locally curated metadata (e.g. extraProperties->expert_curated->description)
- extraProperties->NeuroLibre_link
- extraProperties->EEGnet_link
- extraProperties->comingSoon

[ TODO: update with fields related to experiments after further discussions on that front ]

By default, the above items are not included in a DATS file.
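For concreteness, a minimal sketch (as a Python dict) of a DATS.json fragment carrying these entries. The DATS schema constrains extraProperties to a list of category/values pairs (see the last comment in this thread), so the sketch uses that form; all category names and values here are illustrative assumptions, not settled conventions:

```python
# Hypothetical DATS.json fragment, expressed as a Python dict.
# Category names reuse the proposed field names; URLs are placeholders.
dats_fragment = {
    "extraProperties": [
        {"category": "NeuroLibre_link", "values": [{"value": "https://..."}]},
        {"category": "EEGnet_link", "values": [{"value": "https://..."}]},
        {"category": "comingSoon", "values": [{"value": "true"}]},
    ]
}
```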

Suggested workflow: editor

Right now I see two variant use cases for the existing DATS editor (items 2 and 3 below), as well as some modifications to its standard functionality (item 1).

1) The existing DATS dataset editor should have options to include values for extraProperties->NeuroLibre_link, extraProperties->EEGnet_link and extraProperties->comingSoon. If these values are not entered, the relevant fields should not be generated in the DATS file.

2) DATS editor for Experiments. Details TBA as discussions continue, but the general idea here is to use some of our existing fields, possibly some new fields added under extraProperties, and possibly set some of our existing Required fields that are not relevant for experimental data to default values rather than requiring the user to enter them.

3) DATS editor for curation. This would display the contents of some of the existing fields (e.g. description) and include options for entry of fields under extraProperties->expert_curated.

We currently have a well-tested implementation of 1) and a prototype of 2) from Joshua Unrau. Whether it makes more sense to develop these as three separate interfaces, or as one interface accepting different parameters, seems worth discussing. The user base for 3) will be different from, and more restricted than, that of 1) and 2). (Entirely local, or are we thinking of external expert curators as a possibility at some point?)

Workflow: updates

In this model the expert_curated data will be stored separately from the description and other fields. If a data provider later updates their descriptions etc., those updates will be stored in parallel with any changes we have made. (In practice, this will either involve updates to different parts of the DATS file, or possibly merging different tracks of development, so it can either be handled automatically by git, or may require someone at CONP to generate a PR from the data provider's copy of the data and merge two versions of DATS.json.)
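As a sketch of the manual path, assuming our curated additions live only in designated extraProperties entries (the expert_curated category name and the helper itself are hypothetical, not an agreed workflow):

```python
import json

def merge_provider_update(provider_dats_path, conp_dats_path, out_path):
    """Take the provider's updated DATS.json as the base, then re-apply
    the CONP-curated extraProperties entries from our copy. Assumes
    curated entries are exactly those with category 'expert_curated'."""
    with open(provider_dats_path) as f:
        provider = json.load(f)
    with open(conp_dats_path) as f:
        conp = json.load(f)

    curated = [p for p in conp.get("extraProperties", [])
               if p.get("category") == "expert_curated"]
    kept = [p for p in provider.get("extraProperties", [])
            if p.get("category") != "expert_curated"]
    provider["extraProperties"] = kept + curated

    with open(out_path, "w") as f:
        json.dump(provider, f, indent=2)
```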

Display of additional data in CONP portal

expert_curated fields, NeuroLibre_link and EEGnet_link would be additional fields behaving like existing optional fields: displayed only if there is a value for them. (Should a locally curated description be displayed instead of, or as well as, the original submitted description?)

If the comingSoon flag is set, the standard data card for the dataset is replaced by a simpler "coming soon" version, design TBD. (Previously discussed in https://github.com/CONP-PCNO/conp-portal/issues/550)
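A sketch of how the portal could test the flag, assuming the category/values encoding discussed at the end of this thread and treating the string "true" as set (both assumptions):

```python
def is_coming_soon(dats: dict) -> bool:
    """Return True if this DATS dict carries a set comingSoon entry;
    absence of the entry means the normal data card is shown."""
    for prop in dats.get("extraProperties", []):
        if prop.get("category") == "comingSoon":
            return any(v.get("value") in ("true", True)
                       for v in prop.get("values", []))
    return False
```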

Display of experimental data also TBD.

Curation and modification policy

The current proposal involves potential changes to DATS files for existing datasets, and this is very likely to continue being the case as we refine our metadata further. Historically, this has been an issue when we modify DATS files and submit pull requests to datasets hosted by the data providers, as there is no predicting how long an individual data provider will take to process those pull requests.

Therefore the following proposal:

1) We agree to a clear distinction between a) metadata describing the scientific content of a dataset and b) metadata that exists for technical CONP data-management purposes. (Should be straightforward, but we could do with a definitive list. Also ideally give categories a) and b) snappier names.)

2) We make copies of all external datalad-managed datasets into the conp-datasets namespace, and link the CONP interface to those copies rather than the source. (We have been doing this in most cases historically.)

3) We regard the data provider as definitive for category-a) metadata (and actual data) and the conp-datasets copy of the data as definitive for category-b) metadata (and also metadata we have curated), and update each appropriately.

4) We come up with a clear statement of the above three points and foreground it such that any user submitting data to CONP will be aware of them and consent to them.

(This issue was previously mentioned in less detail in https://github.com/CONP-PCNO/conp-documentation/issues/86)

Edit: changed name of coming_soon field to comingSoon for consistency with previous development.

tkkuehn commented 1 year ago

I propose that extraProperties->expert_curated take the form of an array of objects with the structure:

```json
{
  "curator": "{name}",
  "curation_date": "{date}",
  "curated_fields": {
    "description": "My new curated description",
    "keywords": ["new", "key", "word"],
    ... (possible other updated metadata fields)
  }
}
```

This would ensure that the provenance of curated metadata is maintained indefinitely, and allow some flexibility in how we decide which information to show on the front end.

In this case, the editor for curated information can be provided a base DATS.json for updating, and return a DATS.json with the only difference being a new entry added to the extraProperties->expert_curated array. That DATS.json can then be provided to portal staff for review and integration.

This structure has the additional benefit that fields in the top-level DATS.json "belong" to the source dataset, and changes they make to the data are managed separately from our curated metadata.
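A minimal sketch of the append-only update the curation editor would make under this structure (the helper name and ISO date format are assumptions; the nested extraProperties->expert_curated path follows the structure as written above, though a later comment in this thread notes extraProperties is actually constrained to a list):

```python
from datetime import date

def add_curation(dats: dict, curator: str, curated_fields: dict) -> dict:
    """Append a provenance-stamped entry to the expert_curated array,
    leaving existing entries and all top-level fields untouched."""
    entry = {
        "curator": curator,
        "curation_date": date.today().isoformat(),
        "curated_fields": curated_fields,
    }
    extra = dats.setdefault("extraProperties", {})
    extra.setdefault("expert_curated", []).append(entry)
    return dats
```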


emmetaobrien commented 1 year ago

@tkkuehn: That looks very good to me. My preference is that a description curated at this level should not replace the original description: DATS.json should store both, and the decision of which to show (or indeed both) is something we handle in the interface based on whatever policy we decide on. This would have the benefit of structurally supporting any change we might make to that policy later, and of letting us handle it differently on a case-by-case basis.

emmetaobrien commented 1 year ago

Also relevant to our discussion of this morning, the proposed structured README for experiments is at https://github.com/katielavigne/documentation/wiki/Experiments-README; my inclination is to prefer that structured information be stored in the DATS file rather than the README, and I think much of this could be stored in extraProperties instead, possibly in an extraProperties->experimentProperties section.

emmetaobrien commented 1 year ago

Based on our discussions before the holiday and on Katie's post linked above, here is my draft proposal for the fields to include within extraProperties->experimentProperties:

functionAssessed: Brief plain-text description of what the experiment sets out to assess. I think this should be required for any experiment, and could therefore be used as a flag to tell whether a given DATS file refers to an experiment (see the sketch after this list).

languages: Languages in which the experimental data is available. An array of text fields.

validation: Array of values for different ways that the results have been verified. May want subcategories? (e.g. validation->measures, validation->populations)

accessibility: Available accessibility options for the results. Array of text fields.

requirements: Any additional requirements that do not fit into the categories below. Probably wants to be free text.

requirements->platforms: List of platforms on which the experiment was carried out. Array of text fields.

requirements->devices: List of devices used to carry out the experiment. Array of text fields? (Katie sounded like this would be a fairly small and well-defined list, so maybe a pull-down menu of examples could go here.)

requirements->software: List of programs used in the experiment. May want to be an array of objects, or of arrays, to contain name, version, etc.
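As flagged under functionAssessed above, the presence of that field could mark a DATS file as describing an experiment. A sketch against the nested form proposed here (a later comment in this thread notes extraProperties is actually a list of category/values pairs, so the real lookup would differ):

```python
def is_experiment(dats: dict) -> bool:
    """Treat the presence of functionAssessed as the experiment flag."""
    experiment = dats.get("extraProperties", {}).get("experimentProperties", {})
    return "functionAssessed" in experiment
```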

emmetaobrien commented 1 year ago

Suggested workflow, copied from email:

1) DATS editor, requiring filling in of all fields as we currently do, or as modified according to the rest of this discussion;

2) When that is submitted, go to a next page with a template for the structured README (based on https://github.com/katielavigne/documentation/wiki/Experiments-README);

3) By default, fill in the title, description &c. in the structured README with the values just entered for the DATS file;

4) The user then edits this template and submits the README file when they are satisfied with it.
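A sketch of step 3), seeding the structured README from the values just entered in the DATS editor; the section headings are assumptions, only loosely following Katie's draft:

```python
def prefill_readme(dats: dict) -> str:
    """Seed a structured README with the title and description already
    entered in the DATS editor; the user edits it before submitting."""
    return "\n".join([
        f"# {dats.get('title', '')}",
        "",
        "## Description",
        dats.get("description", ""),
        "",
        "## Function assessed",
        "<!-- to be filled in by the submitter -->",
    ])
```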

tkkuehn commented 1 year ago

A note on this as I work on the experiments feature: I think the form of extraProperties is constrained a bit more than I realized (although it's possible I was the only one confused by this -- this writeup is mostly for my own benefit):

Instead of something like:

```json
{
  ...
  "extraProperties": {
    "experimentProperties": {
      "languages": ["English"],
      "requirements": {
        "software": ["psychopy"]
      },
      ...
    }
  }
}
```

It will need to look something like:

```json
{
  ...
  "extraProperties": [
    {
      "category": "experimentLanguages",
      "values": [
        {"value": "English"}
      ]
    },
    {
      "category": "experimentRequiredSoftware",
      "values": [
        {"value": "psychopy"}
      ]
    }
  ]
}
```

See our DATS dataset schema for documentation, and this live DATS.json for an example of how extraProperties is formatted.
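Where we generate DATS files programmatically, a small helper could map a nested draft onto the flat form. A sketch, with the key-to-category mapping extrapolated from the example above (e.g. languages becomes experimentLanguages) and therefore an assumption:

```python
def to_extra_properties(experiment: dict) -> list:
    """Flatten a nested experimentProperties draft into the flat
    category/values pairs required by the DATS schema."""
    # Assumed mapping from nested paths to flat category names.
    mapping = {
        ("languages",): "experimentLanguages",
        ("requirements", "software"): "experimentRequiredSoftware",
    }
    props = []
    for path, category in mapping.items():
        node = experiment
        for key in path:
            node = node.get(key, {}) if isinstance(node, dict) else {}
        if node:
            props.append({
                "category": category,
                "values": [{"value": v} for v in node],
            })
    return props
```

Running this on the nested example above would produce exactly the two category/values entries shown in the second block.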