[schema] Dataset schema definition

erinspace commented 10 years ago

Here's the specification we have so far:

contributors - a list of dictionaries containing email, full name, and ORCIDs of contributors.
id - a dictionary of unique IDs given to the article based on the particular publication we’re accessing. Should include an entry for a URL that links right to the original resource, a DOI, and other entries as needed that include more unique IDs available in the original document.
meta - metadata necessary for importing to the OSF (to be further clarified later...)
properties - a dictionary containing elements of the article/study itself, sometimes within lists. Can include figures, PDFs, or any other study data made readily available by the source API. Not all resources will have this information.
description - an abstract or general description of the resource
tags - a list of tags or keywords identified in the resource itself
source - a string identifying where the resource came from
timestamp - string indicating when the article was accessed by scrAPI. YYYY-MM-DD h:m:s
title - a string representing the title of the article or study
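
As a rough illustration of the field list above, here is a minimal Python sketch that checks whether a record has the listed fields with the listed types. The names `REQUIRED_FIELDS` and `validate_record` are hypothetical and not part of scrAPI. The sketch assumes every field must be present (possibly empty), that "meta" is a dictionary, and that the timestamp may carry fractional seconds.

```python
from datetime import datetime

# Top-level fields and the Python type each is described as having.
# Assumption: "meta" is a dict; its contents are not specified yet.
REQUIRED_FIELDS = {
    "contributors": list,
    "id": dict,
    "meta": dict,
    "properties": dict,
    "description": str,
    "tags": list,
    "source": str,
    "timestamp": str,
    "title": str,
}

# Assumption: accept "YYYY-MM-DD h:m:s" with or without fractional seconds.
TIMESTAMP_FORMATS = ("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S")


def validate_record(record):
    """Return a list of problems found in a record (empty if none)."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field} should be a {expected.__name__}")

    # Contributors are described as a list of dictionaries.
    contributors = record.get("contributors")
    if isinstance(contributors, list):
        for index, person in enumerate(contributors):
            if not isinstance(person, dict):
                problems.append(f"contributors[{index}] should be a dict")

    # The timestamp must parse with one of the accepted formats.
    timestamp = record.get("timestamp")
    if isinstance(timestamp, str):
        for fmt in TIMESTAMP_FORMATS:
            try:
                datetime.strptime(timestamp, fmt)
                break
            except ValueError:
                continue
        else:
            problems.append("timestamp is not in YYYY-MM-DD h:m:s form")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report everything a scraper gets wrong in a single pass.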

erinspace commented 10 years ago

Example - from PLoS

{
    "contributors": [
        {
            "email": "loudonj@ecu.edu", 
            "full_name": "James E. Loudon",
            "id" : {"ORCID": "add-orcid-here", "other-id": "add-other-id-here"}
        }
    ], 
    "id": {"url": "http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0100758", 
            "DOI": "10.1371/journal.pone.0100758"},
    "meta": {"OSF specific metadata"}, 
    "properties": {
        "PDF": "http://dx.plos.org/10.1371/journal.pone.0100758.pdf", 
        "figures": ["http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0100758.g001&representation=PNG_M"]
    },
    "description": "This study seeks to understand how humans impact the dietary patterns of eight free-ranging vervet monkey (Chlorocebus pygerythrus) groups in South Africa using stable isotope analysis.", 
    "tags": ["Behavior"],
    "source": "PLoS", 
    "timestamp": "2014-07-11 10:31:33.168456", 
    "title": "PLOS ONE: Using Stable Carbon and Nitrogen Isotope Compositions"
}

efc commented 10 years ago

It would be useful to include a URI in this scheme. The URI would usually be derived from the id, but it would be actionable as a link back to the resource being described.
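
To make the idea concrete, here is a minimal sketch of deriving such a URI from the "id" dictionary. The function name `derive_uri` is hypothetical, and the fallback order (use the "url" entry if present, otherwise build a link from the DOI through the public doi.org resolver) is an assumption rather than anything agreed in this thread.

```python
from urllib.parse import quote


def derive_uri(ids):
    """Pick an actionable link from a record's "id" dictionary, or None."""
    # Prefer a URL that already points at the original resource.
    url = ids.get("url")
    if url:
        return url
    # Otherwise build a resolver link from the DOI, keeping "/" unescaped.
    doi = ids.get("DOI")
    if doi:
        return "https://doi.org/" + quote(doi, safe="/")
    # No usable identifier was supplied.
    return None
```

For the PLoS example, `derive_uri({"DOI": "10.1371/journal.pone.0100758"})` would give `https://doi.org/10.1371/journal.pone.0100758`.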

erinspace commented 10 years ago

How does everyone feel about this general schema to get started with? We can change all current scrapers to output this normalized format for now, and perhaps come up with more detailed information as needed?
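
A scraper's final step could look roughly like the sketch below, which maps a raw record onto the normalized fields. The function name `normalize` and the raw field names ("authors", "name", "doi", "abstract", "keywords") are hypothetical; each source API will use its own names, and "meta" and "properties" are left empty here because their contents depend on the source.

```python
from datetime import datetime


def normalize(raw, source, accessed=None):
    """Map one raw record (hypothetical field names) onto the schema."""
    # Record when the resource was accessed; default to the current time.
    accessed = accessed or datetime.now()
    return {
        "contributors": [
            {
                "email": author.get("email", ""),
                "full_name": author.get("name", ""),
                "id": {},  # ORCID and other IDs, when the source has them
            }
            for author in raw.get("authors", [])
        ],
        "id": {"url": raw.get("url", ""), "DOI": raw.get("doi", "")},
        "meta": {},  # OSF-specific metadata, still to be defined
        "properties": {},  # figures, PDFs, etc., when available
        "description": raw.get("abstract", ""),
        "tags": list(raw.get("keywords", [])),
        "source": source,
        "timestamp": accessed.strftime("%Y-%m-%d %H:%M:%S.%f"),
        "title": raw.get("title", ""),
    }
```

Passing `accessed` explicitly keeps the output reproducible in tests; in normal use it can be omitted.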

efc commented 10 years ago

@erinspace, that seems reasonable. The scheme does not have to be perfect right now; we can iterate as time passes.

I'd like to point to RIOXX as a possible guide. They are just getting comments on a new version of their scheme, and I think it serves as an interesting model. See RIOXX v2.0 beta 1 and note, in particular, the "dc:identifier", which requires a URI. I think the clarity of this document is something for us to strive for, though we might make different choices than they do.

erinspace commented 10 years ago

Ok, going to close this issue for now, with the understanding that we can always come back and tweak things if need be. I've edited the original schema in my first comment to reflect several discussions we've had here and in other threads - about what to include for authors, IDs, and other metadata we'd request for each consumer.

CenterForOpenScience / SHARE

[schema] Dataset schema definition #20