/datasets API endpoint JSON output format

jjcarver commented 6 years ago

Below is some sample JSON that we would tentatively output from the /datasets API endpoint. The dataset used in this example is live in both MassIVE and ProteomeCentral, and can be found at the following links:

Link	URL
MassIVE dataset	https://massive.ucsd.edu/ProteoSAFe/QueryMSV?id=MSV000081125
MassIVE FTP	ftp://massive.ucsd.edu/MSV000081125
ProteomeCentral dataset	http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=6629
ProteomeCentral dataset XML	http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=6629&outputMode=XML&test=no

This is a "full" record with all files listed out:

{
    "accession": "PXD006629",
    "title": "Mitochondrial H+-ATP synthase in human skeletal muscle: contribution to dyslipidemia and insulin resistance",
    "summary": "Mitochondrial H+-ATP synthase in human skeletal muscle: contribution to dyslipidemia and insulin resistance",
    "species": [
        {"accession": "MS:1001467", "name": "taxonomy: NCBI TaxID", "value": "9606", "cvLabel": "MS"}
    ],
    "instruments": [
        {"accession": "MS:1002416", "name": "Orbitrap Fusion", "cvLabel": "MS"}
    ],
    "modifications": [
        {"accession": "UNIMOD:737", "name": "TMT6plex", "cvLabel": "UNIMOD"},
        {"accession": "UNIMOD:35", "name": "Oxidation", "cvLabel": "UNIMOD"},
        {"accession": "UNIMOD:4", "name": "Carbamidomethyl", "cvLabel": "UNIMOD"}
    ],
    "contacts": [
        {"contactProperties":[
            {"accession": "MS:1002037", "name": "dataset submitter", "cvLabel": "MS"},
            {"accession": "MS:1000586", "name": "contact name", "value": "John Lapek", "cvLabel": "MS"},
            {"accession": "MS:1000589", "name": "contact email", "value": "jlapek@ucsd.edu", "cvLabel": "MS"},
            {"accession": "MS:1000590", "name": "contact affiliation", "value": "UCSD", "cvLabel": "MS"}
        ]},
        {"contactProperties":[
            {"accession": "MS:1002332", "name": "lab head", "cvLabel": "MS"},
            {"accession": "MS:1000586", "name": "contact name", "value": "Laura Formentini", "cvLabel": "MS"},
            {"accession": "MS:1000589", "name": "contact email", "value": "lformentini@cbm.csic.es", "cvLabel": "MS"},
            {"accession": "MS:1000590", "name": "contact affiliation", "value": "UAM University Madrid", "cvLabel": "MS"}
        ]}
    ],
    "publications": [
        {"accession": "MS:1002853", "name": "Dataset with no associated published manuscript", "cvLabel": "MS"}
    ],
    "keywords": [
        {"accession": "MS:1001925", "name": "submitter keyword", "value": "mitochondria", "cvLabel": "MS"},
        {"accession": "MS:1001925", "name": "submitter keyword", "value": "insulin resistance", "cvLabel": "MS"},
        {"accession": "MS:1001925", "name": "submitter keyword", "value": "ATP synthase", "cvLabel": "MS"}
    ],
    "datasetLink": {"accession": "MS:1002488", "name": "MassIVE dataset URI", "value": "http://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=d6756ac742ed4f13811ddab2843e7d54", "cvLabel": "MS"},
    "dataFiles": [
        {"accession": "MS:1002846", "name": "Associated raw file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/raw/DG000895_Francisco_Normal_Mitos.raw", "cvLabel": "MS"},
        {"accession": "MS:1002850", "name": "Peak list file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/peak/DG000895_Francisco_Normal_Mitos.mzML", "cvLabel": "MS"},
        {"accession": "MS:1002845", "name": "Result file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/result/DG000895_Francisco_Normal_Mitos_PSMs.mzTab", "cvLabel": "MS"},
        {"accession": "MS:1002848", "name": "Result file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/ccms_result/DG000895_Francisco_Normal_Mitos_PSMs.mzTab", "cvLabel": "MS"},
        {"accession": "MS:1002851", "name": "Other type file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/other/DG000895_Francisco_Normal_Mitos.zip", "cvLabel": "MS"},
        {"accession": "MS:1002851", "name": "Other type file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/other/Francisco_Normal_Mitos.xlsx", "cvLabel": "MS"},
        {"accession": "MS:1002851", "name": "Other type file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/ccms_parameters/params.xml", "cvLabel": "MS"},
        {"accession": "MS:1002851", "name": "Other type file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/ccms_statistics/statistics.tsv", "cvLabel": "MS"}
    ],
    "links": [
        {"rel": "self", "href": "http://massive.ucsd.edu/ProteoSAFe/proxi/datasets/PXD006629"}
    ]
}

Please comment on any potential issues you see with this sample output format.
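
For anyone who wants to poke at the format from the client side, here is a minimal sketch of reading such a record. It assumes the field names shown in the sample above; the helper name `files_by_accession` and the trimmed record are made up for illustration. Files are grouped by CV accession rather than by term name, since the accession is the stable identifier.

```python
import json
from collections import defaultdict

# Trimmed copy of the sample record above; only the fields used here are kept.
RECORD_JSON = """
{
    "accession": "PXD006629",
    "dataFiles": [
        {"accession": "MS:1002846", "name": "Associated raw file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/raw/DG000895_Francisco_Normal_Mitos.raw", "cvLabel": "MS"},
        {"accession": "MS:1002850", "name": "Peak list file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/peak/DG000895_Francisco_Normal_Mitos.mzML", "cvLabel": "MS"},
        {"accession": "MS:1002851", "name": "Other type file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/other/DG000895_Francisco_Normal_Mitos.zip", "cvLabel": "MS"},
        {"accession": "MS:1002851", "name": "Other type file URI", "value": "ftp://massive.ucsd.edu/MSV000081125/other/Francisco_Normal_Mitos.xlsx", "cvLabel": "MS"}
    ]
}
"""


def files_by_accession(record):
    """Group file URIs by the CV accession of their term (hypothetical helper)."""
    grouped = defaultdict(list)
    for term in record.get("dataFiles", []):
        grouped[term["accession"]].append(term["value"])
    return dict(grouped)


record = json.loads(RECORD_JSON)
grouped = files_by_accession(record)
for accession, uris in grouped.items():
    print(accession, len(uris))
```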

jjcarver commented 6 years ago

One issue I see is that there is a bit of a mismatch between the ProteomeXchange XML schema and the Dataset model in the YAML specification.

In the YAML, we define a "Contact" object that is basically an encapsulated array of CV term properties. This is necessary since there are many CV terms needed to fully describe a particular "Contact" (e.g. name, email, affiliation, etc).

However, we do not define a similar object for "DatasetIdentifier" or "Species", even though these are examples of objects that may require multiple terms to properly define in the XML (e.g. a dataset that has both PXD and MassIVE identifiers, or a species that we want to define by stating both its NCBI taxonomy ID and its scientific and/or common name).

Should we clarify exactly which objects may require multiple CV terms to define, and explicitly encode this into the YAML with appropriate model objects?

ypriverol commented 6 years ago

@jjcarver you are absolutely right. While writing some of the documentation and examples, I noticed the same problem for Publications. My proposal is to create a new type:

OntologyTermList, which is a set of OntologyTerms. Then we can have Contacts as an array of OntologyTermList (a list of lists). With this generic approach we don't need to create types for contacts, publications, species, or DatasetIdentifiers. What do you think?

A JSON example follows:

"contacts": [
    {
        [
            {"accession": "MS:1002037", "name": "dataset submitter", "cvLabel": "MS"},
            {"accession": "MS:1000586", "name": "contact name", "value": "John Lapek", "cvLabel": "MS"},
            {"accession": "MS:1000589", "name": "contact email", "value": "jlapek@ucsd.edu", "cvLabel": "MS"},
            {"accession": "MS:1000590", "name": "contact affiliation", "value": "UCSD", "cvLabel": "MS"}
        ]
    },
    {
        [
            {"accession": "MS:1002332", "name": "lab head", "cvLabel": "MS"},
            {"accession": "MS:1000586", "name": "contact name", "value": "Laura Formentini", "cvLabel": "MS"},
            {"accession": "MS:1000589", "name": "contact email", "value": "lformentini@cbm.csic.es", "cvLabel": "MS"},
            {"accession": "MS:1000590", "name": "contact affiliation", "value": "UAM University Madrid", "cvLabel": "MS"}
        ]
    }
],
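
The "list of lists" idea can also be sketched as types on the consuming side. This is only an illustration, not the YAML definition: `OntologyTerm` is modelled as a `TypedDict` using the property names from the sample, `OntologyTermList` as a plain list alias, and `term_value` is a hypothetical helper.

```python
from typing import List, Optional, TypedDict


class OntologyTerm(TypedDict, total=False):
    # Property names follow the sample JSON; "value" is absent on some terms.
    accession: str
    name: str
    value: str
    cvLabel: str


# Assumption for this sketch: an OntologyTermList is simply a list of terms.
OntologyTermList = List[OntologyTerm]


def term_value(terms: OntologyTermList, accession: str) -> Optional[str]:
    """Return the value of the first term with the given accession, if any."""
    for term in terms:
        if term.get("accession") == accession:
            return term.get("value")
    return None


# Contacts as an array of OntologyTermList, one inner list per contact.
contacts: List[OntologyTermList] = [
    [
        {"accession": "MS:1002037", "name": "dataset submitter", "cvLabel": "MS"},
        {"accession": "MS:1000586", "name": "contact name", "value": "John Lapek", "cvLabel": "MS"},
    ],
    [
        {"accession": "MS:1002332", "name": "lab head", "cvLabel": "MS"},
        {"accession": "MS:1000586", "name": "contact name", "value": "Laura Formentini", "cvLabel": "MS"},
    ],
]

names = [term_value(contact, "MS:1000586") for contact in contacts]
print(names)
```

The same alias would serve publications, species, or dataset identifiers without a dedicated type for each.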

jjcarver commented 6 years ago

I agree. This is a neat and elegant solution to the problem. The only issue I see with your example JSON is that the OntologyTermList objects need a string key to identify the array of OntologyTerms. If I understand correctly, a JSON object (i.e. everything within curly braces "{}") must consist of key-value pairs whose keys are strings, so an array cannot sit directly inside the braces.

Thus instead of your example:

{
    [
        {"accession": "MS:1002037", "name": "dataset submitter", "cvLabel": "MS"},
        {"accession": "MS:1000586", "name": "contact name", "value": "John Lapek", "cvLabel": "MS"},
        {"accession": "MS:1000589", "name": "contact email", "value": "jlapek@ucsd.edu", "cvLabel": "MS"},
        {"accession": "MS:1000590", "name": "contact affiliation", "value": "UCSD", "cvLabel": "MS"}
    ]
}

you would need to have some standard property name to identify the array, e.g.:

{
    "terms": [
        {"accession": "MS:1002037", "name": "dataset submitter", "cvLabel": "MS"},
        {"accession": "MS:1000586", "name": "contact name", "value": "John Lapek", "cvLabel": "MS"},
        {"accession": "MS:1000589", "name": "contact email", "value": "jlapek@ucsd.edu", "cvLabel": "MS"},
        {"accession": "MS:1000590", "name": "contact affiliation", "value": "UCSD", "cvLabel": "MS"}
    ]
}
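
A quick way to check the difference is to run both shapes through a JSON parser. The sketch below uses Python's standard `json` module with shortened versions of the two examples; `is_valid_json` is a helper name made up for this check, and `"terms"` is just the property name suggested above.

```python
import json

# Array placed directly inside braces, as in the first example.
keyless = '{ [ {"accession": "MS:1002037", "name": "dataset submitter", "cvLabel": "MS"} ] }'

# Same array identified by a property name.
keyed = '{ "terms": [ {"accession": "MS:1002037", "name": "dataset submitter", "cvLabel": "MS"} ] }'


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print("keyless:", is_valid_json(keyless))
print("keyed:", is_valid_json(keyed))
```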

edeutsch commented 6 years ago

To make it follow the XML better, wouldn't it be better to do: "contact": [ instead of: "terms": [

The XML is:

Ideally the XML would be easily transformable into JSON using the same structure.

ypriverol commented 6 years ago

@edeutsch the contact level is already there as contacts. The collection makes clear that we are talking about contacts. If we do

datasets: contacts: contact: terms

we will need to create a data type for each of Publication, Contact, Species, DatasetIdentifierList, etc.

edeutsch commented 6 years ago

okay, well, there already is a datatype contact, so that's already done. Adding data types for publication and species may be a good idea. I think we should follow the structure of the PX XML quite closely. Modifications is just a single level of structure, whereas contact, publication, species, and some others have two levels, i.e. each publication, species, or contact can have multiple terms grouped together.

Anyway, we already have a datatype for contact, so we should either have one for each or remove the special data type for contact?

Either way is fine with me, as long as we preserve the structure of the PX XML: http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=8339&outputMode=XML&test=no

edeutsch commented 6 years ago

rereading the thread a little more carefully, I see that the proposal is to remove contact and treat all the double-tiered data types (like contact and publication) using a generic container. that's fine with me, i think. More detailed knowledge of the schema is required to transform the JSON to the XML, but I suppose that's fine.
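
To illustrate what "more detailed knowledge of the schema" could mean in practice, here is a rough sketch of going from the generic container back to XML. Everything schema-specific is an assumption made for the example: the element names in `WRAPPERS`, the `cvParam` attribute names, and the input shape (a list of term lists) are illustrative and not taken from the PX XML schema or the final YAML.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from JSON collection key to (list element, item element).
# With a generic container, this is knowledge the converter has to bring itself.
WRAPPERS = {
    "contacts": ("ContactList", "Contact"),
    "publications": ("PublicationList", "Publication"),
}


def to_xml(key, term_lists):
    """Build an XML element for one two-level collection (illustrative only)."""
    list_tag, item_tag = WRAPPERS[key]
    parent = ET.Element(list_tag)
    for terms in term_lists:
        item = ET.SubElement(parent, item_tag)
        for term in terms:
            attrs = {
                "cvRef": term["cvLabel"],
                "accession": term["accession"],
                "name": term["name"],
            }
            # "value" is optional on a term, so only emit it when present.
            if "value" in term:
                attrs["value"] = term["value"]
            ET.SubElement(item, "cvParam", attrs)
    return parent


contacts = [
    [
        {"accession": "MS:1002037", "name": "dataset submitter", "cvLabel": "MS"},
        {"accession": "MS:1000586", "name": "contact name", "value": "John Lapek", "cvLabel": "MS"},
    ]
]

element = to_xml("contacts", contacts)
print(ET.tostring(element, encoding="unicode"))
```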

ypriverol commented 6 years ago

If you agree, @jjcarver and @edeutsch, I will include that in the present pull request. Then you can accept it with this change included.

ypriverol commented 6 years ago

@jjcarver the current version contains the changes; please check the PR https://github.com/HUPO-PSI/proxi-schemas/pull/16

ypriverol commented 6 years ago

Thanks, @jjcarver, for accepting the PR. The new version now contains our changes. I will close this issue.

HUPO-PSI / proxi-schemas

/datasets API endpoint JSON output format #15