HumanCellAtlas / ingest-central

Ingest Central is the hub repository for the ingest service
Apache License 2.0

[USI] MVP for USI submissions needed before archiving in production #326

Closed malloryfreeberg closed 5 years ago

malloryfreeberg commented 5 years ago

The following items are considered "must-haves" for being able to submit HCA prod datasets to ENA, BioStudies, and BioSamples prod archives.

  1. Release date (ENA) - The release date for all submissions must be set to the HCA submission date; otherwise, the submissions will not be viewable (i.e. will be private) in the archive. Currently ENA release is set for 2 years in the future. It is unclear if this issue should be addressed by the USI or ENA.

    • [x] Set ENA release date to HCA submission date. NB: Current workaround is to update the Release Date after submission.
  2. Qualified metadata field names (BioSamples) - Currently, the HCA metadata field names submitted through USI to BioSamples are the fully-qualified names, which makes them unwieldy and hard to read. The unqualified field names should be submitted instead. We would also like to format the HCA ontologized fields so there is less duplication. I think this is an issue to be addressed by HCA.

    • [x] Either (1 ideally) Use user-friendly names instead of fully-qualified field names or (2 backup) Remove upstream parts of the fully-qualified field names.
    • [x] Update ontology fields. Either (1 ideally) format them as BioSamples formats them or (2 backup) remove redundant ontology_label field and make sure the text and ontology fields are adjacent.
  3. HCA project accession (BioStudies) - Currently, the HCA Project UUID is embedded in the "alias" field in the BioStudies submission. Other archives have dedicated HCA UUID fields (e.g. "HCA Biomaterial UUID" in BioSamples). We should be submitting an "HCA Project UUID" key:value to BioStudies to uniquely reference the HCA Project. The BioSamples "alias" field is not required by BioStudies. Now being tracked in https://github.com/HumanCellAtlas/ingest-central/issues/344

    • [x] Add a "HCA Project UUID" field to BioStudies submission and populate with the project UUID.
    • [x] Make sure "HCA Project UUID" is being passed to BioStudies and is available in the JSON.
    • [x] ~Remove "alias" field if not required by USI.~ Don't need for now. Will make new ticket if needed in future.
  4. HCA funding and publication information (BioStudies) - Need to test whether this information in an HCA project gets correctly submitted to BioStudies. If it doesn't, need to figure out how to do it.

    • [x] Pass HCA Funders metadata from HCA to USI.
    • [x] Pass HCA Funders metadata from USI to BioStudies.
    • [x] Pass HCA Publication metadata to BioStudies.
  5. Fix HCA contact ORCID IDs regular expression that is validated in USI.

    • [x] USI regex allows all valid ORCID IDs.
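
For reference, a minimal sketch of a permissive ORCID iD check. It assumes the bare 16-character form (four hyphen-separated groups, last character a digit or `X`) and the ISO 7064 MOD 11-2 check character that ORCID documents. The function names are hypothetical, and this is not the actual USI regular expression.

```python
import re

# Four groups of four characters; only the very last character may be "X".
ORCID_PATTERN = re.compile(r"^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$")


def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Return True if the string has the ORCID shape and a matching check character."""
    if not ORCID_PATTERN.match(orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_character(compact[:15]) == compact[15]
```

A pattern that only allows digits in the last position would reject valid iDs ending in `X`, which is one way a regex can fail to allow all valid ORCID IDs.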

The following items are considered "nice-to-haves" for being able to submit HCA prod datasets to ENA, BioStudies, and BioSamples prod archives.

  1. HCA metadata duplicated (ENA) - Project and experiment XML contain duplicate metadata where specific ENA fields were set based on HCA metadata. The HCA metadata that was used to set these fields is also presented.

    • [x] Remove duplicated Project and Experiment XML fields.
  2. HCA project tag (BioSamples)

    • [x] Add an HCA-specific project tag to archived samples. Based on a current example, the attribute could look like:
    "project" : [ {
      "text" : "Human Cell Atlas"
    } ],
  3. HCA metadata for contact name not parsed correctly (BioStudies)
    • [x] Parse contact name to get just the first name for the "firstName" field.

Currently looks like this:

     {
      "name": "firstName",
      "value": "Mirjana,,Efremova"
     },

Should look like this:

     {
      "name": "firstName",
      "value": "Mirjana"
     },
  4. Remove contributors who are not part of the actual project from list of authors submitted to BioStudies. Now being tracked in https://github.com/HumanCellAtlas/ingest-central/issues/345

    • [x] Remove HCA Wranglers and external curators from list of authors submitted to BioStudies.
  5. Release date (ENA) - The release date for all submissions must be set to the HCA submission date; otherwise, the submissions will not be viewable (i.e. will be private) in the archive. Currently ENA release is set for 2 years in the future. It is unclear if this issue should be addressed by the USI or ENA. Now being tracked in https://github.com/HumanCellAtlas/ingest-central/issues/346

    • [ ] Programmatically set ENA release date to HCA submission date.
malloryfreeberg commented 5 years ago

@aaclan-ebi :)

ke4 commented 5 years ago
Point 1 should be done by USI. Currently, USI does not send the release date for the study. I discussed it with the ENA team, and USI should set it on the Submission with a specific action element in the XML that is submitted to ENA. I am working on this.
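
As an illustration only, here is a sketch of what setting the date through an action element could look like. It assumes the `HOLD` action with a `HoldUntilDate` attribute from the ENA submission XML format; the comment above does not say which action USI will use, and the function name is hypothetical.

```python
import datetime
import xml.etree.ElementTree as ET


def build_submission_xml(release_date):
    """Build a minimal ENA-style SUBMISSION element that holds data until release_date.

    Assumption: the HOLD action with HoldUntilDate is the relevant action element;
    the actual USI implementation may differ.
    """
    submission = ET.Element("SUBMISSION")
    actions = ET.SubElement(submission, "ACTIONS")
    # Action 1: add the submitted objects.
    ET.SubElement(ET.SubElement(actions, "ACTION"), "ADD")
    # Action 2: keep the objects private until the given date.
    hold = ET.SubElement(ET.SubElement(actions, "ACTION"), "HOLD")
    hold.set("HoldUntilDate", release_date.isoformat())
    return ET.tostring(submission, encoding="unicode")


submission_xml = build_submission_xml(datetime.date(2018, 9, 12))
```

Passing the HCA submission date as `release_date` would replace the default two-year hold described in the issue.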
ke4 commented 5 years ago

Point 5 is done and has been successfully tested by @malloryfreeberg.

ke4 commented 5 years ago

Point 4: Funding information is currently not covered in the Project entity in USI. @malloryfreeberg and/or @aaclan-ebi: could you tell me what information we need to store under funding, please? At the same time, I am going to get some information from the BioStudies team related to this one as well.

malloryfreeberg commented 5 years ago

Currently, the HCA only has 3 funder fields (schema link):

        "grant_title": {
            "description": "The name of the grant funding the project.",
            "type": "string",
            "example": "Study of single cells in the human body.",
            "user_friendly": "Grant title",
            "guidelines": "Enter a title of approximately 30 words."
        },
        "grant_id": {
            "description": "The unique grant identifier or reference.",
            "type": "string",
            "example": "BB/P0000001/1",
            "user_friendly": "Grant ID"
        },
        "organization": {
            "description": "The name of the funding organization.",
            "type": "string",
            "example": "Biotechnology and Biological Sciences Research Council (BBSRC); California Institute of Regenerative Medicine (CIRM)",
            "user_friendly": "Funding organization"
        }

HCA grant_id maps to BioStudies grant_id and HCA organization maps to BioStudies Agency.
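
The mapping above can be sketched as follows. The target key names `grant_id` and `Agency` come from this comment; the surrounding BioStudies/USI JSON structure is not shown in the thread, so the flat dictionary shape and the function name are assumptions.

```python
def funders_to_biostudies(funders):
    """Map HCA funder entries to the two BioStudies fields named in this thread.

    grant_title has no stated BioStudies target, so this sketch drops it.
    """
    converted = []
    for funder in funders:
        converted.append({
            "grant_id": funder.get("grant_id", ""),
            "Agency": funder.get("organization", ""),
        })
    return converted


# Example values taken from the schema "example" entries above.
example = funders_to_biostudies([{
    "grant_title": "Study of single cells in the human body.",
    "grant_id": "BB/P0000001/1",
    "organization": "Biotechnology and Biological Sciences Research Council (BBSRC)",
}])
```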

aaclan-ebi commented 5 years ago

Point 2 is done. Please verify.

Example: HCA Biomaterial -> BioSamples Sample:

    "content": {
        "describedBy": "http://schema.integration.data.humancellatlas.org/type/biomaterial/10.1.1/donor_organism",
        "schema_type": "biomaterial",
        "biomaterial_core": {
            "biomaterial_id": "Q4_DEMO-donor_MGH30",
            "biomaterial_name": "Q4 DEMO donor MGH30",
            "biomaterial_description": "Description",
            "ncbi_taxon_id": [
                9606
            ]
        },
        "medical_history": {
            "smoking_history": "yes"
        },
        "genus_species": [{
            "text": "Homo sapiens",
            "ontology": "NCBITaxon:9606",
            "ontology_label": "label"
        }],
        "is_living": "no",
        "sex": "unknown"
    },
    "submissionDate": "2018-09-12T02:59:58.368Z",
    "updateDate": "2018-09-12T03:00:01.696Z",
    "user": "anonymousUser",
    "lastModifiedUser": "anonymousUser",
    "uuid": {
        "uuid": "d85ac5e2-733c-46ad-937f-db5d704aa177"
    },
    "events": [],
    "accession": null,
    "validationState": "Valid",
    "validationErrors": []
}

{
  "alias": "hca339",
  "description": "Description",
  "attributes": {
    "HCA Biomaterial Type":[{
          "value": "donor_organism"
    }],
    "HCA Biomaterial UUID":[{
          "value": ""
    }],
    "Biomaterial - Biomaterial Core - Biomaterial Id": [{
        "value": "Q4_DEMO-donor_MGH30"
    }],
    "Biomaterial - Is Living": [{
        "value": "no"
    }],
    "Biomaterial - Medical History - Smoking History": [{
        "value": "yes"
    }],
    "Biomaterial - Sex": [{
        "value": "unknown"
    }]
  },
  "releaseDate": "2018-09-12",
  "sampleRelationships": [],
  "taxonId": 9606,
  "taxon": "Homo sapiens",
  "title": "Q4 DEMO donor MGH30"
}

Test submission: http://wwwdev.ebi.ac.uk/biosamples/samples/SAMEA5574155

malloryfreeberg commented 5 years ago

> Point 2 is done. Please verify.

Looks great! The ontology links work for me.

malloryfreeberg commented 5 years ago

@aaclan-ebi how hard would it be to do point 3 under "nice to haves"?

  3. HCA metadata for contact name not parsed correctly (BioStudies)
    • [ ] Parse contact name to get just the first name for the "firstName" field.
aaclan-ebi commented 5 years ago

@malloryfreeberg I've pushed some changes to parse the contact name. If the contact name couldn't be parsed, that will be raised as an error in the REPORT.json output of the archiver script.

HCA Project contributors

"contributors": [{
    "contact_name": "First,Middle,Last",
    "email": "dummy@email.com",
    "institution": "Fake Institution"
},{
    "contact_name": "First2,Middle2,Last2",
    "email": "dummy2@email.com",
    "institution": "Fake Institution2"
}],

BioStudies - Project contacts

 "contacts": [{
    "orcid": "",
    "firstName": "First",
    "middleInitials": "M",
    "lastName": "Last",
    "email": "dummy@email.com",
    "address": "",
    "affiliation": "Fake Institution",
    "phone": ""
},{
    "orcid": "",
    "firstName": "First2",
    "middleInitials": "M",
    "lastName": "Last2",
    "email": "dummy2@email.com",
    "address": "",
    "affiliation": "Fake Institution2",
    "phone": ""
}],
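
A rough sketch of the parsing shown above, assuming `contact_name` is always `First,Middle,Last` with an optional empty middle part. The function name is hypothetical and this is not the archiver's actual code; it raises an error where the archiver would add an entry to REPORT.json.

```python
def parse_contact_name(contact_name):
    """Split an HCA contact_name into the BioStudies name fields."""
    parts = [part.strip() for part in contact_name.split(",")]
    # Expect exactly three comma-separated parts; first and last must be non-empty.
    if len(parts) != 3 or not parts[0] or not parts[2]:
        raise ValueError(f"Cannot parse contact name: {contact_name!r}")
    first, middle, last = parts
    # Reduce each middle name to its initial, e.g. "Middle" -> "M".
    initials = "".join(word[0].upper() for word in middle.split())
    return {"firstName": first, "middleInitials": initials, "lastName": last}
```

With this sketch, `"Mirjana,,Efremova"` yields `firstName` "Mirjana", an empty `middleInitials`, and `lastName` "Efremova".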

I also replaced the '|' delimiter with ',' in publication authors:

HCA Project publications

"publications": [{
    "authors": [
        "Vento-Tormo R",
        "Efremova M",
        "Botting RA",
        "Turco MY",
        "Vento-Tormo M",
        "Meyer KB",
        "Park J",
        "Stephenson E",
        "Polanski K",
        "Payne RP",
        "Goncalves A",
        "Zou A",
        "Henriksson J",
        "Wood L",
        "Lisgo S",
        "Filby A",
        "Wright GJ",
        "Stubbington MJ",
        "Haniffa M",
        "Moffett A",
        "Teichmann SA"
    ],
    "publication_title": "Reconstructing the human first trimester fetal-maternal interface using single cell transcriptomics",
    "doi": "10.1101/429589",
    "publication_url": "https://www.biorxiv.org/content/early/2018/09/29/429589"
}]

BioStudies Project publications

"publications": [{
    "pubmedId": "",
    "doi": "10.1101/429589",
    "articleTitle": "Reconstructing the human first trimester fetal-maternal interface using single cell transcriptomics",
    "authors": "Vento-Tormo R , Efremova M , Botting RA , Turco MY , Vento-Tormo M , Meyer KB , Park J , Stephenson E , Polanski K , Payne RP , Goncalves A , Zou A , Henriksson J , Wood L , Lisgo S , Filby A , Wright GJ , Stubbington MJ , Haniffa M , Moffett A , Teichmann SA"
}]
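
The publication conversion above could be sketched like this. The output keys and the `" , "` separator are copied from the example; the `pmid` input key is an assumption (the example has no PubMed ID), and the function name is hypothetical.

```python
def publication_to_biostudies(publication):
    """Convert one HCA publication entry into the BioStudies shape shown above."""
    return {
        # Assumption: a PubMed ID, if present, would live under a "pmid" key.
        "pubmedId": str(publication.get("pmid", "")),
        "doi": publication.get("doi", ""),
        "articleTitle": publication.get("publication_title", ""),
        # Authors are joined with a comma surrounded by spaces, as in the example.
        "authors": " , ".join(publication.get("authors", [])),
    }
```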
aaclan-ebi commented 5 years ago

Point 2 update:

Example: HCA Biomaterial -> BioSamples Sample:

    "content": {
        "describedBy": "http://schema.integration.data.humancellatlas.org/type/biomaterial/10.1.1/donor_organism",
        "schema_type": "biomaterial",
        "biomaterial_core": {
            "biomaterial_id": "Q4_DEMO-donor_MGH30",
            "biomaterial_name": "Q4 DEMO donor MGH30",
            "biomaterial_description": "Description",
            "ncbi_taxon_id": [
                9606
            ]
        },
        "medical_history": {
            "smoking_history": "yes"
        },
        "genus_species": [{
            "text": "Homo sapiens",
            "ontology": "NCBITaxon:9606",
            "ontology_label": "label"
        }],
        "is_living": "no",
        "sex": "unknown"
    },
    "submissionDate": "2018-09-12T02:59:58.368Z",
    "updateDate": "2018-09-12T03:00:01.696Z",
    "user": "anonymousUser",
    "lastModifiedUser": "anonymousUser",
    "uuid": {
        "uuid": "d85ac5e2-733c-46ad-937f-db5d704aa177"
    },
    "events": [],
    "accession": null,
    "validationState": "Valid",
    "validationErrors": []
}

{
  "alias": "hca339",
  "description": "Description",
  "attributes": {
    "HCA Biomaterial Type":[{
          "value": "donor_organism"
    }],
    "HCA Biomaterial UUID":[{
          "value": ""
    }],
    "Biomaterial Core - Biomaterial Id": [{
        "value": "Q4_DEMO-donor_MGH30"
    }],
    "Is Living": [{
        "value": "no"
    }],
    "Medical History - Smoking History": [{
        "value": "yes"
    }],
    "Sex": [{
        "value": "unknown"
    }]
  },
  "releaseDate": "2018-09-12",
  "sampleRelationships": [],
  "taxonId": 9606,
  "taxon": "Homo sapiens",
  "title": "Q4 DEMO donor MGH30"
}

Test submission: http://wwwdev.ebi.ac.uk/biosamples/samples/SAMEA5574183
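
For illustration, the naming in the updated example (nested keys joined with " - ", underscores replaced, words capitalised, the leading "Biomaterial" dropped) can be reproduced with a small flattening sketch. The function names and the `skip` parameter are hypothetical; how the real converter chooses which fields to leave out is not shown in this thread.

```python
def friendly_name(path):
    """Turn ("medical_history", "smoking_history") into "Medical History - Smoking History"."""
    return " - ".join(part.replace("_", " ").title() for part in path)


def flatten_attributes(content, skip=frozenset(), prefix=()):
    """Flatten scalar fields of an HCA content dict into BioSamples-style attributes.

    Lists (e.g. ontology fields) are ignored here; they would need their own handling.
    """
    attributes = {}
    for key, value in content.items():
        path = prefix + (key,)
        if ".".join(path) in skip:
            continue
        if isinstance(value, dict):
            attributes.update(flatten_attributes(value, skip, path))
        elif isinstance(value, (str, int, float, bool)):
            attributes[friendly_name(path)] = [{"value": str(value)}]
    return attributes
```

Called on the `content` object above with `describedBy`, `schema_type`, and the name and description fields in `skip`, this produces the four unqualified attribute names shown in the sample.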

aaclan-ebi commented 5 years ago

Nice to have # 3 project tag in BioSamples is done.

aaclan-ebi commented 5 years ago

Must-have # 4 (Pass HCA Funders metadata from HCA to USI) is done.

aaclan-ebi commented 5 years ago

@malloryfreeberg For Nice-to-have # 4 (Remove HCA Wranglers and external curators from list of authors submitted to BioStudies) Is there a field in the HCA contributor metadata that we could check to know who to include as BioStudies contact?

mcourtot commented 5 years ago

[minor comment] Re ontology links it'd be better to use the full URI "ontology": "http://purl.obolibrary.org/obo/NCBITaxon_9606", instead of "ontology": "NCBITaxon:9606",

(note the underscore vs colon in the CURIE)

malloryfreeberg commented 5 years ago

> @malloryfreeberg For Nice-to-have # 4 (Remove HCA Wranglers and external curators from list of authors submitted to BioStudies) Is there a field in the HCA contributor metadata that we could check to know who to include as BioStudies contact?

@aaclan-ebi project_role

malloryfreeberg commented 5 years ago

> [minor comment] Re ontology links it'd be better to use the full URI "ontology": "http://purl.obolibrary.org/obo/NCBITaxon_9606", instead of "ontology": "NCBITaxon:9606",
>
> (note the underscore vs colon in the CURIE)

Our current conversion to ontology fields in BioSamples JSON is:

    "organism" : [ {
      "text" : "Homo sapiens",
      "ontologyTerms" : [ "http://purl.obolibrary.org/obo/NCBITaxon_9606" ]
    } ],

So I think this is OK for now.

In the HCA we use "ontology": "NCBITaxon:9606", and I'm not sure if we have plans to store the URI instead or keep the colon usage and build the URI if needed. @daniwelter might know.
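
If the colon form is kept and the URI is built when needed, the conversion is small. A sketch, assuming the term comes from an OBO ontology served under the `http://purl.obolibrary.org/obo/` base seen above (the function name is hypothetical, and other ontologies may use different URI patterns):

```python
OBO_BASE = "http://purl.obolibrary.org/obo/"


def curie_to_obo_uri(curie):
    """Convert a CURIE such as "NCBITaxon:9606" to its OBO PURL."""
    prefix, separator, local_id = curie.partition(":")
    if not separator or not prefix or not local_id:
        raise ValueError(f"Not a CURIE: {curie!r}")
    # The PURL uses an underscore where the CURIE uses a colon.
    return f"{OBO_BASE}{prefix}_{local_id}"
```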

aaclan-ebi commented 5 years ago

For BioStudies project contacts, is it safe to exclude contributors whose project role is "Human Cell Atlas wrangler", or should I just search for the "wrangler" keyword (which might not be safe either)? Hmm, this is not an enum field, right?
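
One possible shape for the exact-match option, as a sketch only. It assumes `project_role` is a plain string on each contributor and compares it case-insensitively against a configurable set; the function name and the default set are hypothetical, and only the "Human Cell Atlas wrangler" label comes from this thread.

```python
# Roles to leave out of the BioStudies contacts, stored lower-cased for comparison.
EXCLUDED_ROLES = frozenset({"human cell atlas wrangler"})


def biostudies_contributors(contributors, excluded_roles=EXCLUDED_ROLES):
    """Return the contributors whose project_role is not in excluded_roles."""
    kept = []
    for contributor in contributors:
        role = contributor.get("project_role", "").strip().lower()
        if role not in excluded_roles:
            kept.append(contributor)
    return kept
```

Since the field is free text, an exact match can miss spelling variants, while a keyword search for "wrangler" can catch unrelated roles; keeping the set configurable leaves room for either choice.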