malloryfreeberg closed this issue 5 years ago
@aaclan-ebi :)
Point 5 is done and successfully tested by @malloryfreeberg
Point 4: Funding information is currently not covered by the Project entity in USI. @malloryfreeberg and/or @aaclan-ebi: could you give me more information about what we need to store under funding, please? At the same time, I am going to get some information from the BioStudies team related to this one as well.
Currently, the HCA only has 3 funder fields (schema link):
"grant_title": {
"description": "The name of the grant funding the project.",
"type": "string",
"example": "Study of single cells in the human body.",
"user_friendly": "Grant title",
"guidelines": "Enter a title of approximately 30 words."
},
"grant_id": {
"description": "The unique grant identifier or reference.",
"type": "string",
"example": "BB/P0000001/1",
"user_friendly": "Grant ID"
},
"organization": {
"description": "The name of the funding organization.",
"type": "string",
"example": "Biotechnology and Biological Sciences Research Council (BBSRC); California Institute of Regenerative Medicine (CIRM)",
"user_friendly": "Funding organization"
}
HCA grant_id maps to BioStudies grant_id, and HCA organization maps to BioStudies Agency.
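The mapping described above can be sketched in a few lines of Python. This is only an illustration: the function name `hca_funder_to_biostudies` is hypothetical, and because the thread does not say where `grant_title` should go, this sketch simply leaves it out.

```python
def hca_funder_to_biostudies(funder: dict) -> dict:
    """Map one HCA funder object to a BioStudies-style funding entry.

    Only the two mappings stated in the thread are applied:
    HCA grant_id -> BioStudies grant_id, HCA organization -> BioStudies Agency.
    grant_title is not mapped here (assumption: no target field was named).
    """
    return {
        "grant_id": funder.get("grant_id", ""),
        "Agency": funder.get("organization", ""),
    }


hca_funder = {
    "grant_title": "Study of single cells in the human body.",
    "grant_id": "BB/P0000001/1",
    "organization": "Biotechnology and Biological Sciences Research Council (BBSRC)",
}
biostudies_funding = hca_funder_to_biostudies(hca_funder)
print(biostudies_funding)
```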
Point 2 is done. Please verify.
Example: HCA Biomaterial -> BioSamples Sample:
HCA Biomaterial:
{
"content": {
"describedBy": "http://schema.integration.data.humancellatlas.org/type/biomaterial/10.1.1/donor_organism",
"schema_type": "biomaterial",
"biomaterial_core": {
"biomaterial_id": "Q4_DEMO-donor_MGH30",
"biomaterial_name": "Q4 DEMO donor MGH30",
"biomaterial_description": "Description",
"ncbi_taxon_id": [
9606
]
},
"medical_history": {
"smoking_history": "yes"
},
"genus_species": [{
"text": "Homo sapiens",
"ontology": "NCBITaxon:9606",
"ontology_label": "label"
}],
"is_living": "no",
"sex": "unknown"
},
"submissionDate": "2018-09-12T02:59:58.368Z",
"updateDate": "2018-09-12T03:00:01.696Z",
"user": "anonymousUser",
"lastModifiedUser": "anonymousUser",
"uuid": {
"uuid": "d85ac5e2-733c-46ad-937f-db5d704aa177"
},
"events": [],
"accession": null,
"validationState": "Valid",
"validationErrors": []
}
BioSamples Sample:
{
"alias": "hca339",
"description": "Description",
"attributes": {
"HCA Biomaterial Type":[{
"value": "donor_organism"
}],
"HCA Biomaterial UUID":[{
"value": ""
}],
"Biomaterial - Biomaterial Core - Biomaterial Id": [{
"value": "Q4_DEMO-donor_MGH30"
}],
"Biomaterial - Is Living": [{
"value": "no"
}],
"Biomaterial - Medical History - Smoking History": [{
"value": "yes"
}],
"Biomaterial - Sex": [{
"value": "unknown"
}]
},
"releaseDate": "2018-09-12",
"sampleRelationships": [],
"taxonId": 9606,
"taxon": "Homo sapiens",
"title": "Q4 DEMO donor MGH30"
}
Test submission: http://wwwdev.ebi.ac.uk/biosamples/samples/SAMEA5574155
Point 2 is done. Please verify.
Looks great! The ontology links work for me.
@aaclan-ebi how hard would it be to do point 3 under "nice to haves"?
@malloryfreeberg I've pushed some changes to parse the contact name. If the contact name can't be parsed, an error is raised in the REPORT.json output of the archiver script.
HCA Project contributors
"contributors": [{
"contact_name": "First,Middle,Last",
"email": "dummy@email.com",
"institution": "Fake Institution"
},{
"contact_name": "First2,Middle2,Last2",
"email": "dummy2@email.com",
"institution": "Fake Institution2"
}],
BioStudies - Project contacts
"contacts": [{
"orcid": "",
"firstName": "First",
"middleInitials": "M",
"lastName": "Last",
"email": "dummy@email.com",
"address": "",
"affiliation": "Fake Institution",
"phone": ""
},{
"orcid": "",
"firstName": "First2",
"middleInitials": "M",
"lastName": "Last2",
"email": "dummy2@email.com",
"address": "",
"affiliation": "Fake Institution2",
"phone": ""
}],
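A minimal sketch of this kind of contact-name parsing, assuming the "First,Middle,Last" comma-separated format shown in the example. The function name `parse_contact_name` is hypothetical, and raising `ValueError` stands in for whatever the archiver script actually writes to REPORT.json.

```python
def parse_contact_name(contact_name: str) -> dict:
    """Split an HCA contact_name of the form "First,Middle,Last".

    Assumption: exactly three comma-separated parts; the middle part may be
    empty. Anything else is treated as unparseable.
    """
    parts = [part.strip() for part in contact_name.split(",")]
    if len(parts) != 3 or not parts[0] or not parts[2]:
        raise ValueError(f"Cannot parse contact name: {contact_name!r}")
    first, middle, last = parts
    # Reduce the middle name(s) to initials, e.g. "Middle" -> "M".
    initials = "".join(word[0].upper() for word in middle.split())
    return {"firstName": first, "middleInitials": initials, "lastName": last}


print(parse_contact_name("First,Middle,Last"))
```

A name that does not follow the three-part format, such as "First Last", raises an error instead of being guessed at, which matches the reported behaviour of flagging unparseable names.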
I also replaced the '|' delimiter with ',' in publication authors:
HCA Project publications
"publications": [{
"authors": [
"Vento-Tormo R",
"Efremova M",
"Botting RA",
"Turco MY",
"Vento-Tormo M",
"Meyer KB",
"Park J",
"Stephenson E",
"Polanski K",
"Payne RP",
"Goncalves A",
"Zou A",
"Henriksson J",
"Wood L",
"Lisgo S",
"Filby A",
"Wright GJ",
"Stubbington MJ",
"Haniffa M",
"Moffett A",
"Teichmann SA"
],
"publication_title": "Reconstructing the human first trimester fetal-maternal interface using single cell transcriptomics",
"doi": "10.1101/429589",
"publication_url": "https://www.biorxiv.org/content/early/2018/09/29/429589"
}]
BioStudies Project publications
"publications": [{
"pubmedId": "",
"doi": "10.1101/429589",
"articleTitle": "Reconstructing the human first trimester fetal-maternal interface using single cell transcriptomics",
"authors": "Vento-Tormo R , Efremova M , Botting RA , Turco MY , Vento-Tormo M , Meyer KB , Park J , Stephenson E , Polanski K , Payne RP , Goncalves A , Zou A , Henriksson J , Wood L , Lisgo S , Filby A , Wright GJ , Stubbington MJ , Haniffa M , Moffett A , Teichmann SA"
}]
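The publication conversion shown above could look roughly like the following. This is a sketch, not the archiver's code: the function name is hypothetical, the " , " delimiter is copied from the example output, and `pubmedId` is left empty as in that example.

```python
def hca_publication_to_biostudies(publication: dict, delimiter: str = " , ") -> dict:
    """Convert one HCA publication object to the BioStudies shape shown above."""
    return {
        "pubmedId": "",  # left empty, as in the example output
        "doi": publication.get("doi", ""),
        "articleTitle": publication.get("publication_title", ""),
        # The HCA author list becomes a single delimited string.
        "authors": delimiter.join(publication.get("authors", [])),
    }


hca_publication = {
    "authors": ["Vento-Tormo R", "Efremova M", "Botting RA"],
    "publication_title": "Reconstructing the human first trimester fetal-maternal interface using single cell transcriptomics",
    "doi": "10.1101/429589",
}
converted = hca_publication_to_biostudies(hca_publication)
print(converted["authors"])
```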
Point 2 update:
Example: HCA Biomaterial -> BioSamples Sample:
HCA Biomaterial:
{
"content": {
"describedBy": "http://schema.integration.data.humancellatlas.org/type/biomaterial/10.1.1/donor_organism",
"schema_type": "biomaterial",
"biomaterial_core": {
"biomaterial_id": "Q4_DEMO-donor_MGH30",
"biomaterial_name": "Q4 DEMO donor MGH30",
"biomaterial_description": "Description",
"ncbi_taxon_id": [
9606
]
},
"medical_history": {
"smoking_history": "yes"
},
"genus_species": [{
"text": "Homo sapiens",
"ontology": "NCBITaxon:9606",
"ontology_label": "label"
}],
"is_living": "no",
"sex": "unknown"
},
"submissionDate": "2018-09-12T02:59:58.368Z",
"updateDate": "2018-09-12T03:00:01.696Z",
"user": "anonymousUser",
"lastModifiedUser": "anonymousUser",
"uuid": {
"uuid": "d85ac5e2-733c-46ad-937f-db5d704aa177"
},
"events": [],
"accession": null,
"validationState": "Valid",
"validationErrors": []
}
BioSamples Sample:
{
"alias": "hca339",
"description": "Description",
"attributes": {
"HCA Biomaterial Type":[{
"value": "donor_organism"
}],
"HCA Biomaterial UUID":[{
"value": ""
}],
"Biomaterial Core - Biomaterial Id": [{
"value": "Q4_DEMO-donor_MGH30"
}],
"Is Living": [{
"value": "no"
}],
"Medical History - Smoking History": [{
"value": "yes"
}],
"Sex": [{
"value": "unknown"
}]
},
"releaseDate": "2018-09-12",
"sampleRelationships": [],
"taxonId": 9606,
"taxon": "Homo sapiens",
"title": "Q4 DEMO donor MGH30"
}
Test submission: http://wwwdev.ebi.ac.uk/biosamples/samples/SAMEA5574183
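The difference between the qualified and unqualified attribute names in the two examples can be reproduced with a small flattening routine. The sketch below is an assumption about how such a conversion might work, not the actual archiver implementation. The names `to_label` and `flatten_attributes` are hypothetical, the sample input contains only the fields that end up as attributes, and list values (such as ontology terms) are left out.

```python
def to_label(field_name: str) -> str:
    """Turn a snake_case field name into a title-cased label."""
    return " ".join(word.capitalize() for word in field_name.split("_"))


def flatten_attributes(content: dict, prefix: tuple = ()) -> dict:
    """Flatten nested HCA content into BioSamples-style attributes."""
    attributes = {}
    for key, value in content.items():
        path = prefix + (to_label(key),)
        if isinstance(value, dict):
            attributes.update(flatten_attributes(value, path))
        elif isinstance(value, (str, int, float, bool)):
            attributes[" - ".join(path)] = [{"value": str(value)}]
        # Lists are skipped in this sketch.
    return attributes


content = {
    "biomaterial_core": {"biomaterial_id": "Q4_DEMO-donor_MGH30"},
    "medical_history": {"smoking_history": "yes"},
    "is_living": "no",
    "sex": "unknown",
}
qualified = flatten_attributes(content, prefix=("Biomaterial",))
unqualified = flatten_attributes(content)
print(sorted(qualified))
print(sorted(unqualified))
```

Passing the "Biomaterial" prefix yields the fully-qualified names from the first example; omitting it yields the shorter names from the update.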
Nice to have # 3 project tag in BioSamples is done.
Must-have # 4 (Pass HCA Funders metadata from HCA to USI) is done.
@malloryfreeberg For Nice-to-have # 4 (Remove HCA Wranglers and external curators from list of authors submitted to BioStudies) Is there a field in the HCA contributor metadata that we could check to know who to include as BioStudies contact?
[minor comment]
Re ontology links it'd be better to use the full URI
"ontology": "http://purl.obolibrary.org/obo/NCBITaxon_9606",
instead of
"ontology": "NCBITaxon:9606",
(note the underscore vs colon in the CURIE)
@malloryfreeberg For Nice-to-have # 4 (Remove HCA Wranglers and external curators from list of authors submitted to BioStudies) Is there a field in the HCA contributor metadata that we could check to know who to include as BioStudies contact?
@aaclan-ebi project_role
[minor comment] Re ontology links it'd be better to use the full URI
"ontology": "http://purl.obolibrary.org/obo/NCBITaxon_9606",
instead of
"ontology": "NCBITaxon:9606",
(note the underscore vs colon in the CURIE)
Our current conversion to ontology fields in BioSamples JSON is:
"organism" : [ {
"text" : "Homo sapiens",
"ontologyTerms" : [ "http://purl.obolibrary.org/obo/NCBITaxon_9606" ]
} ],
So I think this is OK for now.
In the HCA we use "ontology": "NCBITaxon:9606", and I'm not sure if we have plans to store the URI instead or keep the colon usage and build the URI if needed. @daniwelter might know.
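If the colon form is kept and the URI is built on demand, the conversion is a short string operation. The sketch below assumes the term comes from an OBO Foundry ontology such as NCBITaxon, whose PURLs follow the `http://purl.obolibrary.org/obo/PREFIX_ID` pattern. Ontologies hosted elsewhere (EFO, for example) use a different base URI and would need their own handling. The function name is hypothetical.

```python
OBO_BASE = "http://purl.obolibrary.org/obo/"


def curie_to_obo_uri(curie: str) -> str:
    """Build an OBO PURL from a CURIE such as "NCBITaxon:9606".

    Assumption: the prefix belongs to an ontology served under OBO_BASE.
    """
    prefix, separator, local_id = curie.partition(":")
    if not separator or not prefix or not local_id:
        raise ValueError(f"Not a CURIE: {curie!r}")
    # The colon of the CURIE becomes an underscore in the URI.
    return f"{OBO_BASE}{prefix}_{local_id}"


print(curie_to_obo_uri("NCBITaxon:9606"))
```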
For BioStudies project contacts, is it safe to exclude contributors whose project role is "Human Cell Atlas wrangler", or should I just search for the "wrangler" keyword (which might not be safe either)? Hmm, this is not an enum field, right?
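The two options in this question behave differently on free-text roles, which a small sketch makes concrete. It assumes `project_role` is a plain string on each contributor. The function names and the sample roles other than "Human Cell Atlas wrangler" are made up for illustration.

```python
EXCLUDED_ROLES = {"human cell atlas wrangler"}


def is_excluded_exact(contributor: dict) -> bool:
    """Exclude only contributors whose role matches a known value exactly."""
    role = (contributor.get("project_role") or "").strip().lower()
    return role in EXCLUDED_ROLES


def is_excluded_keyword(contributor: dict, keyword: str = "wrangler") -> bool:
    """Exclude any contributor whose role mentions the keyword."""
    role = (contributor.get("project_role") or "").lower()
    return keyword in role


def filter_contacts(contributors: list, is_excluded) -> list:
    """Keep the contributors that should become BioStudies contacts."""
    return [c for c in contributors if not is_excluded(c)]


contributors = [
    {"contact_name": "A,,One", "project_role": "principal investigator"},
    {"contact_name": "B,,Two", "project_role": "Human Cell Atlas wrangler"},
    {"contact_name": "C,,Three", "project_role": "data wrangler"},  # hypothetical role
]
exact_kept = filter_contacts(contributors, is_excluded_exact)
keyword_kept = filter_contacts(contributors, is_excluded_keyword)
print(len(exact_kept), len(keyword_kept))
```

Exact matching misses spelling variants of the role, while keyword matching may drop a genuine project member whose role happens to contain the word. Neither is fully safe unless the field is constrained to a fixed set of values.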
The following items are considered "must-haves" for being able to submit HCA prod datasets to ENA, BioStudies, and BioSamples prod archives.
1. Release date (ENA) - The release date for all submissions must be set to the HCA submission date; otherwise, the submissions will not be viewable (i.e. will be private) in the archive. Currently ENA release is set for 2 years in the future. It is unclear if this issue should be addressed by the USI or ENA.
2. Qualified metadata field names (BioSamples) - Currently, the HCA metadata field names submitted through USI to BioSamples are the fully-qualified names, which makes them unwieldy and hard to read. The unqualified field names should be submitted instead. Also would like to format the HCA ontologized field so there isn't as much duplication. I think this is an issue to be addressed by HCA.
   ontology_label field and make sure the text and ontology fields are adjacent.
3. HCA project accession (BioStudies) - Currently, the HCA Project UUID is embedded in the "alias" field in the BioStudies submission. Other archives have dedicated HCA UUID fields (e.g. "HCA Biomaterial UUID" in BioSamples). We should be submitting an "HCA Project UUID" key:value to BioStudies to uniquely reference the HCA Project. The BioSamples "alias" field is not required by BioStudies. Now being tracked in https://github.com/HumanCellAtlas/ingest-central/issues/344
4. HCA funding and publication information (BioStudies) - Need to test whether this information in an HCA project gets correctly submitted to BioStudies. If it doesn't, need to figure out how to do it.
5. Fix HCA contact ORCID IDs regular expression that is validated in USI.
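For the ORCID item, the thread does not show the regular expression USI actually uses, so the following is only a reference sketch of what a valid ORCID iD looks like. It assumes the bare 16-character form (four hyphen-separated groups, last character a digit or X) rather than the full https://orcid.org/ URL, and it also verifies the ISO 7064 MOD 11-2 check character that ORCID iDs carry.

```python
import re

ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check both the format and the check character of a bare ORCID iD."""
    if not ORCID_PATTERN.match(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_character(digits[:15]) == digits[15]


print(is_valid_orcid("0000-0002-1825-0097"))
```

A format-only regular expression accepts iDs with a wrong final character, so combining the pattern with the checksum catches simple typos as well.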
The following items are considered "nice-to-haves" for being able to submit HCA prod datasets to ENA, BioStudies, and BioSamples prod archives.
HCA metadata duplicated (ENA) - Project and experiment XML contain duplicate metadata where specific ENA fields were set based on HCA metadata. The HCA metadata that was used to set these fields is also presented.
HCA project tag (BioSamples)
Currently looks like this:
Should look like this:
Remove contributors who are not part of the actual project from list of authors submitted to BioStudies. Now being tracked in https://github.com/HumanCellAtlas/ingest-central/issues/345
Release date (ENA) - The release date for all submissions must be set to the HCA submission date; otherwise, the submissions will not be viewable (i.e. will be private) in the archive. Currently ENA release is set for 2 years in the future. It is unclear if this issue should be addressed by the USI or ENA. Now being tracked in https://github.com/HumanCellAtlas/ingest-central/issues/346
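The release-date requirement amounts to deriving a calendar date from the HCA submission timestamp. A minimal sketch, assuming the timestamp format seen in the examples ("2018-09-12T02:59:58.368Z") and taking the date part as-is. The function name is hypothetical, and where the value is set (USI or ENA) is left open, as in the item above.

```python
from datetime import datetime


def release_date_from_submission(submission_date: str) -> str:
    """Derive a YYYY-MM-DD release date from an HCA submissionDate string."""
    submitted = datetime.strptime(submission_date, "%Y-%m-%dT%H:%M:%S.%fZ")
    return submitted.date().isoformat()


print(release_date_from_submission("2018-09-12T02:59:58.368Z"))
```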