IDR / idr-metadata

Curated metadata for all studies published in the Image Data Resource
https://idr.openmicroscopy.org

Study publication: metadata unification #375

Open sbesson opened 5 years ago

sbesson commented 5 years ago

Status

The gallery UI work carried in prod67 (see https://github.com/openmicroscopy/design/issues/100 and image.sc post) also drove the re-annotation of published IDR studies. In particular the Study Type and Study Public Release Date metadata fields were reviewed across all studies and a new Sample Type field was added to classify each study as cell or tissue.

One field that was discussed but not fixed or rationalized in prod67 is Publication Authors. At the moment, we support different naming schemes, and downstream consumers like the gallery UI need to handle these variants.

Proposal

All IDR studies with an associated peer-reviewed publication have a PubMed ID. A natural proposal would be to unify the author naming scheme to comply with what PubMed stores.

To minimize the impact on submitters, templates should be updated with the recommended formatting for Study Author List values: LastName1 Initials1, LastName2 Initials2, .... The author list should be stored as a comma-separated list of authors, e.g.

Walther N, Hossain MJ, Politi AZ, Koch B, Kueblbeck M, Ødegård-Fougner Ø, Lampe M, Ellenberg J
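The recommended format can be checked mechanically. The following is a minimal sketch, assuming each author is a last name followed by a single space and a block of uppercase initials. The helper names `split_authors` and `is_valid_author` are hypothetical and not part of the IDR tooling, and the check deliberately ignores edge cases such as name suffixes.

```python
def split_authors(author_list):
    """Split a comma-separated author list into trimmed author names."""
    return [a.strip() for a in author_list.split(",") if a.strip()]


def is_valid_author(author):
    """Check that an author looks like 'LastName Initials'.

    The initials are taken to be everything after the last space and must
    consist only of uppercase letters (non-ASCII letters such as 'Ø' count).
    """
    last_name, sep, initials = author.rpartition(" ")
    return (
        bool(sep)
        and bool(last_name.strip())
        and initials.isalpha()
        and initials.isupper()
    )


authors = split_authors(
    "Walther N, Hossain MJ, Politi AZ, Koch B, Kueblbeck M, "
    "Ødegård-Fougner Ø, Lampe M, Ellenberg J"
)
invalid = [a for a in authors if not is_valid_author(a)]
print("Invalid authors:", invalid)
```

Using `str.isupper()` rather than an `[A-Z]` pattern keeps initials such as `Ø` valid.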

Validation

The NCBI API can be used for validating a lot of the publication metadata (title, authors, PMC and DOI if applicable) given a PubMed ID:

+    def validate_publications(self):
+       URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
+       QUERY = "?db=pubmed&id=%s&retmode=json" 
+
+       for publication in self.study["Publications"]:
+           if "PubMed ID" not in publication:
+               continue
+           json = requests.get(URL + QUERY % publication["PubMed ID"]).json()
+           result = json['result'][publication["PubMed ID"]]
+
+           self.log.debug("Validating publication title")
+           assert publication["Title"] == result['title'], "%s != %s" % (
+               publication["Title"], result['title'])
+
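+           self.log.debug("Validating publication author")
+           # Assumption: the parsed study stores authors under "Author List"
+           authors = ", ".join(a['name'] for a in result['authors'])
+           assert publication["Author List"] == authors, "%s != %s" % (
+               publication["Author List"], authors)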
+
+           # Validate PMC ID and DOI if present
+           for articleid in result['articleids']:
+               articleids_map = {"pmc": "PMC ID", 'doi': "DOI"}
+               if articleid['idtype'] in articleids_map.keys():
+                   study_key = articleids_map[articleid['idtype']]
+                   self.log.debug("Validating %s" % study_key)
+                   assert publication[study_key] == articleid['value'], (
+                       "%s != %s" % (
+                       publication[study_key], articleid['value']))
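The comparison logic can also be separated from the HTTP request so that it can be exercised without network access. This is a rough sketch rather than the parser's actual implementation: the function name `compare_with_esummary` is hypothetical, the `"Author List"` key is an assumption about how the parsed study stores authors, and the sample record uses placeholder values shaped like one entry of the ESummary JSON `result` object.

```python
def compare_with_esummary(publication, result):
    """Return a list of mismatches between a study publication and one
    ESummary record; the list is empty when everything is consistent."""
    mismatches = []

    def check(key, expected):
        # Only compare keys that the study actually provides
        if key in publication and publication[key] != expected:
            mismatches.append(
                "%s: %r != %r" % (key, publication[key], expected))

    check("Title", result["title"])
    # Assumption: authors are stored under the "Author List" key
    check("Author List",
          ", ".join(a["name"] for a in result.get("authors", [])))

    # Validate PMC ID and DOI if present
    id_keys = {"pmc": "PMC ID", "doi": "DOI"}
    for articleid in result.get("articleids", []):
        key = id_keys.get(articleid["idtype"])
        if key:
            check(key, articleid["value"])
    return mismatches


# Placeholder data, not a real PubMed record
sample_result = {
    "title": "An example title",
    "authors": [{"name": "Walther N"}, {"name": "Ellenberg J"}],
    "articleids": [
        {"idtype": "pubmed", "value": "1"},
        {"idtype": "doi", "value": "10.0000/example"},
    ],
}
sample_publication = {
    "Title": "An example title",
    "Author List": "Nike Walther, Jan Ellenberg",
    "PubMed ID": "1",
    "DOI": "10.0000/example",
}
for mismatch in compare_with_esummary(sample_publication, sample_result):
    print(mismatch)
```

Collecting mismatches instead of asserting reports every inconsistency in one pass, and skipping keys the study does not provide avoids a `KeyError` when, for example, no PMC ID was submitted.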

Database and UI representation

At the moment, publications are included in the idr.openmicroscopy/study/info annotation as an ordered list of key/value pairs (Title, Authors, PubMed ID, PMC ID if applicable, DOI if applicable), one per publication:

[Screenshot of the study info annotation, 2019-06-18]

For the gallery or any downstream application to consume this metadata effectively, we might need to rethink how to store and expose the publication metadata.
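To illustrate what a consumer currently has to do, the flat ordered list can be regrouped into one record per publication. This is a minimal sketch, assuming each publication's pairs start with a title key. The key names, the `group_publications` helper and the sample values are hypothetical placeholders rather than real IDR data.

```python
def group_publications(pairs, first_key="Publication Title"):
    """Regroup an ordered list of (key, value) pairs into one dict per
    publication, starting a new dict each time first_key is seen."""
    publications = []
    for key, value in pairs:
        if key == first_key or not publications:
            publications.append({})
        publications[-1][key] = value
    return publications


# Placeholder key/value pairs for a study with two publications
pairs = [
    ("Publication Title", "First example title"),
    ("Publication Authors", "Walther N, Ellenberg J"),
    ("PubMed ID", "1"),
    ("Publication Title", "Second example title"),
    ("Publication Authors", "Koch B"),
    ("PubMed ID", "2"),
    ("Publication DOI", "10.0000/example"),
]
for publication in group_publications(pairs):
    print(publication)
```

The grouping relies entirely on key order, which is the kind of implicit convention a more structured representation would remove.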

manics commented 5 years ago

If a PubMed ID is supplied, could we dispense with a lot of the other related metadata and pull it out automatically using the PubMed API?

For authors I think either

sbesson commented 5 years ago

If a PubMed ID is supplied, I would minimally update the parser to ensure the metadata is consistent with the PubMed API. I am less sure about dispensing with it, though, especially as most studies are submitted before peer-reviewed acceptance anyway.

The main problem I see with one map annotation per author is the case of studies with multiple publications (like the one above) as you lose the author/publication relationship.

manics commented 5 years ago

> The main problem I see with one map annotation per author is the case of studies with multiple publications (like the one above) as you lose the author/publication relationship.

True, but the purpose of the IDR is to publish datasets, not publications. I think it's reasonable to say that the reason for including individual authors is so you can look up a dataset associated with them. I can't think of a good use case where someone would want to go author ⇔ publication, as opposed to author ⇔ dataset / publication ⇔ dataset, in the IDR.
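The author ⇔ dataset lookup could be sketched as an inverted index that flattens authors across all publications of a study. This is only an illustration of the trade-off: the study names, the `"authors"` key and the `index_studies_by_author` helper are hypothetical.

```python
from collections import defaultdict


def index_studies_by_author(studies):
    """Map each author to the sorted list of studies they appear in,
    ignoring which publication the author came from."""
    index = defaultdict(set)
    for study_name, publications in studies.items():
        for publication in publications:
            for author in publication["authors"].split(","):
                author = author.strip()
                if author:
                    index[author].add(study_name)
    return {author: sorted(names) for author, names in index.items()}


# Placeholder studies; the first one has two publications
studies = {
    "idr0000-example": [
        {"authors": "Walther N, Ellenberg J"},
        {"authors": "Ellenberg J, Koch B"},
    ],
    "idr0001-example": [
        {"authors": "Koch B"},
    ],
}
author_index = index_studies_by_author(studies)
print(author_index["Koch B"])
```

The index answers "which datasets is this author associated with" but no longer records which of a study's publications an author belongs to, which is exactly the relationship discussed in this thread.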

sbesson commented 5 years ago

Extensively discussed the relationship between study and authors this morning with @jburel @jrswedlow @francesw @dominikl @pwalczysko and @will-moore . Below is a summary of the current IDR model:

From the discussion, there is general agreement on the value of modelling, capturing and representing the concept of Study Authors. In a large majority of the studies, these might be the same as the authors of the associated publication, but this needs more design. A few immediate questions: