chanzuckerberg / single-cell

A collection of documents that reflect various design decisions that have been made for the cellxgene project.
MIT License

Data consumers can use publication metadata to find datasets in the portal #216

Closed brianraymor closed 2 years ago

brianraymor commented 3 years ago

Stories

Based on recent UX Research - Be Confident About Dataset Quality, Ambrose observed in a conversation on single-cell-data-wrangling:

  1. A data consumer looking for datasets wants to limit their search to reputable scientists whose work they trust.
  2. A data consumer with a question about a dataset wants to know who to contact.
  3. A data consumer looking for a specific dataset wants to be able to find it by its 'colloquial name', following the pattern (first author, year) e.g. 'Azizi 2018'

Story 1 requires a full author list. In practice, scientists usually limit their search to first/co-first and senior/co-senior authors.

For story 2, the contact is either the corresponding author(s) or the "contributor", who could be a curator. Creating clear UX around who should be contacted while respecting data ownership could be challenging.

There is a contact name in the collection for this purpose.

Story 3 requires that first authors, last authors, and publication year be searchable.

Note: for sites that allow readers to download a citation, such as "Stress-induced RNA–chromatin interactions promote endothelial dysfunction", the RIS format defines an ID tag, documented as the Reference ID for the publication, which is the colloquial name described above. However, there is consensus to use a summary citation format instead:

Last name of first author (Publication Year) Journal abbreviation, such as Ren et al. (2021) Cell.


UX Design

[Figures: Create Collection Publication DOI link; A Collection Publication; Dataset Drawer]


Product Design

For framework developers querying the cellxgene Portal API for a Collection DOI to pass as a parameter to services which return publication metadata including authors and publication date, there are currently some minor issues with the DOI values that require a bit more parsing:

'https://www.doi.org/10.1126/science.aba7721'  # www.doi.org instead of doi.org
' https://doi.org/10.1101/2020.03.31.016972'   # leading space not stripped

It's also not helpful that the portal requires and stores the full URL because both scheme and domain (https://doi.org) must be stripped before the DOI can be passed to such services:

https://api.crossref.org/works/10.1016/j.cell.2021.01.053
https://api.meta.org/work/doi:10.1016/j.cell.2021.01.053
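
The extra parsing described above can be sketched as follows. This is a minimal sketch, not the portal's implementation: the function name `doi_from_link` and the `DOI_HOSTS` set are hypothetical, and `dx.doi.org` is included as an assumption because it is a legacy resolver host, not because it appears in the examples above.

```python
from urllib.parse import urlparse

# Hosts accepted as DOI resolvers. The first two come from the examples above;
# dx.doi.org is an assumption (legacy resolver host).
DOI_HOSTS = {"doi.org", "www.doi.org", "dx.doi.org"}


def doi_from_link(link: str) -> str:
    """Return the bare DOI (e.g. '10.1126/science.aba7721') from a stored DOI link."""
    cleaned = link.strip()  # handles the leading-space case
    parsed = urlparse(cleaned)
    if parsed.scheme in ("http", "https") and parsed.netloc.lower() in DOI_HOSTS:
        # The DOI is the URL path without its leading slash.
        return parsed.path.lstrip("/")
    # Not a recognized DOI URL: assume the value is already a bare DOI.
    return cleaned


print(doi_from_link("https://www.doi.org/10.1126/science.aba7721"))
print(doi_from_link(" https://doi.org/10.1101/2020.03.31.016972"))
```

The sketch does not handle DOIs that contain `?` or `#`, which `urlparse` would split into a query or fragment.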

Feb 22 2022: There was agreement to continue with the current modeling of the DOI as a URL for consistency with the other links. We can revisit whether we want to return a DOI curie (in a separate section of the response) in a future API update. The new code also guarantees that the scheme and domain are https://doi.org.

Changes to the Create Collection UX

The Create Collection UX must be updated to:

  1. Replace the DOI link with Publication DOI to clarify our intentions. See the related thread on single-cell-data-wrangling.
  2. Replace the full URL with a DOI curie
  3. Only allow one Publication DOI link to be added to the collection

curie := [ [ prefix ] ':' ] reference

The UX prompts with a read-only 'doi:' prefix and separator. The curator adds the reference. For example, '10.1016/j.cell.2021.01.053'.
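
As an illustration of the curie form, the sketch below joins the read-only prefix with the curator's reference and expands the result back to the URL form that the portal stores. The helper names `to_curie` and `curie_to_url` are hypothetical, and the validation is deliberately minimal.

```python
DOI_PREFIX = "doi"


def to_curie(reference: str) -> str:
    """Combine the read-only prefix and separator with the curator-supplied reference."""
    return f"{DOI_PREFIX}:{reference.strip()}"


def curie_to_url(curie: str) -> str:
    """Expand a doi curie to the https://doi.org URL form."""
    # Split on the first ':' only, since a DOI reference may itself contain colons.
    prefix, separator, reference = curie.partition(":")
    if not separator or prefix != DOI_PREFIX or not reference:
        raise ValueError(f"not a doi curie: {curie!r}")
    return f"https://doi.org/{reference}"


curie = to_curie("10.1016/j.cell.2021.01.053")
print(curie)
print(curie_to_url(curie))
```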

When a curator adds a Publication DOI link to a collection, publication metadata is acquired by issuing a Crossref query for the DOI and then parsing the successful JSON response (or see XML response):

    "published": {
      "date-parts": [
        [
          2016,
          1,
          4
        ]
      ]
    },

    ...

    "author": [
      {
        "given": "Bosiljka",
        "family": "Tasic",
        "sequence": "first",
        "affiliation": []
      },
      {
        "given": "Vilas",
        "family": "Menon",
        "sequence": "additional",
        "affiliation": []
      },
      {
        "ORCID": "http://orcid.org/0000-0002-6466-5883",
        "authenticated-orcid": false,
        "given": "Thuc Nghi",
        "family": "Nguyen",
        "sequence": "additional",
        "affiliation": []
      },
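
The Crossref lookup described above could be sketched with the standard library as shown below. The function names and the timeout value are assumptions made for illustration; the only detail taken from this issue is the `https://api.crossref.org/works/` URL form and the `message` object in the JSON response.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi: str) -> str:
    """Build the Crossref works URL for a bare DOI."""
    # Keep '/' unescaped to match the URL form shown earlier in this issue.
    return CROSSREF_WORKS + quote(doi, safe="/")


def fetch_message(doi: str, timeout: float = 10.0) -> dict:
    """Query Crossref and return the 'message' object of a successful response.

    urlopen raises urllib.error.HTTPError for an unsuccessful status (for
    example when Crossref does not know the DOI), so the caller decides how
    to apply the portal's failure policy.
    """
    with urlopen(crossref_url(doi), timeout=timeout) as response:
        return json.load(response)["message"]


print(crossref_url("10.1016/j.cell.2021.01.053"))
```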

Changes to Edit Details UX for both private collections and private revisions of public collections

Edit Details UX MUST be updated to:

  1. Add a Publication DOI using the requirements (only one DOI per collection) and process described in Changes to the Create Collection UX
  2. Update a Publication DOI using the process described in Changes to the Create Collection UX
  3. Delete a Publication DOI and its related publication metadata

Note: The portal needs a policy for Crossref failures which may be due to pending publications.


Required publication metadata

The following metadata is REQUIRED when a DOI is available:

  1. Authors [in order]
  2. Publication month, day, and year
  3. Publication journal [abbreviated is preferred]

author

The ordered list of authors must be stored in the database. It should be simple for a portal query to subsequently extract the primary author's last name for use in a citation format.

Feb 8 2022 Update see #single-cell-filter-by-metadata

In most cases, authors are individual scientists modeled as given, family in Crossref:

{
      "given": "Golnaz",
      "family": "Vahedi",
      "sequence": "additional",
      "affiliation": []
}

but sometimes authors also include consortia modeled as name:

{
      "name": "the HPAP Consortium",
      "sequence": "additional",
      "affiliation": []
}
  1. If name is the first author, ensure that it’s captured for use in the summary citation.
  2. name(s) will not be included in the author filter, only individual scientists.

# assuming a successful https request - response is the request.json()

message = response['message']
# the ordered list of authors
authors = message['author']
# primary author's last name, or the consortium name when the first author is a consortium
primary_author = authors[0]
display(primary_author.get('family') or primary_author['name'])
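
The two consortium rules above can be sketched as a pair of small helpers. The helper names are hypothetical, and the sample list is made up for illustration: it reuses the records shown earlier but marks the consortium as the first author to exercise rule 1.

```python
# Sample entries shaped like the Crossref author records above (illustrative only).
authors = [
    {"name": "the HPAP Consortium", "sequence": "first", "affiliation": []},
    {"given": "Golnaz", "family": "Vahedi", "sequence": "additional", "affiliation": []},
]


def citation_name(author: dict) -> str:
    """Family name for an individual, otherwise the consortium name."""
    return author.get("family") or author.get("name", "")


def filterable_authors(author_list: list) -> list:
    """Individuals only: entries without a 'family' key (consortia) are excluded."""
    return [author for author in author_list if "family" in author]


print(citation_name(authors[0]))
print([author["family"] for author in filterable_authors(authors)])
```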

preprint

There is some conditional behavior that is dependent on whether the DOI is a preprint or a journal publication.

# is this a preprint? (use .get because subtype may be absent from the response)
is_preprint = message.get('subtype') == "preprint"

published

The publication month, day, and year must be stored.

In order of preference when multiple are included in the response:

  1. published-print
  2. published
  3. published-online

From Date

Field | Type | Required | Description
date-parts | Array of Number | Yes | Contains an ordered array of year, month, day of month. Note that the field contains a nested array, e.g. [ [ 2006, 5, 19 ] ] to conform to citeproc JSON dates

Feb 8 2022 Update see #single-cell-filter-by-metadata

  1. If publication month is unavailable, default to "1".
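  2. If publication day is unavailable, default to "1".

# publication dates, in order of preference
published_date = {}

if 'published-print' in message:
    published_date = message['published-print']
elif 'published' in message:
    published_date = message['published']
elif 'published-online' in message:
    published_date = message['published-online']

# date-parts is a nested array such as [[2006, 5, 19]]; month and day may be missing
date_parts = published_date['date-parts'][0]
# year of publication
year = date_parts[0]
# month of publication, defaulting to 1 when unavailable
month = date_parts[1] if len(date_parts) > 1 else 1
# day of publication, defaulting to 1 when unavailable
day = date_parts[2] if len(date_parts) > 2 else 1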

container

In order of preference when multiple non-empty values are included in the response:

  1. short-container-title
  2. container-title
  3. institution [we should validate whether preprints only have this value set]

journal = ""

# noticed empty values for containers
if 'short-container-title' in message and message['short-container-title']:
    journal = message['short-container-title'][0]
elif 'container-title' in message and message['container-title']:
    journal = message['container-title'][0]
elif 'institution' in message and message['institution']:
    journal = message['institution'][0]['name']

display(journal)

Changes to the A Collection UX

The publication metadata is used to create a summary citation.

The Publication field in A Collection replaces the previous DOI field. Its value is an anchor element that is composed of the DOI href with the summary citation as the human-readable name:

<a href="https://doi.org/111.222">Ren et al. (2021) Cell</a>
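
A minimal sketch of assembling the summary citation and the anchor element from stored metadata follows. The function names are hypothetical, and appending "et al." only when there is more than one author is an assumption that this issue does not state.

```python
from html import escape


def summary_citation(first_author: str, author_count: int, year: int, journal: str) -> str:
    """Format a summary citation such as 'Ren et al. (2021) Cell'."""
    # Assumption: 'et al.' is appended only when there is more than one author.
    name = f"{first_author} et al." if author_count > 1 else first_author
    return f"{name} ({year}) {journal}"


def publication_anchor(doi: str, citation: str) -> str:
    """Anchor element with the DOI href and the summary citation as its text."""
    href = f"https://doi.org/{doi}"
    # Escape both values so that metadata text cannot break the markup.
    return f'<a href="{escape(href, quote=True)}">{escape(citation)}</a>'


print(publication_anchor("111.222", summary_citation("Ren", 2, 2021, "Cell")))
```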

Changes to the Dataset Drawer UX

Similar to A Collection, the dataset drawer needs to be refreshed for DOI and Publication changes.

further metadata (placeholder)

# volume, issue, and page may be absent from the response, so use .get
display(message.get('volume'))
display(message.get('issue'))
display(message.get('page'))
display(message['title'][0])

Automatically updating a preprint DOI

The issue "Update the DOI in the A single-cell transcriptional roadmap of the mouse and human lymph node lymphatic vasculature collection" is an example of an updated DOI that was discovered during prototyping. If an existing preprint DOI is queried again AND it has been published since the previous query, then Crossref returns the published DOI in the is-preprint-of relation:

if is_preprint:
    try:
        published_doi = message['relation']['is-preprint-of']
        # the new DOI to query for ...
        if published_doi[0]['id-type'] == 'doi':
            display(published_doi[0]['id'])
    except (KeyError, IndexError):
        # no published DOI is linked to this preprint (yet)
        pass

This would allow the portal to refresh preprint DOI(s) with their published DOI(s) on a regular cadence.

September 2 Update: Stanford discovered a case where the publishers failed to update the relationship between a preprint DOI and its publication DOI.

brianraymor commented 2 years ago

Per the call, I updated:

Required publication metadata

  1. Publication month and year

to

  1. Publication month, day, and year