chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

Add citation for data reuse policy #412

Closed brianraymor closed 1 year ago

brianraymor commented 1 year ago

Design

uns (Dataset Metadata)

...

When a dataset is uploaded, CELLxGENE Discover MUST automatically set its citation for data reuse. Curators MUST NOT annotate the following key.

citation (Take 2)

Key citation
Annotator CELLxGENE Discover
Value str. For example:

"Publication: https://doi.org/10.1126/science.abl4896 Dataset Version: https://datasets.cellxgene.cziscience.com/dbd8b789-3efa-4a63-9243-90cff64f2045.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/e5f58829-1a66-40b5-a624-9046778e74f5"

Template for citations:

| Citation Element | Value |
| --- | --- |
| Publication: | Publication DOI for the collection. This element MUST be present if the collection includes a publication DOI; otherwise, it MUST NOT be present. |
| Dataset Version: | Permanent URL to the version of the dataset |
| curated and distributed by CZ CELLxGENE Discover in Collection: | Permanent URL to the collection |
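As a rough illustration of the template, the following is a minimal sketch of how a citation string could be assembled. The function name `build_citation`, its parameters, and the base-URL constants are hypothetical and chosen only for this example; the URL shapes are taken from the example value above, not from the actual CELLxGENE Discover code.

```python
from typing import Optional

# Base URLs inferred from the example citation; treat them as assumptions.
DOI_BASE_URL = "https://doi.org"
DATASET_BASE_URL = "https://datasets.cellxgene.cziscience.com"
COLLECTION_BASE_URL = "https://cellxgene.cziscience.com/collections"


def build_citation(
    dataset_version_id: str,
    collection_id: str,
    publication_doi: Optional[str] = None,
) -> str:
    """Assemble a citation string following the template (hypothetical helper)."""
    parts = []
    # The Publication element is present only when the collection has a DOI.
    if publication_doi:
        parts.append(f"Publication: {DOI_BASE_URL}/{publication_doi}")
    parts.append(f"Dataset Version: {DATASET_BASE_URL}/{dataset_version_id}.h5ad")
    parts.append(
        "curated and distributed by CZ CELLxGENE Discover in Collection: "
        f"{COLLECTION_BASE_URL}/{collection_id}"
    )
    return " ".join(parts)


citation = build_citation(
    dataset_version_id="dbd8b789-3efa-4a63-9243-90cff64f2045",
    collection_id="e5f58829-1a66-40b5-a624-9046778e74f5",
    publication_doi="10.1126/science.abl4896",
)
print(citation)
```

Called with the IDs and DOI from the example, this sketch reproduces the example value; omitting `publication_doi` drops the Publication element entirely.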

citation (Take 1)

Key citation
Annotator CELLxGENE Discover
Value str. "Dataset Version: permanent url to the dataset version curated and distributed by CZ CELLxGENE Discover in Collection: permanent url to collection". For example:

"Dataset Version: https://datasets.cellxgene.cziscience.com/dbd8b789-3efa-4a63-9243-90cff64f2045.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/e5f58829-1a66-40b5-a624-9046778e74f5"



Appendix A. Changelog

schema v4.0.0


Context

A citation acknowledges our data contributors, and allows us to track the use and impact of our data and service.

Also see Reproducibility of results used for research obtained through CELLxGENE Discover.

Potential Citation Elements

Potential elements for a citation (emphasis documents preferences from earlier triages):

  1. permanent url to the collection
  2. [Collection] Title
  3. [Collection] Consortia
  4. [Collection] Publication DOI - @jychien - "Is there an alternative when a DOI is not present?" (Note, there are only 6 collections without a DOI)
  5. [Dataset] Title
  6. "curated and distributed by CZ CELLxGENE Discover at" permanent URL to this version of the dataset

My perspective is that the citation should contain only immutable elements. Otherwise, revisions to collection metadata (for example, adding a DOI or changing a collection name) would require the datasets in the collection to be re-processed, which includes RDS and CXG conversions.

Prototyping

Immutable elements
Dataset Version: https://datasets.cellxgene.cziscience.com/dbd8b789-3efa-4a63-9243-90cff64f2045.h5ad (or .rds) curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/e5f58829-1a66-40b5-a624-9046778e74f5

Publication DOI
doi:10.1126/science.abl4896 : curated and distributed by CZ CELLxGENE Discover at https://datasets.cellxgene.cziscience.com/<version id>.h5ad (or .rds)

No Publication
curated and distributed by CZ CELLxGENE Discover at https://datasets.cellxgene.cziscience.com/<version id>.h5ad (or .rds)

More collection metadata
Tabula Sapiens : Tabula Sapiens - All Cells, doi:10.1126/science.abl4896, [CZ Biohub, CZI Single-Cell Biology], curated and distributed by CZ CELLxGENE Discover at https://datasets.cellxgene.cziscience.com/<version id>.h5ad (or .rds)


pablo-gar commented 1 year ago
  1. Permanent URL to this version of the dataset

@brianraymor is this possible? Won't the permanent URL be created only after ingestion, once the H5AD has been validated and labels have been appended?

brianraymor commented 1 year ago

@danieljhegeman assured me that it was possible, or maybe he meant Possimpible. It's a permanent URL to the version of the dataset.

danieljhegeman commented 1 year ago

@pablo-gar @brianraymor correct! All Dataset download links are for Dataset versions, i.e., they are completely static.

ambrosejcarr commented 1 year ago

I like the user story:

A citation acknowledges our data contributors, and allows us to track the use and impact of our data and service.

Unless doi updates are very expensive, to optimally acknowledge our data contributors, I think we should include publication doi in the citation, when we have it.

Social pressures will motivate scientists using CELLxGENE data to cite their colleagues. I believe that if we include the doi in the format, when it's present, we get closer to citing the work of the scientist and more scientists will use our citation.

brianraymor commented 1 year ago

Unless doi updates are very expensive

If the Publication DOI was added or changed AFTER datasets are uploaded, then these datasets in the collection would need to be re-processed in some manner to inject the DOI into their citation. I will review costs with @nayib-jose-gloria.

pablo-gar commented 1 year ago

@pablo-gar @brianraymor correct! All Dataset download links are for Dataset versions, i.e., they are completely static.

@danieljhegeman I'm wondering about the sequence of events here. I understand the links can be created and they currently exist. But for them to be automatically added to the h5ad by the validator, they need to exist before the validator is run at ingestion time.

pablo-gar commented 1 year ago

My strong preference is the prototype "More collection metadata", for citation the more the better imo.

brianraymor commented 1 year ago

for citation the more the better imo.

@pablo-gar - have you modeled what the collection of citations might look like for a slice of the census if this was the case?

I missed an opportunity to name "More collection metadata" as "in for a penny, in for a pound" ;-)

jychien commented 1 year ago

It is typically the case that the doi is not available at the beginning of wrangling, and at times it is not added until after the Collection has been made public. This is because the majority of people who want their data uploaded do so in order that reviewers for publications can have access to the data.

brianraymor commented 1 year ago

Thanks for confirming, @jychien. And there are also cases where the preprint DOI(s) are updated later to publication DOI(s).

pablo-gar commented 1 year ago

The citation as encoded in the h5ad will be added to the census datasets data frame. A user can connect any column of that data frame to any slice of the census via dataset_id which is in both datasets and obs data frames -- thus obtaining the citation for each cell (as well as dataset/collection title, doi, etc).

I'm planning on at least providing a notebook that shows the process of creating a concatenated string or list of citations for a slice. And at most we could provide some sugar API to automate this process, but I see the latter as low priority.
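A minimal sketch of that lookup, using plain Python dictionaries in place of the census data frames so that it runs without any dependencies. The row contents below are toy values, and `citations_for_slice` is a hypothetical name; in the real census the same join would be done on the `dataset_id` column shared by the datasets and obs data frames.

```python
def citations_for_slice(obs_rows, datasets_rows):
    """Return the unique citations for the cells in a slice, in first-seen order.

    obs_rows: iterable of dicts, one per cell, each with a "dataset_id".
    datasets_rows: iterable of dicts, one per dataset, each with
        "dataset_id" and "citation".
    """
    # Index the datasets table by dataset_id for constant-time lookups.
    citation_by_dataset = {row["dataset_id"]: row["citation"] for row in datasets_rows}

    seen = set()
    citations = []
    for cell in obs_rows:
        dataset_id = cell["dataset_id"]
        if dataset_id not in seen:
            seen.add(dataset_id)
            citations.append(citation_by_dataset[dataset_id])
    return citations


# Toy stand-ins for the datasets and obs data frames (illustrative values only).
datasets_rows = [
    {"dataset_id": "ds-a", "citation": "Dataset Version: <url a> ..."},
    {"dataset_id": "ds-b", "citation": "Dataset Version: <url b> ..."},
    {"dataset_id": "ds-c", "citation": "Dataset Version: <url c> ..."},
]
obs_rows = [
    {"cell": 0, "dataset_id": "ds-b"},
    {"cell": 1, "dataset_id": "ds-a"},
    {"cell": 2, "dataset_id": "ds-b"},
]

slice_citations = citations_for_slice(obs_rows, datasets_rows)
print("\n".join(slice_citations))
```

Joining the resulting list with newlines gives the concatenated string for the slice; datasets that contribute no cells to the slice are left out.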

jahilton commented 1 year ago

I think we should include publication doi in the citation, when we have it.

nayib-jose-gloria commented 1 year ago

@pablo-gar @brianraymor correct! All Dataset download links are for Dataset versions, i.e., they are completely static.

@danieljhegeman I'm wondering about the sequence of events here. I understand the links can be created and they currently exist. But for them to be automatically added to the h5ad by the validator, they need to exist before the validator is run at ingestion time.

@pablo-gar We currently create an empty dataset version object and generate the dataset version ID before we validate a dataset, in order to track processing status. So at the post-validation / write-labels step we can determine what the download link will be after processing and set it. By the time a user sees it when downloading an h5ad, the dataset will have finished processing and it will be a valid link to that dataset version.

We will have to make a small change to pass the version ID into the validation + write-labels command. But otherwise it should be fine
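The ordering can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual pipeline: the function names, the `status` field, and the use of a UUID for the version ID are hypothetical, and validation itself is omitted. Only the download URL shape is taken from the examples in this issue.

```python
import uuid

# URL shape taken from the example citations in this issue.
DATASET_BASE_URL = "https://datasets.cellxgene.cziscience.com"


def create_dataset_version() -> dict:
    """Create an empty dataset version record before validation (hypothetical)."""
    return {"version_id": str(uuid.uuid4()), "status": "PENDING"}


def expected_download_url(version_id: str, extension: str = "h5ad") -> str:
    """Derive the download link the artifact will have once processing finishes."""
    return f"{DATASET_BASE_URL}/{version_id}.{extension}"


def write_labels(uns: dict, version_id: str, collection_url: str) -> dict:
    """Stand-in for the post-validation write-labels step: set uns['citation']."""
    labeled = dict(uns)  # copy so the caller's mapping is not mutated
    labeled["citation"] = (
        f"Dataset Version: {expected_download_url(version_id)} "
        f"curated and distributed by CZ CELLxGENE Discover in Collection: {collection_url}"
    )
    return labeled


# 1. The version ID exists before validation starts.
version = create_dataset_version()
# 2. The ID is passed into the write-labels step, so the link can be set
#    even though the artifact has not been published yet.
uns = write_labels(
    {},
    version["version_id"],
    "https://cellxgene.cziscience.com/collections/e5f58829-1a66-40b5-a624-9046778e74f5",
)
print(uns["citation"])
```

The point of the sketch is only that the link is a pure function of the version ID, so it can be written into the file before the file is downloadable at that link.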

nayib-jose-gloria commented 1 year ago

Unless doi updates are very expensive

If the Publication DOI was added or changed AFTER datasets are uploaded, then these datasets in the collection would need to be re-processed in some manner to inject the DOI into their citation. I will review costs with @nayib-jose-gloria.

There's a way to do this such that it's slightly cheaper (in terms of AWS costs) and faster than a typical dataset reprocessing. So whether it's worth doing depends on how often we expect the collection metadata found in the citation to change. If it's about as often as we expect a collection's datasets to need changes via a typical revision process, then we should go for it. If it's significantly more often, we can dig deeper into estimated costs to determine if it's worth it. It would take a few weeks of dev time to add the additional logic / AWS infrastructure needed.

Another note--if we are ok with eventual consistency (~up to 1 month delay), we would pick up any collection metadata changes as part of minor schema update migrations and get them for "free" as part of that monthly process. We can update the citation in the API/database instantly, and then update the h5ads/rds afterwards as part of that process.
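To make the eventual-consistency idea concrete, here is a minimal sketch of how a periodic migration could decide which files to rewrite. The record fields `database_citation` and `artifact_citation` and the function name are hypothetical; the sketch only assumes that the database holds the up-to-date citation while the h5ad/rds files may lag behind it.

```python
def datasets_needing_refresh(records):
    """Return version IDs whose stored file citation lags the database citation."""
    return [
        record["dataset_version_id"]
        for record in records
        if record["artifact_citation"] != record["database_citation"]
    ]


# Illustrative records: the second dataset gained a DOI after upload, so the
# database citation was updated instantly but the file still has the old one.
records = [
    {
        "dataset_version_id": "v1",
        "database_citation": "Dataset Version: <url 1> ...",
        "artifact_citation": "Dataset Version: <url 1> ...",
    },
    {
        "dataset_version_id": "v2",
        "database_citation": "Publication: <doi> Dataset Version: <url 2> ...",
        "artifact_citation": "Dataset Version: <url 2> ...",
    },
]

stale = datasets_needing_refresh(records)
print(stale)
```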

ambrosejcarr commented 1 year ago

My strong preference is the prototype "More collection metadata", for citation the more the better imo.

I didn't feel the "More collection metadata" prototype was useful, because there is a very good chance that the additional information would have discrepancies from the paper citation and could create confusion. If adding more information, I would try to get it from crossref via the doi.

Could you explain why you have a strong preference for the study title?

brianraymor commented 1 year ago

@ambrosejcarr @jahilton

To resist it is useless, it is useless to resist it

How can I resist the two of you when you agree? See Take 2 in the top-level summary comment.

ambrosejcarr commented 1 year ago

@ambrosejcarr @jahilton

To resist it is useless, it is useless to resist it

How can I resist the two of you when you agree? See Take 2 in the top-level summary comment.

Love it. 👍

jahilton commented 1 year ago

LGTM