Closed. brianraymor closed this 1 year ago
- Permanent URL to this version of the dataset
@brianraymor is this possible? The permanent URL will be created after ingestion and once the H5AD has been validated and labels have been appended?
@danieljhegeman assured me that it was possible, or maybe he meant Possimpible. It's a permanent URL to the version of the dataset.
@pablo-gar @brianraymor correct! All Dataset download links are for Dataset versions, i.e., they are completely static.
I like the user story:
A citation acknowledges our data contributors, and allows us to track the use and impact of our data and service.
Unless doi updates are very expensive, I think we should include the publication doi in the citation, when we have it, to optimally acknowledge our data contributors.
Social pressures will motivate scientists using CELLxGENE data to cite their colleagues. I believe that if we include the doi in the format, when it's present, we get closer to citing the work of the scientist and more scientists will use our citation.
Unless doi updates are very expensive
If the Publication DOI was added or changed AFTER datasets are uploaded, then these datasets in the collection would need to be re-processed in some manner to inject the DOI into their citation. I will review costs with @nayib-jose-gloria.
@pablo-gar @brianraymor correct! All Dataset download links are for Dataset versions, i.e., they are completely static.
@danieljhegeman I'm wondering about the sequence of events here. I understand the links can be created and they currently exist. But for them to be automatically added to the h5ad by the validator, they need to exist before the validator is run at ingestion time.
My strong preference is the prototype "More collection metadata", for citation the more the better imo.
for citation the more the better imo.
@pablo-gar - have you modeled what the collection of citations might look like for a slice of the census if this was the case?
I missed an opportunity to name "More collection metadata" as "in for a penny, in for a pound" ;-)
It is typically the case that the doi is not available at the beginning of wrangling, and at times it is not added until after the Collection has been made public. This is because most people want their data uploaded so that reviewers for publications can have access to it.
Thanks for confirming, @jychien. And there are also cases where the preprint DOI(s) are updated later to publication DOI(s).
The citation as encoded in the h5ad will be added to the census datasets data frame. A user can connect any column of that data frame to any slice of the census via dataset_id, which is in both the datasets and obs data frames, thus obtaining the citation for each cell (as well as dataset/collection title, doi, etc.).
I'm planning on at least providing a notebook that shows the process of creating a concatenated string or list of citations for a slice. And at most we could provide some sugar API to automate this process, but I see the latter as low priority.
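A minimal sketch of what that notebook step could look like, assuming the census datasets and obs data frames have already been loaded into pandas. The toy frames, the placeholder citation strings, and the helper name citations_for_slice are hypothetical; only the dataset_id join key and the citation column come from the comment above.

```python
import pandas as pd

# Toy stand-ins for the census data frames (all values are made up).
datasets = pd.DataFrame(
    {
        "dataset_id": ["d1", "d2", "d3"],
        "citation": ["Citation for d1", "Citation for d2", "Citation for d3"],
    }
)
obs = pd.DataFrame(
    {
        "cell": ["c1", "c2", "c3", "c4"],
        "dataset_id": ["d1", "d1", "d3", "d1"],
    }
)


def citations_for_slice(obs_slice: pd.DataFrame, datasets: pd.DataFrame) -> list[str]:
    """Return the unique citations for the datasets present in an obs slice."""
    ids = obs_slice["dataset_id"].unique()
    matched = datasets[datasets["dataset_id"].isin(ids)]
    return matched["citation"].drop_duplicates().tolist()


# List of citations covering the whole slice.
slice_citations = citations_for_slice(obs, datasets)

# Per-cell citation: left-join obs to datasets on dataset_id.
per_cell = obs.merge(datasets[["dataset_id", "citation"]], on="dataset_id", how="left")
```

Joining slice_citations with a newline would give the concatenated string form; the left join keeps one row per cell in the original obs order.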
I think we should include the publication doi in the citation, when we have it.
➕
@pablo-gar @brianraymor correct! All Dataset download links are for Dataset versions, i.e., they are completely static.
@danieljhegeman I'm wondering about the sequence of events here. I understand the links can be created and they currently exist. But for them to be automatically added to the h5ad by the validator, they need to exist before the validator is run at ingestion time.
@pablo-gar We currently create an empty dataset version object and generate the dataset version ID before we validate a dataset, in order to track processing status. So at the post-validation / write-labels step we can determine what the download link will be after processing and set it. By the time a user sees it when downloading an h5ad, the dataset will have finished processing and it will be a valid link to that dataset version.
We will have to make a small change to pass the version ID into the validation + write-labels command. Otherwise it should be fine.
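The ordering described above can be sketched roughly as follows. This is an illustration only: the helper names are hypothetical, and generating the version ID as a UUID is an assumption based on the shape of the example links in this issue. The point is that the download link is a pure function of the version ID, so it can be computed before processing finishes.

```python
import uuid

# Base URL taken from the example links in this issue.
DATASETS_BASE_URL = "https://datasets.cellxgene.cziscience.com"


def new_dataset_version_id() -> str:
    """Generate a version ID up front, before validation (hypothetical helper)."""
    return str(uuid.uuid4())


def download_url(version_id: str, extension: str = "h5ad") -> str:
    """Derive the permanent download link from the version ID alone."""
    return f"{DATASETS_BASE_URL}/{version_id}.{extension}"


# 1. Create the (empty) dataset version and its ID before validating.
version_id = new_dataset_version_id()

# 2. At the write-labels step the link is already known, even though the
#    file it points to will only exist once processing has finished.
h5ad_url = download_url(version_id)
rds_url = download_url(version_id, "rds")
```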
Unless doi updates are very expensive
If the Publication DOI was added or changed AFTER datasets are uploaded, then these datasets in the collection would need to be re-processed in some manner to inject the DOI into their citation. I will review costs with @nayib-jose-gloria.
There's a way to do this such that it's slightly cheaper (in terms of AWS costs) and faster than a typical dataset reprocessing. So whether it's worth doing depends on how often we expect the collection metadata found in the citation to change. If it's about as often as we expect a collection's datasets to need changes via a typical revision process, then we should go for it. If it's significantly more often, we can dig deeper into estimated costs to determine if it's worth it. It would take a few weeks of dev time to add the additional logic / aws infrastructure needed.
Another note--if we are ok with eventual consistency (~up to 1 month delay), we would pick up any collection metadata changes as part of minor schema update migrations and get them for "free" as part of that monthly process. We can update the citation in the API/database instantly, and then update the h5ads/rds afterwards as part of that process.
My strong preference is the prototype "More collection metadata", for citation the more the better imo.
I didn't feel the "More collection metadata" prototype was useful, because there is a very good chance that the additional information would have discrepancies from the paper citation and could create confusion. If adding more information, I would try to get it from Crossref via the doi.
Could you explain why you have a strong preference for the study title?
@ambrosejcarr @jahilton
To resist it is useless, it is useless to resist it
How can I resist the two of you when you agree? See Take 2 in the top-level summary comment.
Love it. 👍
LGTM
Design
uns (Dataset Metadata)...
When a dataset is uploaded, CELLxGENE Discover MUST automatically set its citation for data reuse. Curators MUST NOT annotate the following column.
citation (Take 2)
str. For example: "Publication: https://doi.org/10.1126/science.abl4896 Dataset Version: https://datasets.cellxgene.cziscience.com/dbd8b789-3efa-4a63-9243-90cff64f2045.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/e5f58829-1a66-40b5-a624-9046778e74f5"
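Template for citations:
Publication: publication DOI url. This element MUST be present if the collection includes a publication DOI; otherwise, it MUST NOT be present.
Dataset Version: permanent url to the dataset version curated and distributed by CZ CELLxGENE Discover in Collection: permanent url to collection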
citation (Take 1)
str. "Dataset Version: permanent url to the dataset version curated and distributed by CZ CELLxGENE Discover in Collection: permanent url to collection". For example: "Dataset Version: https://datasets.cellxgene.cziscience.com/dbd8b789-3efa-4a63-9243-90cff64f2045.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/e5f58829-1a66-40b5-a624-9046778e74f5"
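As an illustration of the two takes, here is a minimal sketch of how the citation string could be assembled. The function name build_citation and its parameters are hypothetical and not part of the schema or of any CELLxGENE code. It assumes the elements are joined by single spaces, as in the examples above, and it includes the Publication element only when a publication DOI is available.

```python
from typing import Optional


def build_citation(
    dataset_version_url: str,
    collection_url: str,
    publication_doi_url: Optional[str] = None,
) -> str:
    """Assemble a citation string in the Take 2 format (hypothetical helper)."""
    parts = []
    # The Publication element is present only if the collection has a DOI.
    if publication_doi_url:
        parts.append(f"Publication: {publication_doi_url}")
    parts.append(
        f"Dataset Version: {dataset_version_url} "
        f"curated and distributed by CZ CELLxGENE Discover "
        f"in Collection: {collection_url}"
    )
    return " ".join(parts)


# URLs taken from the examples in this issue.
citation = build_citation(
    "https://datasets.cellxgene.cziscience.com/dbd8b789-3efa-4a63-9243-90cff64f2045.h5ad",
    "https://cellxgene.cziscience.com/collections/e5f58829-1a66-40b5-a624-9046778e74f5",
    publication_doi_url="https://doi.org/10.1126/science.abl4896",
)
```

Calling it without publication_doi_url yields the Take 1 string, so under these assumptions Take 2 is a strict superset of Take 1.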
Appendix A. Changelog
schema v4.0.0
citation
Context
A citation acknowledges our data contributors, and allows us to track the use and impact of our data and service.
Also see Reproducibility of results used for research obtained through CELLxGENE Discover.
Potential Citation Elements
Potential elements for a citation (emphasis documents preferences from earlier triages):
My perspective is that the citation should only contain immutable elements. Otherwise, revisions to collection metadata (for example, adding a DOI or changing a collection name) would require the datasets in the collection to be re-processed, which includes RDS and CXG conversions.
Prototyping
Immutable elements
Dataset Version: https://datasets.cellxgene.cziscience.com/dbd8b789-3efa-4a63-9243-90cff64f2045.h5ad (or .rds) curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/e5f58829-1a66-40b5-a624-9046778e74f5
Publication DOI
doi:10.1126/science.abl4896: curated and distributed by CZ CELLxGENE Discover at https://datasets.cellxgene.cziscience.com/<version id>.h5ad (or .rds)
No Publication
curated and distributed by CZ CELLxGENE Discover at https://datasets.cellxgene.cziscience.com/<version id>.h5ad (or .rds)
More collection metadata
Tabula Sapiens: Tabula Sapiens - All Cells, doi:10.1126/science.abl4896, [CZ Biohub, CZI Single-Cell Biology], curated and distributed by CZ CELLxGENE Discover at https://datasets.cellxgene.cziscience.com/<version id>.h5ad (or .rds)