HumanCellAtlas / dcp

Data Coordination Platform manifest and integration tests.

Ensure that projects can be cited, for example in publications #226

Open tburdett opened 5 years ago

tburdett commented 5 years ago

November 19 Update: The remaining Phase 1 Data Browser tickets will complete in Q4M3. The basic "add a citation link" functionality is merged but there was no promotion to prod this week. The other 2 tickets cover a re-design of the Project Details page to accommodate a separate citation/attribution section, requiring an additional "tab" design that was not originally foreseen.

Phase 1 - Stable non-versioned citable project URLs

See details in https://github.com/HumanCellAtlas/dcp-community/blob/master/rfcs/text/0014-data-citation-plan.md

The description below is obsolete

NOTE: we need to scope this with the OC since we don't want to over-engineer this for Q2 specifically. Options:

Overview: Data consumers and contributors have the ability to cite data in the DCP and have their own data cited by others.

User Story: As a researcher with a keyboard I wish to cite the data I use, and as a researcher with a pipette I wish to understand how the data I submit can be cited.

Delivery at the end of Q2: The portal team ensures that projects (entities with URLs like https://staging.data.humancellatlas.org/explore/projects/bf4c505a-1f32-40e3-8a29-a19a94c6dabe) 1) use stable URLs and 2) have a DOI associated with those URLs. The DOIs support versioning, so by pointing to the base DOI for that project, the user can see moment-in-time snapshots (via specific sub-DOIs) of the project, made at intervals TBD. If they simply want to refer to the overall project via a DOI, they can do so using the base DOI (see below for more details).
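
To make the base DOI / sub-DOI idea concrete, here is a minimal sketch of how a project's snapshots might be tracked. Everything in it is an assumption for illustration: the `CitableProject` class, the `.vN` suffix scheme for sub-DOIs, and the `10.5555/hca.example` identifier are hypothetical and do not describe any DOI service the DCP has chosen.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional


@dataclass
class CitableProject:
    """Hypothetical record linking one base DOI to dated snapshot sub-DOIs."""

    base_doi: str  # assumed identifier for the project as a whole
    snapshots: Dict[date, str] = field(default_factory=dict)

    def add_snapshot(self, taken_on: date) -> str:
        """Register a moment-in-time snapshot and return its sub-DOI."""
        if taken_on in self.snapshots:
            raise ValueError(f"snapshot already exists for {taken_on}")
        # Assumed naming scheme: base DOI plus a ".vN" version suffix.
        sub_doi = f"{self.base_doi}.v{len(self.snapshots) + 1}"
        self.snapshots[taken_on] = sub_doi
        return sub_doi

    def doi_as_of(self, when: date) -> Optional[str]:
        """Return the latest snapshot sub-DOI taken on or before `when`."""
        eligible = [d for d in self.snapshots if d <= when]
        return self.snapshots[max(eligible)] if eligible else None


project = CitableProject(base_doi="10.5555/hca.example")  # hypothetical DOI
project.add_snapshot(date(2019, 6, 30))
project.add_snapshot(date(2019, 9, 30))
print(project.doi_as_of(date(2019, 8, 1)))
```

A citation of the overall project would use `base_doi`, while a citation that must stay reproducible would use the sub-DOI returned by `doi_as_of` for the date the data was accessed.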

Evaluation Plan - how this will be evaluated at the end of the quarter: There is a tutorial (on or pointed to from our “Guides” portal section) from the content team on our support for project citations via DOIs.

Needs from Responsible Groups:

Note: I confirmed the proposal above with @kozbo and @theathorn on 20190417 and they have linked or will link appropriate Epics from across their teams (and ask other teams to do the same) to this theme epic via ZenHub to track the work to be done in Q2.

For more information see: https://docs.google.com/spreadsheets/d/1iAL2JR3ndgMmYwojUU0kU8Pm7B7ycRLt_973seREd3M


About: Tony's original comments on this Epic are below. Above, I'm following a template for the theme epics we're using as part of Q2 planning and beyond.

MVP User Story

MVP Implementation Notes

Post MVP User Story

Post MVP Implementation Notes

theathorn commented 5 years ago

Asking for DOI (may need DOI service). Scientists need this for GA.

morrisonnorman commented 5 years ago

@theathorn This task should not be owned by ingest as it is related to stable data consumption. Who should take this on?

theathorn commented 5 years ago

See doc from @gabsie : https://docs.google.com/document/d/1eM80EGe3T4VTU5hyBUKCGdN17k_MMxsRcWwHBB-54n4/edit#heading=h.wgzkwbvrtz50

theathorn commented 5 years ago

Input from @benedictpaten :

theathorn commented 5 years ago

@briandoconnor Please add your input on MVP requirements.

tburdett commented 5 years ago

As far as I can see, there's nothing for the Ingestion Service to do here, besides assigning UUIDs to a project. @theathorn can you ping me if there's more you need?

lauraclarke commented 5 years ago

There has been a lot of discussion on this ticket about DOI assignment and data immutability

https://github.com/HumanCellAtlas/data-browser/issues/550#issuecomment-484143311

It would be good to consider this when defining the first iteration.

I will ask the question here that I asked there. It isn't clear what value introducing the collection service into this first iteration has. It seems to add delays and misdirection to a project being citable rather than value. The collection service allowing users to create arbitrary sets of data and share them is a fantastic plan, but it seems orthogonal to project-level citation.

lauraclarke commented 5 years ago

Another important issue to note, and where it would be useful to have feedback from @briandoconnor and @theathorn on the Updates epic

https://github.com/HumanCellAtlas/dcp/issues/222

Because updates that change the experimental design representation in the data store will still fall back to the old exclude-and-reingest-from-scratch update mechanism, we can't yet commit to having entirely stable project UUIDs: if major updates are needed, the only solution this quarter is likely to be delete and reingest.

We should discuss what the best mitigation strategy is to minimize consumer confusion when this inevitably does happen.

theathorn commented 5 years ago

MVP for Q2:

lauraclarke commented 5 years ago

Wait a minute, why are we using Zenodo for DOI assignment? That sounds like a consequential decision which needs discussing at management level (is this tech-arch or PM, not sure).

lauraclarke commented 5 years ago

I did a bit of digging and found a DOI assignment process we used for a former project, Blueprint, which used the EBI service to generate DOIs. It is a very simple process, though I suspect there are subtleties to discuss.

DOI_instructs.pdf

Very happy to start conversations with our literature services team about our plans to see if this would be a suitable solution. If DOI assignment is in scope for this quarter, it feels better to use a service which is much closer to one of the collaborating institutions than an entirely third-party service.

lauraclarke commented 5 years ago

Thinking about this more, it would seem a good idea to discuss this at PM/Tech Arch level and decide if we want to use someone else's service for this at all, or if it would be better for the HCA to become an authority that can assign DOIs itself.

I haven't read the Crossref membership terms in detail but this should be discussed.

https://www.crossref.org/membership/

theathorn commented 5 years ago

Also see https://www.nature.com/articles/sdata201829.

theathorn commented 5 years ago

Moving to Q3 following email from @lauraclarke:

During the Tokyo meeting Norman brought up very good points about whether we should be using BioStudies DOIs or not.

I don't think that question was resolved so the process of figuring out how to assign them hasn't started either.

theathorn commented 5 years ago

Discussed in backlog refinement meeting: Spike ticket is to more precisely define the requirements to determine if an acceptable MVP can be defined in the absence of a full AUDR implementation, without which any project that is cited may be subject to change.

briandoconnor commented 5 years ago

Retrospective on 20190919

This ticket presented 3 different options for how we use DOIs to cite projects.

Current State

The portal team confirmed our UUIDs for projects are stable and won't change (coming from ingest). The portal team then rolled out our production site data.humancellatlas.org on Aug 2nd to allow users to cite URLs for projects that won't change. For example:

https://data.humancellatlas.org/explore/projects/74b6d569-3b11-42ef-b6b1-a0454522b4a0
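
As a rough illustration of what "stable, non-versioned" means here, the citable URL depends only on the project UUID assigned at ingest. The sketch below assumes the URL pattern shown in the example above; the helper name `citable_project_url` is hypothetical and not part of any DCP codebase.

```python
import uuid

# Assumption: base path copied from the example project URL in this comment.
PROJECT_URL_BASE = "https://data.humancellatlas.org/explore/projects/"


def citable_project_url(project_uuid: str) -> str:
    """Build the non-versioned project URL for a project UUID.

    Raises ValueError if the argument is not a well-formed UUID.
    """
    # uuid.UUID validates the string; str() yields the canonical
    # lowercase, hyphenated form, so equivalent inputs map to one URL.
    normalized = str(uuid.UUID(project_uuid))
    return PROJECT_URL_BASE + normalized


print(citable_project_url("74b6d569-3b11-42ef-b6b1-a0454522b4a0"))
```

Because the URL carries no version component, it keeps resolving as long as the project UUID itself is never reassigned, which is the guarantee the portal team confirmed with ingest.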

We didn't reach consensus on actually issuing DOIs. Early discussions leaned towards having DOIs for projects issued during ingest by the EBI BioStudies system. See #560. We thought we had consensus for this approach and the portal team was prepared to display these DOIs on the project page.

However, the conversation shifted to whether or not this is sufficient. Some felt that a DOI for a project whose underlying data can change is not good, and that we needed to cite specific versions of data. Without a release process it wasn't clear what specific file versions would be cited.

Conversation then shifted to writing an RFC via ticket #424 https://github.com/HumanCellAtlas/dcp-community/pull/103, which was opened on Aug 9th.

This RFC looked at a three phase plan:

Once the RFC is closed the portal team can implement at least phase 1 and phase 3 without dependencies on other teams.

Lessons Learned

diekhans commented 5 years ago

I am not convinced we need a DOI for a project, depending on how we define project. If it is like an NCBI BioProject, then this isn't something that is normally cited. Without DOIs of the actual data, the project may be used instead, which will be unFAIR when a project produces multiple experiments or updates an experiment.

I propose dropping project citations and adding "experiment datasets" with versioned DOIs. This means accelerating work on "experiment dataset"; however, I believe this maps nicely to the current ingest submission model and allows a producer to have citable data without waiting on a release.

lauraclarke commented 5 years ago

I didn't think the definition of a project in this context was in question. It is the DCP metadata entity project which gets created when a submission is made and can be updated on subsequent submissions.

Why shouldn't those be citable or have DOIs associated with them?

diekhans commented 5 years ago

@lauraclarke Is this definition of a project actually documented anywhere? Defining a project as an updatable submission makes sense to me, but it is the first I have ever heard of it. If one is familiar with the NCBI terminology, it is not intuitive.

So you misunderstand my comment. If a project represents a set of data, the DOI needs to be versioned, otherwise, it is not a FAIR citation. The state of a project may radically vary between the time the citation was created and the present day.

If a project means something like a BioProject, then I don't understand why it should be cited.

The terms don't matter so much as a rigorous definition of them.

lauraclarke commented 5 years ago

We define the types of metadata that are associated with a project, but we don't explicitly define its role as containing all other metadata entities, or if we do I can't find that anywhere.

Our project is like the BioProject primary submission project type https://www.ncbi.nlm.nih.gov/bioproject/docs/faq/#what-is-project-type

We don't support umbrella projects as currently, we don't have a need.

That said, with funded initiatives like the seed networks, WSSS and the H2020 awards we might see that need arise in the future.

As I said to @NoopDog in the project completion RFC I want to avoid overloading the word submission here.

A submission is an event which writes data to the ingest database/datastore. One project may be made up of multiple submission events (at least that is the plan).

Submission in the ingest context is a verb, not a noun.