ga4gh / TASC

TASC aids the harmonisation of aspects of GA4GH's various products that would otherwise prevent different products from being used together conveniently.
https://www.ga4gh.org

Citing GA4GH standards #39

Open susanfairley opened 2 years ago

susanfairley commented 2 years ago

This issue has been raised recently in two Work Streams.

LSG have discussed the issue here: https://github.com/samtools/hts-specs/issues/179

In addition, Discovery have had this question in relation to citation of the Data Connect standard in the time period before a publication can be released.

As additional background, GA4GH is redeveloping its website, which could theoretically play a role in some of the possible solutions to this.

It would be useful for TASC to investigate and determine an approach that can, ideally, be applied consistently across GA4GH.

ianfore commented 2 years ago

Do we see the citing use case as different from the need to reference a standard from within data?

The identifiers discussion of recent years (identifiers.org, n2t.net, bioregistry.io) has recognized much in common to these use cases and a common approach to them.

A specific relevance to GA4GH is how GA4GH standards would be referenced from the service registry and service-info.

Most specifically, the type field in service-info references a "Type of a GA4GH service". Most of the following service-info response is specific to the implementation; the type field cites the service being used.

{
    "id": "drs.starterkit.federatedgenomics.org",
    "name": "Federated Genomics DRS service",
    "description": "Data Repository Service (DRS) instance serving public genomics datasets. Deployment of the GA4GH Starter Kit.",
    "contactUrl": "mailto:nobody@federatedgenomics.org",
    "documentationUrl": "https://apidocs.federatedgenomics.org/drs",
    "createdAt": "2022-07-10T09:00:00Z",
    "updatedAt": "2022-07-10T09:00:00Z",
    "environment": "development",
    "version": "1.0.0",
    "type": {
        "group": "org.ga4gh",
        "artifact": "drs",
        "version": "1.1.0"
    },
    "organization": {
        "name": "Federated Genomics",
        "url": "https://this-is-not-a-site.federatedgenomics.org"
    }
}

Following identifier practices using compact identifiers (CURIEs), the following approach may be useful: ga4gh:drs/1.1.0

Use of the ga4gh namespace (#16) for GA4GH standards seems an appropriate use of the namespace. It can likely co-exist with the VRS use of the namespace, which indicates the VRS ids by type as part of the identifier.
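As a rough illustration of the idea above, the sketch below derives a compact identifier of the form ga4gh:drs/1.1.0 from the `type` object of a service-info response and parses it back. The function names and the regular expression are assumptions made for this example; nothing here is defined by the service-info specification or by any registered ga4gh namespace rules.

```python
import re

# Hypothetical pattern for the proposed form "ga4gh:<artifact>/<version>".
# The allowed characters are an assumption for illustration only.
_CURIE_PATTERN = re.compile(
    r"^ga4gh:(?P<artifact>[a-z0-9-]+)/(?P<version>\d+\.\d+\.\d+)$"
)


def type_to_curie(service_type: dict) -> str:
    """Build a compact identifier from a service-info `type` object."""
    if service_type["group"] != "org.ga4gh":
        raise ValueError("only org.ga4gh types map to the ga4gh namespace")
    return f"ga4gh:{service_type['artifact']}/{service_type['version']}"


def curie_to_type(curie: str) -> dict:
    """Parse the compact identifier back into a service-info `type` object."""
    match = _CURIE_PATTERN.match(curie)
    if match is None:
        raise ValueError(f"not a ga4gh standard identifier: {curie!r}")
    return {
        "group": "org.ga4gh",
        "artifact": match["artifact"],
        "version": match["version"],
    }


# The `type` object from the service-info response shown earlier.
drs_type = {"group": "org.ga4gh", "artifact": "drs", "version": "1.1.0"}
print(type_to_curie(drs_type))
```

The round trip only works if artifact names never contain a slash, which is one of the things a namespace policy would need to pin down.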

michaelmhoffman commented 2 years ago

I see the use cases as distinct—I see citation as being used in documents such as journal articles that have primarily human readers. DOIs (in the form of a URL starting with https://doi.org/) are the most advantageous identifier for this. Non-DOI URLs could work. Anything else is going to be used in all sorts of systems, where first-class support for arbitrary identifier schemes is never happening. Even citing standards from extremely well-known organizations such as ISO is awkward compared to a DOI.

ajhpage commented 2 years ago

We do not currently have a consistent method for citing standards in journal articles (but I'd definitely be in favor of having one!). I think the motivation for writing a paper has often been exactly that: to create a citable reference. There was previously discussion about using the DOI approach and I have a sneaking suspicion that @mcourtot may know more about that and why it didn't turn into reality. I think the most common reference used has been the URL of the documentation for the standard in question, and that has been acceptable to editors. Here are two recent examples: https://doi.org/10.1093/bioinformatics/btab524 https://doi.org/10.1093/bioinformatics/btac010

mcourtot commented 2 years ago

We struggled with this quite a bit for DUO until the paper was published, as there was a related project which had a publication available, and this was cited by default, even though we had specific instructions for citation in the DUO repository using a PURL. I like Zenodo for specifications: it supports versioning and provides a DOI. Maybe the GA4GH technical team would be willing to drive this, and then add CFF files to all GA4GH repos? At least we would have consistency in representing how the spec authors would like it to be cited, and setting this as a shared expectation may drive adoption?
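To make the CFF suggestion concrete, here is a minimal sketch that renders the text of a CITATION.cff file using the keys required by Citation File Format 1.2.0 (cff-version, message, title, authors) plus the optional version, doi and date-released keys. The helper name `render_cff` is made up for this example, and the title, author and DOI in the usage lines are placeholders, not real records.

```python
def render_cff(title, version, doi, date_released, authors):
    """Render a minimal CITATION.cff as text.

    `authors` is a list of (family_names, given_names) tuples.
    Values are inserted as-is, so they must not contain double quotes.
    """
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this specification, please cite it as below."',
        f'title: "{title}"',
        f'version: "{version}"',
        f'doi: "{doi}"',
        f"date-released: {date_released}",  # expected as YYYY-MM-DD
        "authors:",
    ]
    for family_names, given_names in authors:
        lines.append(f'  - family-names: "{family_names}"')
        lines.append(f'    given-names: "{given_names}"')
    return "\n".join(lines) + "\n"


# Placeholder values only: this DOI and author do not refer to a real deposit.
print(
    render_cff(
        title="Example GA4GH Specification",
        version="1.0.0",
        doi="10.5281/zenodo.0000000",
        date_released="2022-01-01",
        authors=[("Doe", "Jane")],
    )
)
```

A shared template like this could sit in each repository, with only the per-product values changing between them.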

jmarshall commented 2 years ago

Using Zenodo means depositing a copy of a specification with Zenodo, and the resulting DOI refers to the copy at a zenodo.org URL.

IMHO if GA4GH is a serious standards-setting organisation, it should be capable of using DOIs that point to GA4GH's canonical specification documents or landing pages. For example, I believe becoming a member of CrossRef would be a way to produce such DOIs. (This also of course largely presupposes that GA4GH is capable of maintaining a stable technical website containing specifications at stable URLs. This is not something GA4GH has focussed on to date, but to my mind it would also be part of being a serious standards-setting organisation.)

I previously attempted to summarise the options for DOIs that should be investigated in https://github.com/samtools/hts-specs/issues/179#issuecomment-1102492363. Also as noted in the samtools/hts-specs#179 discussion, there are some other options that should be investigated in addition to DOIs.

ianfore commented 2 years ago

Given that what I posted here https://github.com/samtools/hts-specs/issues/179#issuecomment-1305951142 came out of a TASC call perhaps this thread would have been a better place for it. Cross-linking.

Discussion continues in that other thread, which is maybe not so bad as that is where the actual need came from.

andrewyatz commented 2 years ago

My feeling here is we have a number of issues colliding with each other such as

  1. Creating DOIs which point to a long-term archive of a standard (the Zenodo method)
  2. Creating DOIs to point to active documentation/artefact of a live standard (the CrossRef method)
  3. A manuscript which is cited with a DOI (publication)

Where are the priorities here? Have I missed another use case?

mshadbolt commented 2 years ago

Coming to the party late here and not an expert, but I think Zenodo would be a great option for a lot of reasons. Chief among them is that it is set up, ready and easy to use, and perhaps could be a solution until GA4GH sets up something more permanent or decides to mint DOIs and provide stable long-term storage. It gives you the ability to cite properly, attribute authorship properly, and get a DOI for every version that is uploaded as well as a URL that always resolves to the latest version (see here for more info on that). Plus integrations (OpenAIRE/ORCID), APIs etc. You can also set up 'communities' that group together everything e.g. https://zenodo.org/communities/australianbiocommons/?page=1&size=20 . Getting metrics on views and downloads could also be a useful feature.

I don't think this would negate the need to also publish in journals, but I think having something citable until a standard is published in a journal, as well as something that is update-able with new versions over time (that may not need to be re-published) is important.

Making records at FAIRsharing could also be an option, e.g. the Beacon entry https://fairsharing.org/FAIRsharing.6fba91. I like this because you can link together GitHub, documentation, publication etc. all in one place. I think you still need to store the standard somewhere stable outside of their platform though.

michaelmhoffman commented 2 years ago

I just noticed that standards are a first-class record type at CrossRef. See their Standards markup guide.

uniqueg commented 1 year ago

EDIT: Just realizing that I'm basically parroting what @mshadbolt has already said above. 100% agree. But CrossRef looks good, too, as @michaelmhoffman suggested. For me either Zenodo or CrossRef will be fine (or any other equivalent solution), as long as we have any solution that works for most use cases.


My perspective here: It would of course be great if GA4GH hosted its products itself and minted DOIs for them. But if (or as long as) that is not an option, this shouldn't stop us from solving this issue somehow in the meantime by creating guidelines for:

  1. Citing GA4GH products (this issue)
  2. Creating releases for GA4GH products that include where to host them, where to host the documentation and how to make them citable (agreeing with @michaelmhoffman: DOIs would be my favorite); there is already an issue for that: #46; ideally we could write a GitHub Action that products could easily include in their CI workflows

In fact, I believe that once 2. is available and adopted for all releases (past and future), then 1. becomes fairly trivial for the main use case of citing a specific release of a specific version. Citing a paper for a given product (if available) is complementary, in my opinion, and instructions for citing such a paper vs the specs (or both) can be included in the standard, docs or an accompanying file somewhere.

I think this could be fairly easily done via Zenodo, e.g., see the RO-Crate 1.1 spec. It also allows setting one DOI for each release of a product, and one DOI for the product as a whole (which always points to the latest release).

As for citing unreleased discussions/proposals/merges: These could be referred to via GitHub permalinks, but the guidelines should probably hint at the risks and discourage such citations in favor of DOIs of stable release snapshots wherever possible.

andrewyatz commented 1 year ago

To give an update here. Angela has worked quite hard with CrossRef and we certainly have a way forward there. CrossRef supporting standards as a first class entity was a major reason for adopting them. The GA4GH technical team is looking at how to mint these identifiers and provide additional tooling to help GA4GH to create DOIs. What is still clear are issues around:

It's in this light I'd like to frame TASC's discussions. Certainly any resource wishing to mint DOIs via another method is welcome to do so and GA4GH does not want to get in the way of this.

ajhpage commented 1 year ago

Hi all,

Right now we’re (I think) planning to add DOIs to each of the GA4GH.org pages that map to a specific standard (e.g., www.ga4gh.org/product/ga4gh-passports/). This page is intended to serve as a “one-stop-shop” for all things related to the standard, including past versions and associated documentation.

I do wonder if we should also add DOIs to the documentation page for each standard, and perhaps that would be suggestive of using predictable URLs (e.g., /passports, /passports-documentation, /passports-repository). But if CrossRef suggests obfuscation, they must have a good reason?

While I’m happy for groups to mint their own DOIs, I do think it would be wise to have some consistency across the suite. If we plan to mint DOIs for every standard and the WS does it independently, wouldn’t having two DOIs for the same thing be…less than desirable?

Signed, Learning As I Go

andrewyatz commented 1 year ago

Absolutely, the requirements for DOIs are different depending on the part of the organisation you refer to. So the individual pages will make sense, but so would individual documents about standards, and I think that is what TASC might be thinking about more than the top-level pages.

To quote CrossRef though about their reasons:

Suffixes are best when they include short strings that are easily displayed and typed but are ‘dumb’ - meaning, the suffixes contains no readable information, including metadata. Keep suffixes short. This makes them easier to read and to re-type. Remember, DOIs will appear online and in print. Remember, DOIs are persistent and not subject to correction or deletion.

As for the multiple minting, it would potentially be confusing but not the end of the world. I was more suggesting it as a stop-gap until we get this CrossRef work off the ground :)
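The CrossRef guidance quoted above (short, easy to type, and carrying no readable information) could be sketched as follows. This is only an illustration: the alphabet, the default length of 8, the function names and the prefix 10.99999 are all assumptions for this example, not GA4GH's actual prefix or CrossRef-mandated values.

```python
import secrets

# Lowercase letters and digits, leaving out look-alikes (0/o, 1/l/i)
# so that suffixes are easier to re-type from print.
ALPHABET = "abcdefghjkmnpqrstuvwxyz23456789"


def mint_suffix(existing: set[str], length: int = 8) -> str:
    """Return a random opaque suffix not already in `existing`, and record it."""
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if candidate not in existing:
            existing.add(candidate)
            return candidate


def make_doi(prefix: str, suffix: str) -> str:
    """Join a registrant prefix and a suffix into a DOI string."""
    return f"{prefix}/{suffix}"


# "10.99999" is a placeholder prefix, not one assigned to GA4GH.
issued: set[str] = set()
print(make_doi("10.99999", mint_suffix(issued)))
```

Because the suffix says nothing about the product or version, that information would live in the DOI's metadata, which can be updated, rather than in the identifier itself, which cannot.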

jkbonfield commented 1 year ago

A DOI to a standard page allows citing of the overall standard, but not citing a specific version in use, which can sometimes be vital for reproducibility. I think both have merit.

DOIs can have metadata attached, the most obvious being a URL, but this also permits authors. Having versions of specs with DOIs means the authors that arrive later on can still get credit for their input to that specific version of a specification, which is why I feel DOIs to spec versions are important.