co-cddo / open-standards

Collaboration space for discussing and exploring technical and data standards
A standard for persistently identifying documents #75

Closed edent closed 7 months ago

edent commented 3 years ago

Create A Challenge

I am creating this challenge on behalf of the Data Standards Authority, based on suggestions from the community

Title

A standard for persistently identifying documents and datasets, allowing academics to cite them in a convenient way.

Category

Challenge Owner

The Data Standards Authority was set up to make it easier and more effective to share and use data across government.

Short Description

Government publishes documents and data sets. These are often used in academic works. In order to aid discoverability, reuse, and ease of use, we propose that any new publication should have:

  1. A persistent resolvable identifier - which is decoupled from the URL of the document.
  2. (Potentially) a standard for how to cite the document. Similar to the OGL text.

User Need

Other needs TBD.

Expected Benefits

As a data-driven government, we want to be able to see where our data is being used. Having a common citation method enables us to see which publications have the most impact.

Creating an easily referenced identifier makes it easier for academics to use our work.

In the long term, it makes data easier to use and discover.

As part of our Open Government Partnership goals, we need to make it easier to work with open data.

Functional Needs

The current open standard - https://www.gov.uk/government/publications/open-standards-for-government/persistent-resolvable-identifiers#functional-needs - says:

URNs such as DOIs meet the requirement for persistence but do not meet the requirement for easy resolvability as they are only resolvable through separate services. DOIs, and the systems that support them, are designed to identify information assets such as documents whereas URLs have an unlimited scope.

and

URNs such as DOIs can also be created without conflict, but with a much higher governance barrier than URLs.

We believe that the time is right to re-examine these statements. DOI is now an established part of academia. We believe that DOI may increase the use of government data. Being able to measure the impact of our data publications will help us make the case for publication.

DOIs could be issued by the GOV.UK publishing system, or data.gov.uk, or the National Archives. Alternatively, they could be issued by each department - although that is likely to be an unnecessary duplication of effort.

References:

stevenjmesser commented 3 years ago

Could the resolvable identifier have a /info page that displays information about the document?

For example, Open Data Communities has http://opendatacommunities.org/id/london-borough-council/lewisham for the linked data resource, but resolves http://opendatacommunities.org/doc/london-borough-council/lewisham for web browsers, so that humans can read information about the resource.

The DOI URN https://assets.publishing.service.gov.uk/media/5f85aa45e90e0732a2448113/20-10-05_DFID_OA_in_LMICs_-_final_report.pdf might be https://assets.publishing.service.gov.uk/doi/7710073, for example, and https://assets.publishing.service.gov.uk/doi/7710073/info could show the parent page on GOV.UK (which is displayed in the link: HTTP header for some assets).

frankieroberto commented 3 years ago

A related issue is that government documents are often published in multiple formats (for instance, PDF and HTML), and sometimes with multiple separate attachments (summaries, appendixes, CSV/Excel files, etc). Would there be one DOI per file, or one per conceptual "publication"? The latter would arguably be more helpful, but would be a lot harder to define and implement.

The simplest thing might be one DOI per publication landing page on GOV.UK, but these can often contain a whole bunch of different documents from different time periods, so wouldn't be very helpful for citations. See example.

carwash commented 3 years ago

Is there a use-case for DOIs here not already covered by linked data good practice? If you can ensure that the linked data identifier (URI):

philarcher commented 3 years ago

In theory, and in some cases in practice, DOIs are completely unnecessary. However, given the very real need for persistence and the very common phenomenon of link rot, it's entirely understandable that people jump to an ID system that exists independent of the domain name system and typical content management systems. DOIs are a solution that many people have found very useful (notably the academic publishing community) and if it's useful, then it will be used.

There are alternatives that may or may not be attractive.

Occam's Razor says the solution is simple: don't allow link rot. The best example of this is w3.org which has operated under a persistence policy since day 1 and that is codified at https://www.w3.org/Consortium/Persistence. That document is out of date but the underlying policy is very much alive and fully applied. See https://philarcher.org/diary/2020/importanceOfPersistence/ for an exploration of this.

I'm told that applying a persistence policy is not always possible. Personally, I think it's not as hard to achieve as some people say, but say it they do - and so we live with it.

The starting point at GS1 is a little different. We start from the fact of billions of physical items that carry an identifier that was designed and implemented well before the birth of the Web. We have class-level identifiers (the ones that go beep at the checkout); we have batch/lot numbers, location IDs, organization IDs and many more. Putting that existing offline system online is what the majority of my work is about. There were several principles that underpinned the design of what is now called GS1 Digital Link, the most relevant one here being that the existing identifier is the identifier. When put into a URL, the domain name is not part of the identifier.

We've also defined (and implemented) a system whereby if a URL containing the identifiers points to a conformant resolver, then it can be linked to any number of related resources, each of which is identified by a typed link. The resolver will redirect queries to the default link unless it is requested to do otherwise. Such requests can be either a request to be redirected to a particular type of associated resource if available (by putting the desired link type in the query string), or simply a request for all available links associated with the item (we're strongly supporting https://tools.ietf.org/html/draft-wilde-linkset-07 for this).
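A minimal sketch of the resolver behaviour described above, in Python. It is illustrative only: the `LINKS` table, the `resolve` function, the link-type names, the `linkType` parameter name, the `all` value and every URL are assumptions made for this example, not the GS1 specification. Only the URL path is used as the lookup key, reflecting the principle that the domain name is not part of the identifier.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical link table: identifier path -> {link type: target URL}.
LINKS = {
    "/01/09506000134352": {
        "default": "https://example.com/product",
        "instructions": "https://example.com/product/manual.pdf",
    }
}


def resolve(url):
    """Return (status, payload) for a request to a resolver.

    307 with a target URL for a redirect, 200 with all links when
    linkType=all is requested, or 404 for an unknown identifier.
    """
    parts = urlsplit(url)
    links = LINKS.get(parts.path)  # the host is deliberately ignored
    if links is None:
        return 404, None
    requested = parse_qs(parts.query).get("linkType", ["default"])[0]
    if requested == "all":
        return 200, dict(links)
    # Fall back to the default link when the requested type is unavailable.
    return 307, links.get(requested, links["default"])
```

Two resolvers on different domains could hold different `LINKS` tables for the same identifier, which is the "sovereign" behaviour discussed in this comment.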

Much of this is similar to DOIs where an independent ID can be associated with metadata and/or redirected to a target. Where we differ is that GS1 resolvers are sovereign. That is, one resolver might point to different related resources than another (resolving DOIs always redirects to the same location, each resolver being synchronised). That gives you a resilient, decentralized but interoperable system based on IDs that exist in their own right. All supported by lots of free open source code.

It's Linked Data, it's HATEOAS, it's the Web. If it has any currency for your current effort @edent, of course I can expand.

edent commented 3 years ago

Thanks for the comments - a few answers based on my understanding:

Is there a use-case for DOIs here not already covered by linked data good practice

Yes. Academic users seem to prefer DOIs. There are lots of tools which automatically generate a citation / bibliography from a DOI. That's not as easy with a URL.

Because of the way GOV.UK structures its publications, it is possible to have multiple URLs for a single document. That makes it hard for users to choose which one to cite, and hard for others to see if multiple citations point to the same resource.

There are lots of automated tools for assessing how often a DOI has been cited. That's harder with a URL - especially if there are multiple resources.

An author of a document doesn't necessarily know where the document will be published, or what its URL will be. That makes it hard to embed a reference inside a document. Once a DOI is "minted" - that can be written into the document before publication.

Would there be one DOI per file, or one per conceptual "publication"?

I don't know. We need to solicit feedback.

I'm told that applying a persistence policy is not always possible.

GOV.UK is pretty good at persistence. And nearly everything is "backed up" by The National Archives. But things get lost, or accidentally removed, or redacted / withdrawn. Using DOI doesn't solve these problems - but may make it easier to spot when problems occur.

In addition, it is worth noting that modern Data Warehouses prefer DOIs:

frankieroberto commented 3 years ago

Also worth noting that some (but not all) publications on GOV.UK have an ISBN already, and apparently ISBNs can already be expressed within the DOI system.

As I understand it though, ISBNs are meant to be unique to the format (eg hardback, softback, ePub, PDF) and edition of the publication rather than the conceptual "work" (for which there are ISTC codes, although I've never seen anyone use that).

The first example I found on GOV.UK lists the same ISBN for both the PDF and HTML versions though, so I'm not sure how widely that's followed...

In short, the problem with all identification schemes is that no one ever agrees how they're meant to be used 😄.

philarcher commented 3 years ago

Also worth noting that some (but not all) publications on GOV.UK have an ISBN already, and apparently ISBNs can already be expressed within the DOI system.

As I understand it though, ISBNs are meant to be unique to the format (eg hardback, softback, ePub, PDF) and edition of the publication rather than the conceptual "work" (for which there are ISTC codes, although I've never seen anyone use that).

The first example I found on GOV.UK lists the same ISBN for both the PDF and HTML versions though, so I'm not sure how widely that's followed...

In short, the problem with all identification schemes is that no one ever agrees how they're meant to be used 😄.

We (GS1) have been speaking to ISBN about all this recently (nothing conclusive, just a chat with a positive atmosphere). ISBN-A is indeed similar to GS1 Digital Link but with important differences. Most notably, you wouldn't need the detail of 10.978.12345/99990 which has separators between the 978, the company prefix (12345) and the actual ID. Also, DL leverages HTTP content negotiation so one ID can easily lead to the same thing in different formats.
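To make the separators in `10.978.12345/99990` concrete, here is a small sketch in Python that rearranges a hyphenated ISBN-13 into that shape. It assumes the ISBN-A layout is the `10.` DOI prefix, then the 978/979 prefix, then the registration group and registrant joined together, then a slash, then the publication element and check digit. The function name is hypothetical and the ISBN is the made-up one from the example above; the check digit is not validated.

```python
def isbn13_to_isbn_a(hyphenated_isbn):
    """Rearrange a fully hyphenated ISBN-13 into ISBN-A form.

    Assumes the input has exactly five hyphen-separated elements:
    prefix, registration group, registrant, publication, check digit.
    """
    parts = hyphenated_isbn.split("-")
    if len(parts) != 5:
        raise ValueError("expected five hyphen-separated ISBN-13 elements")
    prefix, group, registrant, publication, check = parts
    return f"10.{prefix}.{group}{registrant}/{publication}{check}"


isbn_a = isbn13_to_isbn_a("978-1-2345-9999-0")
```

The conversion needs the hyphenated form because the boundary between registrant and publication element cannot be recovered from the bare 13 digits without the ISBN range tables.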

gbilder commented 3 years ago

As per comments above, you are bound to get lots of questions like: "do different formats get different DOIs?", "why not just use URIs, URNs, UUIDs, or hashes?" (thankfully nobody advocating bollockschain yet). Critical to the discussion is that DOIs (as used by DataCite, Crossref and some other RAs in the scholarly communication space) are "citation identifiers". They are not accession identifiers, workflow identifiers or supply chain identifiers. As citation identifiers, they are expected to behave in certain ways. And though I'm the first to admit DOIs have many flaws, other approaches (UUIDs, hashes, URIs) have their problems too. Particularly when you try to use them as citation identifiers.

oughnic commented 3 years ago

Hello all I welcome this challenge. A broader discussion on how we identify data assets and data specifications over time is critical. @edent has brought out some of the key issues. Sharing data set specifications between government departments and with business and citizens is increasingly important. A few other requirements:

galund commented 3 years ago

One of the complaints here is about gov.uk URIs for PDF downloads (etc) being weird (https://assets.publishing.service.gov.uk/etc). If the PDF download was at a URI that related clearly to the page from which it was linked (and there's always a single page from which such a thing is linked, that usually has a summary and lists available formats), then it would be way more obvious. Because those URIs are content-designed, and don't just have random asset tags etc. This is fixable by GOV.UK (I would have thought).

Despite the international standard, I still don't see any reason for people to trust a system that (AIUI) just places an extra intermediary into the process of resolving a given reference. Either the owner of the resource is capable of maintaining the reference as their infrastructure/info architecture changes (and in which case they can competently keep a URI working just as much as a DOI), or not (in which case a DOI won't help).

oughnic commented 3 years ago

Despite the international standard, I still don't see any reason for people to trust a system that (AIUI) just places an extra intermediary into the process of resolving a given reference.

The use case I have is where the platform for publication changes several times over an asset's lifetime, ending up in National Archives. Having an indirection allows the persistent resolvable identifier to be persistent without having to persist the redirects on successor CMS platforms over decades. Perhaps I'm in a special use case having data definitions for valuable data that originated over 3 decades ago.

edent commented 3 years ago

I agree @galund. The URL design on GOV.UK could be better for downloadable assets. However, we have a cohort of users who have expressed a need for a reference standard which matches established academic practice.

There is also a use-case of a digital library which maintains a cache of works deposited via DOI. If the DOI cannot resolve to a working website, a service like DataCite can still return metadata about the work. And a search engine can serve up a copy of the file based on the DOI.

tommorris commented 3 years ago

We already have an excellent case study in a type of government-published document that is referred to frequently by a non-URI identifier: command papers. These are cited somewhat frequently by academics working in areas including law, politics, public policy, criminology and other social sciences. In law, for instance, OSCOLA (PDF)—the authority in the UK for citing legal authorities—includes mention of how to cite Command Papers at §3.4.3 (pp. 40-41), and the preferred method is to cite the department name, the title of the paper, the command paper number, the year, and any paragraph/page number e.g.

Home Office, Report of the Royal Commission on Capital Punishment (Cmd 8932, 1953)

(Incidentally, out of interest, Cmd 8932 is a document of some importance in British legal and social history as it helped pave the way for the eventual abolition of the death penalty in the UK, both by domestic legislation and by the ratification of Protocol 13 of the ECHR. It isn't available on the web, along with a lot of historical command papers pre-1997, some of which cover similarly interesting and historically important developments in British political history.)

There is currently no publicly available index on the web of all UK government command papers, or any way to dereference a command paper from the citation alone. I stress currently because historically, before the creation of GOV.UK and the transfer of the function of publishing command papers to GOV.UK and GDS, command papers could be looked up by number on the old TSO official-documents.gov.uk site - here's an old page from the National Archives web archive showing a listing of every command paper published in 2010. There were also departmental listings.

Let's consider a practical example of a fairly randomly selected command paper published by the FCO in April 2016, Cm 9245. An HTML page exists on GOV.UK for this document here which links through to a variety of versions, including the canonical PDF version. This HTML page contains no mention of the fact it was published as Cm 9245. If you use the 'Official Documents' search on GOV.UK, it returns a search result page which contains no reference to the document in question, presumably because the HTML page that links to the PDF does not contain the Cm 9245 citation. If you do a general web search for Cm 9245, you may end up getting to a written statement on Parliament's website where a government minister officially announced the document to the Lords. The link in the statement does correctly link to the page on GOV.UK for the document. That's not exactly seamless.

This document was published over three years after GOV.UK launched, and now GOV.UK makes it almost impossible to find the document. Fortunately for lawyers, academics and others lucky to have a key behind the institutional paywall, the UK edition of the proprietary legal database Westlaw makes it extremely easy to look up command papers by series identifier, with a link back to the PDF. Good for Westlaw's paying users, I guess, but not so good for curious citizens who don't have the money or institutional affiliation who want to be good citizens and monitor exactly what their government is up to. (Also not great for Westlaw users who don't particularly want to jump through institutional/corporate single-sign-on nightmares to access what ought to be publicly accessible government information.)

Command papers are an interesting example, precisely because they are a succession of periodically rebooted (just before they overflow at 10,000) integer-referenced document series. There's nothing stopping someone in government from setting up a 302 redirect from https://gov.uk/command-papers/cm/9245 to the above command paper. One could also assign a URN (urn:govuk:cm:9245?) and even a DOI (doi:10.1234/cm9245?) if one so desired.
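A rough sketch of what such a redirect lookup could look like, in Python. Everything here is hypothetical: the `COMMAND_PAPERS` index, the placeholder target URL, the `redirect_for` function and the `/command-papers/<series>/<number>` path shape (taken from the example URL above) are assumptions, and the list of series prefixes is included as an assumption rather than an authoritative register.

```python
import re

# Hypothetical index: (series, number) -> landing page. The target is a placeholder.
COMMAND_PAPERS = {
    ("cm", 9245): "https://www.gov.uk/government/publications/example-command-paper",
}

# Series prefixes assumed for illustration (C, Cd, Cmd, Cmnd, Cm, CP).
PATH = re.compile(r"^/command-papers/(c|cd|cmd|cmnd|cm|cp)/(\d+)$")


def redirect_for(path):
    """Return (302, target) for a known command paper, otherwise (404, None)."""
    match = PATH.match(path.lower())
    if not match:
        return 404, None
    key = (match.group(1), int(match.group(2)))
    target = COMMAND_PAPERS.get(key)
    return (302, target) if target else (404, None)
```

Keying on the series as well as the number matters because the numbering restarts with each series, so the integer alone is ambiguous.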

One could, except for the fact that metadata management has become so deprioritised by GOV.UK that the likelihood that GOV.UK is going to be able to reliably resolve all but a small handful of the 52,583 command papers published before 2019 (or the 301 papers published since) is quite low.

If the purpose is enabling the "tools which automatically generate a citation / bibliography from a DOI", then the lack of good metadata structure won't help much. Last time I checked, there's no meaningful effort on the part of GOV.UK to publish any type of inline (or sidecar) metadata: RDFa, microformats, Schema.org markup etc. Putting the above link to the FCO report (Cm. 9245) into the Zotero-based citation generator used by Wikipedia returns simply that it was published by GOV.UK, along with the title. It does not even currently parse the date of publication correctly, because GOV.UK document pages mark up their dates as follows:

<div class="app-c-published-dates " lang="en">
    Published 21 April 2016
    <br>Last updated 21 July 2016
      — <a href="#history" class="app-c-published-dates__history-link govuk-link">see all updates</a>
</div>

(If only someone had come up with some way to mark up the dates and times when documents were published in a commonly-agreed machine-readable way...)
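The aside presumably alludes to the HTML `<time>` element and its `datetime` attribute. As a sketch of why that helps citation tools, the Python below reads dates from a version of the snippet rewritten to use `<time>`. The `SAMPLE` markup is hypothetical (it is not what GOV.UK emits), and the `TimeCollector` and `published_dates` names are made up for this example.

```python
from datetime import date
from html.parser import HTMLParser


class TimeCollector(HTMLParser):
    """Collect the datetime attribute of every <time> element as a date."""

    def __init__(self):
        super().__init__()
        self.dates = []

    def handle_starttag(self, tag, attrs):
        if tag == "time":
            value = dict(attrs).get("datetime")
            if value:
                # This sketch only handles plain YYYY-MM-DD values.
                self.dates.append(date.fromisoformat(value))


# Hypothetical markup: the same block as above, with machine-readable dates.
SAMPLE = """
<div class="app-c-published-dates" lang="en">
    Published <time datetime="2016-04-21">21 April 2016</time>
    <br>Last updated <time datetime="2016-07-21">21 July 2016</time>
</div>
"""


def published_dates(html):
    collector = TimeCollector()
    collector.feed(html)
    return collector.dates
```

With markup like this, a tool needs no locale-specific parsing of "21 April 2016"; it reads the attribute directly.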

If the metadata isn't published, and the sequential numerical identification standards (plural, alas) used by the British government since 1870 aren't currently dereferencable, including for documents published within the last five years, the idea that adopting DOIs or another non-URL-based process will fix the discoverability, citation and dereferencing woes seems somewhat optimistic.

DRossiter87 commented 3 years ago

As I tweeted, we (BSI) are a registration agency for DOIs as we have them for our standards. I'm currently trying to identify who is the best point of contact (will let you know when I do)

One benefit for us is that we are able to use the DOI as a landing page to signpost to multiple locations. For example, here is the DOI for a standard: [https://doi.org/10.3403/30118619U], the landing page has a link to both the shop and its location on our subscription service. Interestingly, the shop, landing page, and subscription link all use the same UID (000000000030362830 in this instance) but it's the DOI that connects them together.

This may be a method, for example, to have the PDF and HTML of a govt publication against the same persistent ID?

woodbine commented 3 years ago

There would be significant benefit to having DOIs for our business (a procurement analysis firm), as we would be able to reconcile errors with individual publishers. Having a standard for identification would also establish consistency for agencies that are auto-publishing on behalf of public agencies (e.g. tender portal companies) who are currently using different interpretations of what constitutes a document and, for instance, a date of publication.

The use of a standard would also have a strong educational benefit, providing those who publish data with a greater awareness of the need to provide consistent and useful metadata on individual documents.

Finally, this is also a chance to indulge in a common bugbear: the lack of organisational identifiers for public bodies. We can probably implement a DOI standard without having persistent identifiers for public bodies, but to do so without first establishing these identifiers would be a significant oversight and would likely cause problems for publications in the long term.

RickMoynihan commented 3 years ago

Disclosure: I work for a company helping government publish data on the web.

Just a small note to say there are already established and de facto standards for citing URLs on social media in the form of the Open Graph Protocol which itself leverages RDFa to embed appropriate citation metadata in HTML. This is currently well supported on the web by all major social media companies and digital content creators, so there are already a wide variety of tools to help create it.

Reviewing and/or extending these standards and recommending a profile for citing resources would be technically trivial. The challenge would largely be in getting it adopted in citation tools (if they don't already).

Obviously this only covers a small part of what DOIs provide.

I see also @tommorris is saying something similar.

jezcope commented 3 years ago

Disclosure: I work at the British Library, where I manage the DataCite consortium of UK-based government and research organisations that the BL leads; the Library was a founding member of DataCite and contributed to much of its initial design and infrastructure, though it's now wholly independent of us.

I've just had a good read through this thread, and I don't have much to add at this stage other than a few technical points. I think @gbilder captures perfectly the intended use-case of DOIs, along with @oughnic's comment that "Having an indirection allows the persistent resolvable identifier to be persistent without having to persist the redirects on successor CMS platforms over decades", all of which seems to match well what @edent seems to be looking for. One of the things I've learned over the years is that it's often necessary to have several "unique" identifiers for a single thing, as long as it's clear what value each is bringing. DOIs do an excellent job of supporting the long-term accessibility of cited objects even if those objects move across organisational boundaries, and come with a supporting cast of tools that make them easier to use in practice for researchers (such as the CrossCite citation formatting tool).

By default, a DOI resolves to a human-readable landing page (by design, since the main use case is for humans to find documents cited in other documents). Metadata, or the referred object in other formats (e.g. PDF), can be requested using HTTP Content Negotiation or a plain GET to a structured URL.

If you want to get an idea of how DOIs interact with the other available persistent identifiers (PIDs), including those for people, organisations, funders, projects, etc., it's worth a look at the work that the FREYA project has done on the PID Graph.
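As an illustration of the content negotiation route, here is a minimal Python sketch that builds (but does not send) a request asking the DOI resolver for metadata instead of the landing page. The `doi_metadata_request` helper is hypothetical, the DOI is one quoted elsewhere in this thread, and the CSL JSON media type is one I understand Crossref and DataCite to support; whether a given DOI's registration agency honours it is an assumption.

```python
from urllib.parse import quote
from urllib.request import Request

CSL_JSON = "application/vnd.citationstyles.csl+json"


def doi_metadata_request(doi, media_type=CSL_JSON):
    """Build a content-negotiated request for a DOI without sending it."""
    # quote() keeps "/" intact and escapes characters that are unsafe in a URL path.
    url = "https://doi.org/" + quote(doi)
    return Request(url, headers={"Accept": media_type})


request = doi_metadata_request("10.1787/ec98f531-en")
```

Passing `request` to `urllib.request.urlopen` would send it; the sketch stops short of that so it runs offline. The same URL requested without the `Accept` header would redirect to the human-readable landing page.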

prateekbuch-policylab commented 3 years ago

A simple comment from a relative lay-person perspective. If a significant use case is for academics, you may wish to engage the government social research profession and analytical function more broadly - not unfair to characterise them as govt's internal academics, and the citation/reuse of govt research is commonplace - DOIs may be of benefit there too.

A question to ask of them is already raised here - at what level does the DOI operate, or rather, what constitutes a publication: is it single pages/PDFs/assets, or baskets of them? No immediate preference given the technical workarounds mentioned above by others.

Also keen to see how this develops for the command paper situation mentioned by @tommorris - I can imagine that a similar case could be made for regulations, on eg COVID19 or post-Brexit trade.

Will watch with interest, thanks!

northwestopendata commented 3 years ago

End-user lay-person here - a few questions

oughnic commented 3 years ago

@edent Hello Terrence - If we do embrace such an identification system, perhaps the scope should be extended from documents and data sets to include digital assets such as images, schema, (in healthcare FHIR resource profiles), algorithms, derivation rules etc. So I suggest replacing "documents and data sets" with "documents and digital assets"

oughnic commented 3 years ago

includes mention of how to cite Command Papers at §3.4.3 (pp. 40-41), and the preferred method is to cite the department name, the title of the paper, the command paper number, the year, and any paragraph/page number e.g.

One certainty in life is that the shape and form of government evolves over time as new priorities emerge. Department name feels particularly unstable.

woodbine commented 3 years ago

@oughnic organisation names are particularly problematic in Govt. I (and others) bang on endlessly about the need for robust, consistent identifiers for public entities, but there's an underlying problem that does make it hard. Currently different organisations are created and amended in different ways and there's no public record of those procedures. We really need to get to the bottom of what makes an entity, or a document, or a dataset come into being and then make sure that is the process that triggers a new identifier. If we don't tie identifiers to creation then we risk creating duplicate ids for what is essentially the same data. This tends to be easy where there's an authority that has domain over all of the data (e.g. Companies House), but where documents are spread over multiple institutions you also need to establish commonality of production.

ekoner commented 3 years ago

I welcome this challenge and have read the comments so far with interest.

The feedback below isn't directly from me, rather it represents the comments shared by my colleagues:

  1. We welcome this challenge in principle; however, we raise questions about the relationship between DOIs and URIs in this context.
  2. It feels like it needs additional steps which aren't described - will creation of DOIs be built into a publication system like GOV.UK so that publishers have no additional actions? Note not all of the information we publish is on GOV.UK (for example large transparency datasets don't fit with the GOV.UK model, nor does software published on GitHub).

PeterParslow commented 3 years ago

I haven't read all the earlier comments.

All I will add here is that after engagement with parts of the UK academic community, through the British Geological Survey and the Marine Environment Data & Information Network, AGI's GEMINI metadata standard includes advice on several ways that DOIs can be used in a GEMINI record e.g. to identify the resource (dataset), link to a DOI landing page, or refer to a supporting document.

npch commented 3 years ago

Disclosure: I work for the Software Sustainability Institute, which has been working on best practice for citation of software and as part of this I collaborate with DataCite, one of the main providers of DOIs for datasets

I think that @gbilder has summed up the pros and cons well.

From an organisational perspective, the use of a machine-readable identifier such as a DOI, which allows for easy querying and access to associated metadata, and is easy to link to others, is what we should be striving for to ensure that we can understand the wider impact of documents, datasets and other outputs published by the government.

From a personal perspective, link rot is a real problem when I am researching and writing policy reports and guidance. While DOIs do not completely fix this issue, the previous finding that "[DOIs] do not meet the requirement for easy resolvability as they are only resolvable through separate services. DOIs, and the systems that support them, are designed to identify information assets such as documents whereas URLs have an unlimited scope." should be reexamined as I believe that the use of separate resolvers is actually a benefit and helps when documents are moved between domains, and that the continued access to metadata even if the artefact itself is not available helps understand what has been referenced. The only currently viable alternative, given that good practice around URL persistence has not been put in place already, is for all URLs used as references to point to the National Archive or Internet Archive copy of the document, which is not always possible and does not include metadata.

The DOI system isn't perfect, but the main thing is that it is relatively easy to use (it is easier to implement good practice for the user) and has shown it can work across hundreds of millions of documents, and be applied successfully to individual documents, as well as collections, series and versions, and to many different sorts of publications.

Therefore, I am in favour of the use of DOIs for all Government publications, including documents, data and code.

ChristopherCB commented 3 years ago

Disclosure: I work for Jisc and have been managing the PIDs for Open Access project looking at priority persistent identifiers and the potential of establishing a multi-PID consortium. I’m also a member of the British Library DataCite Consortium Advisory Board and the Research Organization Registry Community Advisory Group.

Jisc has been leading a PIDs for Open Access project aimed at expanding adoption and usage of persistent identifiers in the UK. This work builds on the 2019 report Developing a persistent identifier roadmap for open access to UK research. This is a community effort to establish a national PID strategy and involves stakeholders from all disciplines and sectors - funders, HEIs, infrastructure providers, libraries, publishers, researchers, research managers, etc. From this, five persistent identifiers have been deemed high priority for improving access to UK research. These are ORCID iDs for people, Crossref and DataCite DOIs for outputs, Crossref grant DOIs, ROR identifiers for organisations, and RAiDs for projects. The use of DOIs to integrate research outputs into the research ecosystem is described in There’s A PID For That, Part 5: Outputs. As part of developing a national strategy we are also looking at the benefits of running a multi-PID consortium.

Although this work is focussed on UK research it is relevant for a data-driven government that wants to encourage the discoverability and use of its outputs – publications and supporting data. I’d support the adoption of DOIs but it’s only part of the solution. There are benefits from adopting DOIs that have already been described above, but as mentioned (@woodbine) organisation identifiers are another important identifier. The metadata that enriches the DOI and the integration with other identifiers will improve the interoperability between systems and enable discoverability. So, yes, adopting DOIs is an important step in delivering the benefits required.

MatthewWoollard-UKDS commented 3 years ago

Disclosure: I work for the University of Essex which is the host of the ESRC-funded UK Data Service. The UK Data Service promotes the use of PIDs for government created data which it holds.

Should we have a single prefix for the whole of government? Or should each department have their own prefix and their own infrastructure? What makes the most sense for users?

A single prefix for all government would be less confusing; government departments change names/functions periodically, and certainly within the lifespan of key documents and datasets. Also, note that, like ISBNs, DOIs are not supposed to be human understandable. UK Government already has a mixed mode for allocation of ISBNs; if a publication is produced through contracts managed by The National Archives it will allocate an ISBN; if a government organisation is the publisher it is responsible for allocating ISBNs.

We also need to understand whether DOI meets our criteria for an open standard. Does it have an open organisational structure? Will the specification be published openly? Is it in widespread use outside of government?

This consultation concerns convenient citation of government documents and datasets by academics. DOI is not the only option for persistent identifiers, but it is already widely understood and used for citation of academic publications.

Finally, we need to understand the impact adopting DOIs would have on users who don't understand about them. Will this introduce confusion or any other unintended consequences?

I imagine that there will be no confusion, especially if references are made to include a full URL (https://doi.org/10.1787/ec98f531-en) as opposed to just 10.1787/ec98f531-en. There would be considerable benefits within and beyond academia, especially from use of persistent identifiers for publication of official statistics. Use of a DOI would enable all interested parties (central and local government, third sector, academia and business often all analyse the same data) to consistently cite and identify data sources, which is simply not possible with citations such as "Source: 2011 Census" or "Source: ONS". There would be a substantial net decrease in user confusion.
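To illustrate the difference between the bare DOI and its full URL form, here is a minimal Python sketch. The function name `doi_to_url` and the pattern `DOI_PATTERN` are illustrative assumptions, and the pattern is only a loose shape check, not the full syntax defined by the DOI standard.

```python
import re

# Loose shape check only (an assumption for illustration): "10.", one or more
# dot-separated numeric parts, a slash, then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d+(\.\d+)*/\S+$")


def doi_to_url(doi: str) -> str:
    """Return the resolvable https://doi.org/ form of a bare DOI string."""
    doi = doi.strip()
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a recognisable DOI: {doi!r}")
    # No percent-encoding is applied here; suffixes containing reserved
    # URL characters would need extra handling.
    return "https://doi.org/" + doi


print(doi_to_url("10.1787/ec98f531-en"))
```

A citation tool could apply something like this at the point a reference is written, so that readers always see the clickable form.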

There is also the issue about duplication – multiple 'publishers' making the same datasets/publications available through multiple routes. This is the difference between a canonical reference to a dataset and multiple recensions of it being published elsewhere. Our version of a dataset may differ in minor (and major) ways from the 'official' government version – and certainly the metadata will be different too.

Matthew Woollard -- on behalf of the UK Data Service. matthew@essex.ac.uk

epentz commented 3 years ago

Also worth noting that some (but not all) publications on GOV.UK have an ISBN already, and apparently ISBNs can already be expressed within the DOI system. As I understand it though, ISBNs are meant to be unique to the format (eg hardback, softback, ePub, PDF) and edition of the publication rather than the conceptual "work" (for which there are ISTC codes, although I've never seen anyone use that). The first example I found on GOV.UK lists the same ISBN for both the PDF and HTML versions though, so I'm not sure how widely that's followed... In short, the problem with all identification schemes is that no one ever agrees how they're meant to be used 😄. We (GS1) have been speaking to ISBN about all this recently (nothing conclusive, just a chat with a positive atmosphere). ISBN-A is indeed similar to GS1 Digital Link but with important differences. Most notably, you wouldn't need the detail of 10.978.12345/99990 which has separators between the 978, the company prefix (12345) and the actual ID. Also, DL leverages HTTP content negotiation so one ID can easily lead to the same thing in different formats.

Clarifying a point - DOIs can also leverage standard Linked Data standards and Content Negotiation - https://www.doi.org/doi_handbook/5_Applications.html#5.4 - so one DOI can lead to the same thing in different formats. While Crossref uses Content Negotiation our application uses DOI as a citation identifier so we don't identify different format versions of content - but the ISBN-A application certainly could.

Ed Pentz, Executive Director, Crossref

edent commented 3 years ago

Response via email:

DOI were something we looked at while we did the open data work.

I was particularly keen as I was worried about the version of the flood data people were using and thought this would be a great way to track that and ensure the latest version. It would also help track and justify where the open data went to ensure we kept the support to internally publish data. Which sometimes became difficult.

I got as far as talking to the British Library. They were administering the DOI at the time and were willing to try it with us. As I said at the call the barrier was the cost. Each of the Defra group would have to pay 1500 to 2k and we didn’t have it. If a government group one could be negotiated, then it would be great.

JackBookerSSSC commented 3 years ago

The use of the Digital Object Identifier (DOI) standard in government

The Scottish Social Services Council (SSSC) is the regulator for the social service workforce in Scotland. Our work means the people of Scotland can count on social services being provided by a trusted, skilled and confident workforce.

We protect the public by registering social service workers, setting standards for their practice, conduct, training and education and by supporting their professional development. Where people fall below the standards of practice and conduct we can investigate and take action.

In addition to our regulatory function, we are also an Official Statistics and National Statistics provider. For example, we are responsible for annually publishing the Scottish Social Services Workforce Data report and the Mental Health Officers report.

The Adoption of the DOI standard

The SSSC supports Government moves to adopt digital object identifiers (DOIs) for public sector work. DOIs, along with associated metadata systems, will increase transparency by fulfilling FAIR principles, making work Findable, Accessible, Interoperable and Reusable.

The adoption of the DOI standard will allow ideas to be appropriately recognised. This will highlight organisations’ place in the flow of information. We believe this will benefit the government by making it easier to track which outputs are being used by others, and increased awareness of good citation practice should also improve citation generally. Adopting the DOI standard will also mean publications will be able to better recognise the contributions of staff.

This would benefit us as we would be able to see clearly where the statistics we publish are being used or how reports that we publish are disseminated throughout the sector and beyond.

Providing persistent web links will improve the transparency and uptake of government work. It will also reduce the incidence of redundant web addresses being returned when searching for reports, making sure that the correct resources are always available.

Cost implications

We do not have any information regarding the financial cost for implementing the DOI standard. There are several platforms which already exist for providing DOIs and do so free of charge.

In order to address any costs associated with deploying the infrastructure required to implement DOIs, we feel that lessons should be learned from those currently using the DOI standard. Learning the lessons early on will ease deployment and should reduce potential delays to implementation.

Deployment

We feel that a reasonable deployment choice would be to add DOI functionality to existing services like data.gov.uk and statistics.gov.scot, although these would need to be extended to host other output types such as reports and presentations.

When the DOI standard is deployed, the priority should be making it as easy to use and accessible as possible for those creating outputs. There is likely to be a wide variety of ways government departments currently approve and release outputs, so streamlining this would be helpful. However, top-down constraints would not be welcomed and any new system must be easier than the current one.

Organisational structure of DOI

We feel the priority when implementing the DOI standard should be ensuring ease of use. The system used for DOI should be smoother and more user-friendly than the current system. Considerations regarding prefix identity should be secondary to user experience.

To our knowledge, DOIs are not in widespread use outside academia, but with ever increasing use of digital solutions, being able to consistently reference digital objects will only increase in importance.

Impact on users

Users who have not yet used DOIs are unlikely to see any difference to current operation, with the exception of a more consistent publication approach and metadata record. Ultimately DOIs will reduce confusion as their persistent nature will make government outputs more findable for a longer duration than they currently are.

Scottish Social Services Council December 2020

rhiaro commented 3 years ago

I am responding on behalf of Open Data Services Co-operative. We participate in the development of data standards already in use by the UK government, eg: 360Giving for grant data, Open Contracting for contract data, and IATI for international aid data.

Summary

What is helpful to users of government documents and data is stability of metadata access and identifiers. This can be achieved by outsourcing to doi.org at the cost of losing control over contents and location of metadata, control over the canonical URL of resources, and membership fees for joining (or becoming) a Registration Agency (RA). Outsourcing to doi.org does not solve the problem of link rot on government websites, nor completeness or accuracy of metadata for documents and datasets. An alternative would be for the UK gov to establish its own URL persistence policy and redirect service, and adopt a deliberately chosen appropriate metadata standard.

DOIs as a redirection layer

DOIs are most useful in their URL form (eg. CrossRef recommends to always use the hyperlinked doi.org version of a DOI; there are browser extensions to automatically convert DOI strings to DOI URLs).

DOIs used in this way are essentially a URL redirect service. As others in the thread have alluded to, this in itself does not solve link rot, and does not fix poor metadata. When generating a DOI for a document with a third-party service, it is still the responsibility of the registrant (in this case, a member of a UK government department) to update the URL that the DOI points to, if it changes, and to fill in and keep up to date appropriate metadata.

If DOIs are adopted, UK gov employees need a straightforward, accessible and fast way to generate new ones, add and change associated metadata, and to change the URL of a resource when necessary. How this is done varies depending on the infrastructure provided by the RA.

Software to automatically check existing DOIs to alert someone if they redirect to broken links would help with maintenance.
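A minimal sketch of such a checker, in Python, might look like the following. The function names `http_status` and `find_broken` are hypothetical, and the sketch assumes that a final status outside the 2xx/3xx range (or a network failure) means the DOI needs attention. Some landing pages reject automated clients, so a real tool would need to allow for false alarms.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def http_status(url: str, timeout: float = 10.0) -> int:
    """Fetch url (urllib follows redirects) and return the final HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; report their code.
        return err.code


def find_broken(dois: Iterable[str],
                fetch: Callable[[str], int] = http_status) -> List[str]:
    """Return the DOIs whose resolver URL does not end at a working page."""
    broken = []
    for doi in dois:
        try:
            status = fetch("https://doi.org/" + doi)
        except OSError:
            status = 0  # DNS failure, timeout, refused connection, etc.
        if not 200 <= status < 400:
            broken.append(doi)
    return broken
```

Calling `find_broken` with a list of registered DOIs would return the ones to alert someone about. The `fetch` parameter exists so the logic can be exercised without network access.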

Use of DOIs in this way goes against the gov.uk guidance on persistent resolvable identifiers:

“avoids reliance on centralised systems (and particularly those outside government) to manage that resolution”

Constraints introduced by DOIs

Depending on a third-party for your URL resolution infrastructure. In the case of DOI, there is a dependency on the Registration Agency (RA) of your choice, as well as the IDF, which is a US-registered corporation. Maintenance of the doi.org domain name (see: the time doi.org wasn't renewed), the technical infrastructure behind the redirects, and the agreement framework behind these services, are all completely outside the control of the UK gov. This could be a pro or a con, depending on how you look at it.

Identifier pattern limitations. CrossRef, for example, expects publishers to choose a "suffix pattern" to be used consistently across all DOIs. To make adoption of any standard easier, it is recommended to use existing internal identifiers as part of any new identifier as much as possible. For UK gov resources, coming from a variety of sources and government departments, there is no existing pattern for consistent internal identifiers, so something new would need to be minted, completely divorcing the DOIs from any existing identifiers associated with the resource.

Metadata limitations. Each Registration Agency offers different options on the metadata you can provide for a resource associated with a DOI. Research would need to be undertaken to determine if the options are sufficient, and what the path to extending it in future would be, in case needs change.

Metadata ownership. The RA is responsible for hosting metadata. The UK gov would need to consider how to backup the metadata to ensure it is not lost if the RA goes down or changes its governance model.

Registration Agencies

An important consideration is the choice of RA. None of the current RAs seem obviously suited to providing DOIs for UK government data and documents. CrossRef and DataCite are the most likely candidates, as most others have a very specific geographical focus, although they are geared towards academic work. To register DOIs with them, there is a membership fee.

An option would be for the UK gov itself to become an RA. There is overhead - joining the IDF (a membership fee), signing the RA agreement, and setting up the metadata infrastructure. RAs are also expected to promote the use of DOIs generally in their community. Advantages are: freedom to set up custom metadata fields and develop them as needed; having complete control over where and how the metadata is stored; setting the rules for the strings used in the DOIs themselves (so they could relate to existing internal identifiers, if useful); control over the workflow for registering DOIs and updating metadata, to better integrate it into existing internal processes.

Alternatives to DOI

If the core issue preventing the UK gov from depending on its own infrastructure for persistent identifiers is concern about link rot, it would benefit the UK gov in the long run to solve this by establishing a persistence policy for a particular domain or subdomain (could this be part of data.gov.uk? or the National Archives?) rather than depending on doi.org.

This can be used as a redirect service in a similar way to doi.org, without the external dependency. Documents and datasets that may be moved between departments or shifted around the government website as content management policies change can still retain a persistent URL.

A consistent, pre-agreed URL prefix would also meet the use case of inserting an identifier into a document before publication.
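As a rough illustration of how small the core of such a redirect service is, here is a Python sketch. Everything in it is an assumption for illustration: the `/id/doc/0001` path scheme, the `example.org` locations, the in-memory `REGISTRY`, and the choice of a 302 response. A real service would need durable storage and an agreed persistence policy.

```python
from typing import Dict, Optional, Tuple

# Hypothetical registry: persistent path -> current location of the resource.
REGISTRY: Dict[str, str] = {
    "/id/doc/0001": "https://example.org/department-a/report-2020",
}


def resolve(path: str,
            registry: Dict[str, str] = REGISTRY) -> Tuple[int, Optional[str]]:
    """Return (status, Location) for a persistent path."""
    target = registry.get(path)
    if target is None:
        return 404, None
    # A temporary redirect keeps the persistent URL as the one people cite.
    return 302, target


def move(path: str, new_location: str,
         registry: Dict[str, str] = REGISTRY) -> None:
    """Point an existing persistent path at a new location."""
    if path not in registry:
        raise KeyError(path)
    registry[path] = new_location
```

When a document moves between departments, only the registry entry changes; the cited path stays the same.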

Tracking citation data

DOIs enable measuring use/impact and automatic generation of citation data not through some inherent quality of DOIs themselves, but because they are managed by a centralised, third-party service, operating under a set of agreements that set expectations about how these identifiers should resolve. It is possible to achieve the same with any URIs.

If adopted, it is important to consider the data sharing and privacy policies of the RA used for UK gov data, and of doi.org itself. Mandating one of the RA search engines or navigating to a resource via doi.org means that someone who is citing a UK gov document, or looking up a citation of something they have read, has no choice about sharing their activity with these third parties. Similarly, access by the UK gov to these statistics can be restricted by the third parties at any time.

Resource versions and content types

Changes to documents and resources (eg. errata, additional information) which generate new versions would need a new identifier. This makes it clear precisely which version of something is being cited.

There are several questions in this thread about whether an identifier should be for a specific content type of a resource (eg. PDF vs HTML), or for the abstract resource itself. Providing the contents are functionally the same, in the UK gov's case - which is primarily about citation - identifiers should be for the latter. This is normal practice - most DOI RAs make use of HTTP Content Negotiation to do this already.

rhiaro commented 3 years ago

I note several comments in this thread (@woodbine @ChristopherCB) have brought up a need for unique, persistent and consistent identifiers for organisations. The Open Contracting, IATI, 360 Giving and Beneficial Ownership data standards make use of org-id as a repository of identifiers for registration authorities which can be used to derive globally unique identifiers for organisations. This post shows how org-id identifiers can be used to link data across different standards.

JonathanClarkDOIF commented 3 years ago

The DOI Foundation welcomes this challenge and would be happy to work with the Data Standards Authority to develop a formal proposal. This is a lengthy post but we hope that it is helpful.

The use of DOI could provide a citation mechanism that would be robust in the face of the regular changes in government departmental structure and the transition to the National Archives. Whilst no system provides a “magic bullet” and some active management is always necessary, the DOI can remain stable during these changes and allow consistent referencing to the benefit of users.

We do not expect that the cost of an implementation would be a barrier. The details would depend on the chosen membership option, but the DOI Foundation operates on a not-for-profit, cost recovery basis. The main cost driver is the shared infrastructure that we run on behalf of the Registration Agencies (RAs).

We are interested in your concern about the cost of publishing delay but would be surprised if there was any material delay resulting from the use of DOI – an identifying code can be assigned instantaneously at the point of publication, when most metadata is already known and other information can be added later. There is considerable experience in the DOI community that you could draw upon for advice here.

We believe that organisational structure should be essentially independent of the numerical structure of the code. Once assigned a DOI is an opaque string and the prefix should not be expected to inform the user. Nevertheless, there are options that involve multiple or hierarchical prefixes that may be worth your consideration.

The UK Government could implement DOIs in various ways. It could work with an existing RA; Crossref and DataCite RAs offer both direct membership and consortium models for instance and both have experience working with government agencies, as do other RAs such as JaLC and KISTI. It could decide to become an RA itself or choose a hybrid option. The Office of Publications of the EU is an RA in its own right but outsources the registration operations to another RA (mEDRA). The costs would depend on the model chosen.

RAs provide not only assignment services but also other value-added services such as "Cited-By" (Crossref) and "Event Data" (Crossref and DataCite). These services are sometimes provided to other registration agencies by agreement between them.

We recognise the importance of open standards and it is for this reason that the DOI Foundation promoted ISO 26324 Digital Object Identifier System as an International Standard in 2012. Many UK experts were, through the British Standards Institution (as the UK's National Standards Body), involved in this process and the status of DOI as an International Standard guarantees that it will continue to be operated in an open way. The DOI Foundation was appointed by ISO to administer the standard and its operational governance is transparent and in the hands of users through their Registration Agencies. ISO rules ensure that The DOI Foundation operates on a strict cost recovery basis as a non-profit entity. ISO 26324 is published openly although end users seldom need to consult it because the handbook published by The DOI Foundation contains the information needed for compliance with the standard and consequently for interoperability with other users.

There is a call in the comments for identification to be extended beyond documents and datasets. The DOI is a generic identification standard focused, through profiles, on specific applications. In addition to the scholarly domain, the movie/TV industry has adopted DOI through the Entertainment ID Registry (EIDR) and a forthcoming application will identify construction products for the persistent identification of components in building projects.

And finally, for the sake of clarity, a DOI is not a URN although an application for registration as a URN namespace has recently been made and is pending. The core DOI format is the string starting "10dot". But this is usually represented as a web URI by prefixing the domain of our proxy server. The proxy allows various options beyond simple resolution which are documented in the DOI handbook.

We are of course happy to answer any other questions that arise in your consultation.

Jonathan Clark Managing Agent The DOI Foundation

csarven commented 3 years ago

ISO 26324 is paywalled (88 CHF).

Not particularly an "open" standard. See also https://en.wikipedia.org/wiki/International_Organization_for_Standardization#Criticism_and_laments .


The DOI Handbook's Standards states:

DOI is a registered URI within the info-URI namespace (IETF RFC 4452, the "info" URI Scheme for Information Assets with Identifiers in Public Namespaces).

and RFC 4452 on Why Not Create a New URN Namespace ID for Identifiers from Public Namespaces?:

RFC 2141 [RFC2141] states that "Uniform Resource Names (URNs) are intended to serve as persistent, location-independent, resource identifiers". The "info" URI scheme, on the other hand, does not assert the persistence of the identifiers created under this scheme but rather of the public namespaces grandfathered under this scheme.

Thus, while a DOI string eg. 10.1000/182 is deemed to be a persistent identifier, info:doi/10.1000/182 is not.


In practice, there are issues with DOIs with fragments [citations needed - perhaps others would like to expand on this.]


Re DOI FAQ on Does the DOI system use a Content Distribution Network?:

(2) Cloudflare may store a cookie. This cookie does not store any personally identifiable information, but can be used by Cloudflare to distinguish distinct users using a shared IP address in case different security responses are appropriate; for instance the cookie can determine which clients behind a suspect IP address have successfully passed a challenge. See https://support.cloudflare.com/hc/en-us/articles/200170156-What-does-the-Cloudflare-cfduid-cookie-do-.

That may be a non-starter for some (Third-party tracking / privacy violation, and centralisation through CloudFlare - or any other third-party with the same behaviour - is a potential vulnerability [citation needed - there are plenty].)


Although this may be a tangent or not a concern for the UK government, DOI Registration Agencies such as CrossRef and DataCite are membership-based services for organisations and exclude individuals.


The W3C Recommendation Architecture of the World Wide Web on URI Persistence states:

URI persistence is a matter of policy and commitment on the part of the URI owner.

Thus, as long as the UK government exists, it can allocate URIs to its own resources. By using its own URIs (ie. under the authority of gov.uk), the UK government can set and control a range of specific policies. For instance, some of its URIs may be "more" (or "less") persistent under certain URI paths or conditions (social processes). Each policy associated to a URI can be machine-readable and discoverable. The UK government's registration authority can for instance be handled via id.gov.uk or work directly with the National Archives to allocate HTTP URIs for its resources.

On the other hand, and perhaps needless to say at this point, if the UK government were to outsource the canonical referencing of its own resources on the Web to third-party services, then it ultimately relies on their governance and maintenance.

Put differently, is the UK government counting on the longevity of the DOI Foundation more than its own National Archives?

PeterParslow commented 3 years ago

Simply in response to "Not particularly an "open" standard":

I believe that these challenges accept the UK government's Open Standards Principles, with their definition of open standard

https://www.gov.uk/government/publications/open-standards-principles

In particular, "free to use" means that you won't have to pay someone a patent fee in order to implement it. It doesn't mean that you (the software developer) get the design work done for free, or that certain standards bodies (BSI, ISO, CEN, IEC) are excluded because of their business models.

The idea of agreeing that up front was so that we don't need to revisit that argument for every single challenge.

edent commented 3 years ago

Thank you all for your considered responses and challenging questions. I'm going to keep this discussion open while we wait for the new Open Standards Board to be convened.

Our intention is to take this standard to the Board, along with the evidence that you have provided.

Generally speaking, there are two possible outcomes from the meeting:

  1. The Board asks for more evidence before they can approve. In which case, they will give us detailed feedback for us to address. We will make the decision whether to proceed or not.
  2. The Board approves the standard. In which case, we will set out plans for a pilot programme to see how it would work in practice.

Please do reply with any other thoughts, and we will let you know the outcome of the board meeting.

ghobona commented 2 years ago

Recent (April 2021) changes in modern web browsers to support HTTPS on the address bar by default mean that there needs to be a clear statement about whether applications should consider the http URL of the resolution service to be equivalent to its https alternative, and vice versa.

ghobona commented 2 years ago

A separate comment.

@frankieroberto pointed out, above, the need to clarify whether a single identifier would apply to different formats of the same document.

@edent Will the proposed OSB standard say something regarding Content Negotiation, since Content Negotiation can address the issue of how to access different formats of the same datasets? I note that crosscite.org offer an explanation of DOI Content Negotiation here.

jezcope commented 2 years ago

Hi @ghobona, thanks for these points! We'll be sure we capture them in the documentation, but in the interests of transparency:

"clear statement about whether applications should consider the http URL of the resolution service to be equivalent to its https alternative, and vice versa"

HTTPS is preferred and expected for any new use, but HTTP will always be considered equivalent along with the old resolver domain dx.doi.org simply because they are still present on older printed documents that predate the HTTPS recommendation, and those can't be updated. The resolver redirects HTTP to HTTPS and dx.doi.org to doi.org before returning any data.
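For applications that need to treat these forms as equivalent, one approach is to normalise every reference to the HTTPS doi.org form before comparing or storing it. The Python sketch below is an illustration only: the function name `canonical_doi_url` is hypothetical, the `doi:` prefix handling is an added assumption, and any query string or fragment on an input URL is dropped.

```python
from urllib.parse import urlsplit

# Resolver hosts treated as equivalent, per the explanation above.
RESOLVER_HOSTS = {"doi.org", "dx.doi.org"}


def canonical_doi_url(reference: str) -> str:
    """Map http/https, dx.doi.org/doi.org and bare forms to one HTTPS URL."""
    ref = reference.strip()
    parts = urlsplit(ref)
    if parts.scheme in ("http", "https") and parts.netloc.lower() in RESOLVER_HOSTS:
        doi = parts.path.lstrip("/")
    elif ref.lower().startswith("doi:"):
        doi = ref[4:]
    else:
        doi = ref
    if not doi.startswith("10."):
        raise ValueError(f"not a recognisable DOI reference: {reference!r}")
    return "https://doi.org/" + doi


print(canonical_doi_url("http://dx.doi.org/10.1787/ec98f531-en"))
```

Two references can then be compared by comparing their normalised forms, whichever variant appeared in the original document.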

"clarify whether a single identifier would apply to different formats of the same document"

This can vary depending on the exact use case, but we'll make sure we document the common issues and give clear advice. Multiple identifiers for "the same content" is generally discouraged, but where it's unavoidable there is scope in the metadata schema to make this relationship clear.

"Content Negotiation can address the issue of how to access different formats of the same datasets"

At present, Content Negotiation for DOIs focuses on obtaining different formats of the DOI metadata only. There are some common techniques for discovery of different formats of a dataset (usually embedded metadata in the landing page or a well-known URL) but I believe there is a proposal under discussion to do this in a more standardised way via the resolver. I'll try and find out more.
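As an illustration of metadata Content Negotiation, a client asks the resolver for a metadata format through the `Accept` header instead of following the redirect to the landing page. The Python sketch below only builds the request. The helper name `metadata_request` is hypothetical, and whether a given DOI's Registration Agency serves the CSL JSON media type is an assumption to check against the crosscite.org documentation.

```python
import urllib.request

# CSL JSON, one of the media types listed in the crosscite.org documentation.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def metadata_request(doi: str,
                     media_type: str = CSL_JSON) -> urllib.request.Request:
    """Build (but do not send) a request for a DOI's metadata."""
    return urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": media_type},
    )


request = metadata_request("10.1787/ec98f531-en")
print(request.full_url, request.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would send it; where the format is supported, the response body is the metadata record, not the document itself.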

ghobona commented 2 years ago

@jezcope Thanks for the clarifications.

jezcope commented 2 years ago

Just a quick followup now I've got some more detail from the DataCite team:

  1. Both Crossref and DataCite require DOIs resolve to a human-readable page (i.e. not directly to the content)
  2. The DOI standard allows for "multiple resolution" but this is for distributing load between multiple co-hosts rather than different formats of the same content
  3. There is a proposal to add Content URLs to the DataCite metadata schema in the upcoming version 4.5 which is likely to be closest to what you're looking for if I understand correctly, though will not involve Content Negotiation: that proposal will be available for public review later this year

LPardue commented 2 years ago

Thanks @jezcope . Just out of interest, on https://citation.crosscite.org/docs.html section 4, there are listed several values in the content type column. Per the IETF specs, these are Media Types that can formally be registered with IANA; see https://www.iana.org/assignments/media-types/media-types.xhtml.

application/rdf+xml and text/turtle are already registered. Are you aware of any intent to formally register the other media types listed on the page?

jezcope commented 2 years ago

A further update: the forthcoming version 4.5 of the DataCite Schema has now entered a public consultation period, so please feel free to take a look and provide feedback. You can read more about this process and find the proposed new schema version on the DataCite blog.

@ghobona: v4.5 introduces a new Distributions property which allows specification of one or more URLs for direct access to data.

@LPardue: I'm not aware of any such intent at present (and most of those Media Types are "owned" by other groups and organisations, to the extent such an unofficial Media Type can be owned), but I'll pass the recommendation along.

edent commented 1 year ago

ONS are starting a pilot of DOI!

https://cddo.blog.gov.uk/2023/08/09/making-it-easier-to-track-impact-at-ons/

(NB, I don't work here any more - just passing on the good news.)

DrJacqui commented 1 year ago

Love the excitement Terence! CDDO and OSB have been delighted to support ONS to move this forward, with Louise as the champion.

Kind regards Dr Jacqui Taylor Founder, CEO FlyingBinary Ltd http://www.flyingbinary.com

DidacFB-CDDO commented 7 months ago

As per housekeeping practices we are closing this with the status as a recommended standard.

PeterParslow commented 7 months ago

As per housekeeping practices we are closing this with the status as a recommended standard.

Does that mean this recommendation: https://www.gov.uk/government/publications/open-standards-for-government/persistent-resolvable-identifiers? Or is there one on DOIs somewhere else?

DidacFB-CDDO commented 7 months ago

Hi Peter, yes that's the case. It was endorsed as "recommended" by the Data Standards Authority Steering Board. See here: https://alphagov.github.io/data-standards-authority/standards/digital-object-identifier/

PeterParslow commented 7 months ago

But https://www.gov.uk/government/publications/open-standards-for-government/persistent-resolvable-identifiers discourages the use of DOIs (see paragraphs in clause 5) whereas https://alphagov.github.io/data-standards-authority/standards/digital-object-identifier/ suggests it is the way to go for DOIs....

oughnic commented 7 months ago

I think the clause 5 is effectively saying use DOIs in their URL form, not URN form. The DSA endorsed standard requires URL.

The DOI page doesn’t refer to the underpinning ISO standard, which is unfortunate, and gov.uk doesn’t reference alphagov, so the standard isn’t on Open standards for government (https://www.gov.uk/government/publications/open-standards-for-government).