bio-guoda / preston

a biodiversity dataset tracker
MIT License

respond to suspicious claim in Page 2023 doi:10.3897/BDJ.11.e107914 that "[...] To date, there are no widely used hash-based identifiers for publications. [...]" #259

Open jhpoelen opened 1 year ago

jhpoelen commented 1 year ago

In (Page 2023), the following claim is made:

Unlike DOIs and similar identifiers, there is typically no centralised mechanism to resolve hash-based identifiers. Some decentralised systems have been developed, but it is unclear if they themselves will persist. To date, there are no widely used hash-based identifiers for publications.

References

Page R (2023) Ten years and a million links: building a global taxonomic library connecting persistent identifiers for names, publications and people. Biodiversity Data Journal 11: e107914. https://doi.org/10.3897/BDJ.11.e107914

jhpoelen commented 1 year ago

@rdmpage Congrats on your recent publication "Ten years and a million links: building a global taxonomic library connecting persistent identifiers for names, publications and people" in Biodiversity Data Journal. Great to see that you are making strides to facilitate finding connections between people, their publications and associated taxa.

Curious to hear your thoughts on my initial response, below, to a claim you made in your publication (see above).

As mentioned in Elliott, M.J., Poelen, J.H. & Fortes, J.A.B. Signing data citations enables data verification and citation persistence. Sci Data 10, 419 (2023). https://doi.org/10.1038/s41597-023-02230-y hash://sha256/f849c870565f608899f183ca261365dce9c9f1c5441b1c779e0db49df9c2a19d , existing (data) publication platforms already support retrieval by content (or hash-based) identifiers. These include, but are not limited to: DataOne https://dataone.org, Zenodo https://zenodo.org, Software Heritage Library https://softwareheritage.org, and Wikimedia Commons https://commons.wikimedia.org/ . For examples of how these existing capacities can be used to create the appearance of a centralized service, please see https://github.com/bio-guoda/preston/issues/181 (DataOne), https://github.com/bio-guoda/preston/issues/149 (Zenodo), https://github.com/bio-guoda/preston/issues/70 (Software Heritage Library), and https://github.com/bio-guoda/preston/issues/239 (Wikimedia Commons). With this, you can retrieve a picture of a bunny provided by https://commons.wikimedia.org/wiki/File:Oryctolagus_cuniculus_Rcdo.jpg via https://linker.bio/hash://sha1/86fa30f32d9c557ea5d2a768e9c3595d3abb17a2 . Also, you can retrieve a datafile https://zenodo.org/record/7977436/files/triples.nt , indirectly mentioned in this publication via their DOI https://doi.org/10.5281/zenodo.7977435 , through https://linker.bio/hash://md5/c1e6c5410e49eea484a4a873589967d7 . Neither example required any special processing other than re-using content retrieval methods already available. This is why I'd like to encourage you to revise your potentially misleading section on content-based identifiers to reflect their widespread use in the existing platforms mentioned above and beyond.
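To make the identifier format used above concrete, here is a minimal Python sketch of how a `hash://<algorithm>/<hex digest>` identifier can be derived from content and prefixed with a resolver address. The function names `hash_uri` and `resolver_url` are hypothetical, and the resolver prefix simply follows the `https://linker.bio/hash://...` pattern of the links in this comment; this is not taken from Preston's implementation.

```python
import hashlib


def hash_uri(content: bytes, algorithm: str = "sha256") -> str:
    """Build a hash:// identifier from raw bytes (hypothetical helper)."""
    digest = hashlib.new(algorithm, content).hexdigest()
    return f"hash://{algorithm}/{digest}"


def resolver_url(uri: str, resolver: str = "https://linker.bio/") -> str:
    """Prefix a hash:// identifier with a resolver address.

    The default prefix mirrors the linker.bio links in this thread;
    any other resolver accepting the same pattern is an assumption.
    """
    return resolver + uri


if __name__ == "__main__":
    # Any bytes will do; in practice these would be the bytes of a file.
    uri = hash_uri(b"some content", "sha1")
    print(uri)
    print(resolver_url(uri))
```

The identifier depends only on the bytes, so anyone holding a copy can recompute it without asking the original host.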

Also, published as an inline text comment via https://bdj.pensoft.net/article/107914/list/13/ . (see attached screenshot).

Screenshot from 2023-09-26 11-17-11

jhpoelen commented 1 year ago

See also https://hyp.is/FT1sVlyPEe61HEsqRwMq3A/bdj.pensoft.net/article_preview.php?id=107914 and https://hypothes.is/a/FT1sVlyPEe61HEsqRwMq3A , and the attached screenshot.

image

jhpoelen commented 1 year ago

@rdmpage responded via hypothes.is annotation available through https://hypothes.is/a/FT1sVlyPEe61HEsqRwMq3A. Note that the content associated with the hypothesis annotation link has drifted (or changed) with his comment and my brief request for clarification.

I find it ironic how the subject of discussion (e.g., Page 2023 https://doi.org/10.3897/BDJ.11.e107914 ) has been extended / annotated but there's no mechanism provided (that I know of) to point to a specific version of that extension. I think this (scholarly?) exchange presents an excellent example of the topic at hand: how to cite (digital) works so that they can be retrieved as originally cited?

So, to make sure to link the comments to this related github issue, I'll capture a screenshot and provide copy-pasted text in addition.

rdmpage 3 hrs ago

Hi @jhpoelen1829, I knew I could rely on you to provide feedback ;)

The examples you provide are not for publications, so I stand by the claim: "there are no widely used hash-based identifiers for publications". Yes, one can compute hashes for content, for example my BioNames project uses sha1 hashes to identify PDFs. But I've not seen hashes used to identify scholarly publications, and indeed doing so would get complicated because a publication can have multiple representations (HTML, XML, PDF, ePub), each would have different hashes, requiring us to somehow link those hashes together. Nor can we necessarily rely on hashes, given that publishers often generate uniquely stamped PDFs when users download the PDF (so that hashes of the "same" content need not match). I'm typing this in hypothesis.is, which has documented some of the challenges in identifying PDFs from their digital signature.

If the argument is whether there can be hash-based identifiers for publications, then sure, there can be (with the complications I mentioned above). But there aren't any (that I'm aware of).

Regarding hash resolvers, I wasn't aware of linker.bio. I had in mind Ben Trask's writings, hosted on bentrask.com, which is now dead, as is the hash resolver he discussed, hash-archive.org. So forgive me if I'm skeptical about the longevity of hash resolvers. Yes, some sites support hash-based resolving via API calls, and you can put a service in front of those (reminiscent of how we resolve LSIDs). That's great, but you still need people to cite those hashes (and know that they exist in the first place; there is no hash on File:Oryctolagus_cuniculus_Rcdo.jp that I can see).

I'm sympathetic to hash-based identifiers, but I think it is fair to say they are not that common, nor are they often cited.

and my reply:

jhpoelen1829 (edited) 2 hrs ago

Hey @rdmpage - Thanks for taking the time to reply.

Perhaps a good starting point is to agree on definitions.

You mention:

The examples you provide are not for publications, so I stand by the claim: "there are no widely used hash-based identifiers for publications".

So, please humor me by considering these basic, perhaps silly, questions:

What is your definition of a publication?

Do you consider a publication a well-defined physical/digital artifact?

Or is a publication more like a collection of abstract ideas, that, by definition, cannot be referenced directly nor pointed to with a finger?

Additional questions would be: How would you reference an abstract idea? By a particular physical/digital piece of evidence that reflects that idea? And, could such a piece of evidence be a publication?

See also attached screenshots.

Screenshot from 2023-09-26 17-08-22 Screenshot from 2023-09-26 17-08-28

jhpoelen commented 11 months ago

On 27 Sept 2023 (2023-09-27) @rdmpage replied via https://hypothes.is/a/fOpuuF0BEe6Py-usR5zoCw to @jhpoelen's hypothes.is comment https://hypothes.is/a/sS42-FykEe6WVyMSoSL2vA posted on 26 Sept 2023 (2023-09-26) -

rdmpage Sep 27 @jhpoelen1829 I'm using publication loosely in the sense of article, book, chapter, etc. I guess conceptually I would follow a (much) simplified version of Functional Requirements for Bibliographic Records - we have "works" and their "representations" (what schema.org refers to as encodings). The paper we are discussing has various representations (HTML, PDF, etc.). I replied to you above using hypothes.is while viewing the HTML version. I'm now typing this comment using the same system to view the PDF version. The "work" has a DOI, which I take to refer to the work and all its representations. So I feel that I am commenting on the same paper, regardless of whether it is the HTML or the PDF. Likewise we typically cite works rather than particular representations.

I saw your comment https://github.com/bio-guoda/preston/issues/259#issuecomment-1736374540

I find it ironic how the subject of discussion (e.g., Page 2023 https://doi.org/10.3897/BDJ.11.e107914 ) has been extended / annotated but there's no mechanism provided (that I know of) to point to a specific version of that extension.

I would model this differently. I would treat the annotations as an overlay on a (notionally) unchanged digital object (the HTML, PDF, etc.). Obviously in practice the HTML of the web site will have changed (because the annotation has been injected into the page), but if we want exact byte matching for web pages then for any modern web page this is likely impossible (given all the behind the scenes Javascript, etc. that gets injected into most pages). So I don't think we need a new identifier for the annotated HTML, rather we make a link between the annotation and the thing being annotated.

As an aside, have you seen Dorian Taylor's writings, e.g. "Summer of Protocols: Retrofitting the Web"? They might appeal. He separates content into transparent (e.g., metadata) and opaque (e.g., image) and uses RDF for the former and a hash-based content store for the latter.

See also attached screenshot from the hypothesis image

jhpoelen commented 11 months ago

In reply to @rdmpage 's https://hypothes.is/a/fOpuuF0BEe6Py-usR5zoCw as accessed on 2023-10-17 -

@rdmpage said -

I'm using publication loosely in the sense of article, book, chapter, etc. I guess conceptually I would follow a (much) simplified version of Functional Requirements for Bibliographic Records - we have "works" and their "representations" (what schema.org refers to as encodings). The paper we are discussing has various representations (HTML, PDF, etc.). I replied to you above using hypothes.is while viewing the HTML version. I'm now typing this comment using the same system to view the PDF version. The "work" has a DOI, which I take to refer to the work and all its representations. So I feel that I am commenting on the same paper, regardless of whether it is the HTML or the PDF. Likewise we typically cite works rather than particular representations.

Thank you for referencing the "Functional Requirements for Bibliographic Records" (FRBR) model as a way to express what you mean when you say "publication". So, if I understand it correctly, you see publications as equivalent to "works" as mentioned in FRBR's

"[...] Group 1 entities are work, expression, manifestation, and item (WEMI). They represent the products of intellectual or artistic endeavor. [...]"

So, relating this to what I asked you on 26 Sept 2023 in https://hypothes.is/a/sS42-FykEe6WVyMSoSL2vA, namely,

Additional questions would be: How would you reference an abstract idea? By a particular physical/digital piece of evidence that reflects that idea? And, could such a piece of evidence be a publication?

I understand you think of a publication as the following: A bibliographic reference points to a work, and a DOI is an identifier for a work.

And. . . according to the definition, a work is related to its expressions, manifestations (did you mean to say "manifestations" instead of "representations"?), and items. As far as I understand, manifestations and items are physical objects (e.g., a digital copy), whereas works and expressions are non-physical.

So, when you are citing a work, you are citing not just the work, but also its expressions, manifestations, and items. How else do you know that this work exists without encountering some manifestation of it? And, as you implied, there are many such manifestations and items, especially when publishers inject banner ads or some kind of watermark into the digital copies that they issue. And because there are likely many manifestations and items (e.g., rendered HTML pages, watermarked PDFs) related to a single work, describing the work fully becomes a rather involved task - you'd have to track all copies. One way to cite the work is to point to some abstract, unverifiable object like a DOI, or a text fragment formatted as a bibliographic citation. In their current use, I can see how a DOI points to a work without expressions, manifestations or items. With this, a DOI technically cites something that only exists in the abstract. In other words, a DOI doesn't point to anything physical, but instead is more like an axiom, some abstract idea.

And, as far as I understand, the reader expects an item to manifest itself after "clicking on" (or materializing?) a DOI: the reader expects an item, whereas the DOI only points into some abstract space, leaving some server that handles a DOI resolve request to pull something up that they interpret to be a manifestation of the associated DOI. And this forces the reader to trust the server (or whoever presents some item) to be perfect.

And, in my experience, communication networks, and the servers and humans that power them, are far from perfect. Not only that, when dealing with large numbers of DOIs, violations of trust (willingly or not) are to be expected and will accumulate over time, rendering "big data" analysis on referenced works untrustworthy and expensive to fact check.

So, this is why I am advocating to add at least one content identifier to a digital "item" (a specific copy of a manifestation of an expression of a work): to enable "big data" analysis on scientific corpora. And one such analysis can be as simple as: given two copies of a corpus consisting of 1 TiB across millions of digital files, I'd like to show that they are the exact items as cited in a single bibliographic citation, using commonly available resources (e.g., internet connection, a laptop) and a little time (hours/days).
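The kind of corpus comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Preston does it: the names `file_sha256`, `corpus_manifest` and `differences` are hypothetical, and a corpus is assumed to be a plain directory tree on local disk. Files are hashed in chunks so that memory use stays small even for a corpus of many large files.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash one file in fixed-size chunks to keep memory use bounded."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def corpus_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def differences(a: dict[str, str], b: dict[str, str]) -> list[str]:
    """List relative paths that are missing from one copy or differ in content."""
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


def same_corpus(root_a: Path, root_b: Path) -> bool:
    """True when both directory trees hold byte-identical files at identical paths."""
    return not differences(corpus_manifest(root_a), corpus_manifest(root_b))
```

If the manifest itself is hashed and that hash is included in the citation, a single identifier is enough to check every file in a copy of the corpus.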

Perhaps I can phrase my desire to be specific when citing works with digital items in a question:

How do you imagine enabling large scale data analysis without being able to verify that an item (or copy) of a cited digital corpus is authentic?

-jorrit

PS As for you commenting on an item of your work related to https://doi.org/10.3897/BDJ.11.e107914 via https://bdj.pensoft.net/article/107914/download/pdf/ . . . why not say, hey, I am commenting on this specific item retrieved from https://bdj.pensoft.net/article/107914/download/pdf/ with hash://sha256/7abbc8d544bc43734dbe524fe78f4c1ba93ae74cce6a8bc886f916a71896c8fe and hash://md5/435643f61997c3006e24d793ec29917b , and cite it with something like:

Page R (2023) Ten years and a million links: building a global taxonomic library connecting persistent identifiers for names, publications and people. Biodiversity Data Journal 11: e107914. https://doi.org/10.3897/BDJ.11.e107914 . Accessed via https://bdj.pensoft.net/article/107914/download/pdf/ with hash://sha256/7abbc8d544bc43734dbe524fe78f4c1ba93ae74cce6a8bc886f916a71896c8fe and hash://md5/435643f61997c3006e24d793ec29917b.

With that reference, at least I can verify that we are looking at the same item. Right now, I am left with trust, but cannot verify. Perhaps not a big issue now that the pdf is still "warm", but what about 5 years, 50 years from now? Or if Pensoft decides to no longer serve copies of your paper?
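As a rough illustration of what "verify" could mean here, the following Python sketch checks whether the bytes of a retrieved copy match every hash identifier listed in a citation like the one above. The helper names `parse_hash_uri` and `matches_citation` are hypothetical, and the sketch assumes the identifiers follow the `hash://<algorithm>/<hex digest>` shape used in this thread, with algorithm names that Python's `hashlib` recognizes.

```python
import hashlib


def parse_hash_uri(uri: str) -> tuple[str, str]:
    """Split 'hash://<algorithm>/<hex digest>' into (algorithm, digest)."""
    prefix = "hash://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a hash identifier: {uri!r}")
    algorithm, separator, digest = uri[len(prefix):].partition("/")
    if not separator or not algorithm or not digest:
        raise ValueError(f"malformed hash identifier: {uri!r}")
    return algorithm, digest.lower()


def matches_citation(content: bytes, cited_uris: list[str]) -> bool:
    """True only if the content matches every cited hash identifier."""
    if not cited_uris:
        return False  # nothing to verify against
    for uri in cited_uris:
        algorithm, expected = parse_hash_uri(uri)
        if hashlib.new(algorithm, content).hexdigest() != expected:
            return False
    return True
```

Applied to the citation above, `content` would be the bytes of the downloaded PDF and `cited_uris` the two identifiers (sha256 and md5) quoted in the reference. A watermarked or otherwise altered copy would fail the check, which is the point: the reader learns that they hold a different item.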

page2023.pdf

jhpoelen commented 11 months ago

cross posted via https://hypothes.is/a/65ArJm0qEe67IEOtGBnYAg (see also screenshot below).

image

jhpoelen commented 11 months ago

via https://hypothes.is/a/V9DjuG2NEe6TI3c3SAp42w @rdmpage responded -

@jhpoelen1829 You might find Digital Object Identifiers: Promise and Problems for Scholarly Publishing interesting. The origins of DOIs are in managing access to intellectual property assets, typically behind paywalls. I understand the desire for content based identifiers where the content can be verified, but that’s not the model adopted by academic publishers. For content that is open access we could build a mapping between DOIs and hashes for PDFs (although there may be as many hashes as downloads if the PDF is watermarked). The tasks are many, we are few.
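The DOI-to-hash mapping suggested above could start out as a simple two-way index. The Python sketch below is purely illustrative: the class `DoiHashIndex` and its methods are hypothetical, and it assumes that one DOI may be associated with many hash identifiers (for example, one per watermarked PDF) while each hash identifier belongs to a single DOI.

```python
from collections import defaultdict
from typing import Optional


class DoiHashIndex:
    """Two-way lookup between DOIs (works) and hash identifiers (items)."""

    def __init__(self) -> None:
        self._hashes_by_doi: dict[str, set[str]] = defaultdict(set)
        self._doi_by_hash: dict[str, str] = {}

    def add(self, doi: str, hash_uri: str) -> None:
        """Record that the item identified by hash_uri is a copy of the work at doi."""
        self._hashes_by_doi[doi].add(hash_uri)
        self._doi_by_hash[hash_uri] = doi

    def hashes_for(self, doi: str) -> set[str]:
        """All known item hashes for a work; empty if the DOI is unknown."""
        return set(self._hashes_by_doi.get(doi, ()))

    def doi_for(self, hash_uri: str) -> Optional[str]:
        """The work a given item belongs to, or None if the hash is unknown."""
        return self._doi_by_hash.get(hash_uri)


if __name__ == "__main__":
    index = DoiHashIndex()
    doi = "https://doi.org/10.3897/BDJ.11.e107914"
    # Identifier quoted earlier in this thread for one retrieved PDF.
    index.add(doi, "hash://sha256/7abbc8d544bc43734dbe524fe78f4c1ba93ae74cce6a8bc886f916a71896c8fe")
    # Placeholder standing in for a second, differently watermarked copy.
    index.add(doi, "hash://sha256/" + "0" * 64)
    print(len(index.hashes_for(doi)))
```

With such an index, a DOI keeps its role as the name of the work, while each hash pins down one verifiable item.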