bio-guoda / preston

a biodiversity dataset tracker
MIT License

re-use registry of CETAF stable identifier initiative #101

Closed jhpoelen closed 3 years ago

jhpoelen commented 3 years ago

https://cetaf.org/cetaf-stable-identifiers contains a list of 14 institutions that adopted a URI/URL scheme to reference their specimen.

jhpoelen commented 3 years ago

as provided by @qgroom.

@mielliott we might be able to use these patterns to identify re-use/referencing of specimen in the preston biodiversity data graph.

jhpoelen commented 3 years ago

from the CETAF page retrieved from https://cetaf.org/cetaf-stable-identifiers on 15 Dec 2020 -

Stable Identifiers Implementers Group

To date, 14 CETAF institutions joined the initiative and provide LOD-compliant identifiers for individual specimens. For each of them the following list provides an example identifier, a link to a catalogue for searching specimens and their identifiers, as well as an indication of whether a redirect to machine-readable metadata has already been implemented.

Botanischer Garten und Botanisches Museum Berlin
Example: http://herbarium.bgbm.org/object/B100277113
Catalogue: http://ww2.bgbm.org/Herbarium/
Redirect to machine-readable representation: yes

Finnish Museum of Natural History, Helsinki
Example: http://id.luomus.fi/GL.749
Catalogue: no
Redirect to machine-readable representation: no

Institute of Botany, Slovak Academy of Sciences, Bratislava
Example: http://ibot.sav.sk/herbarium/object/SAV0001234
Catalogue: http://ibot.sav.sk/herbarium
Redirect to machine-readable representation: no (no redirection by passing rdf header, but rdf is accessible at http://ibot.sav.sk/herbarium/data/SAV0001234.rdf)

Museum für Naturkunde, Berlin
Example: http://coll.mfn-berlin.de/u/ZMB_Orth_BA000061S01
Catalogue: no
Redirect to machine-readable representation: yes

Muséum national d'histoire naturelle, Paris
Example: http://coldb.mnhn.fr/catalognumber/mnhn/ec/ec32
Catalogue: https://science.mnhn.fr/all/search
Redirect to machine-readable representation: yes

Naturalis Biodiversity Center, Leiden
Example: http://data.biodiversitydata.nl/naturalis/specimen/RMNH.AVES.110103
Catalogue: http://bioportal.naturalis.nl/
Redirect to machine-readable representation: yes

Natural History Museum, London
Example: http://data.nhm.ac.uk/object/a9bdc16d-c9ba-4e32-9311-d5250af2b5ac
Catalogue: http://data.nhm.ac.uk/
Redirect to machine-readable representation: yes

Natural History Museum - University of Oslo
Example: http://purl.org/nhmuio/id/41d9cbb4-4590-4265-8079-ca44d46d27c3
Catalogue: http://nhmo-birds.collectionexplorer.org/
Redirect to machine-readable representation: yes

Royal Botanic Garden Edinburgh
Example: data.rbge.org.uk/herb/E00421509
Catalogue: http://elmer.rbge.org.uk/bgbase/vherb/bgbasevherb.php
Redirect to machine-readable representation: yes

Staatliches Museum für Naturkunde Stuttgart
Example: http://col.smns-bw.org/object/S10000227722006
Catalogue: http://www.smns-bw.org/db/datenbank.php
Redirect to machine-readable representation: no

Staatliche Naturwissenschaftliche Sammlungen Bayerns
Example: http://id.snsb.info/snsb/collection/97112/153455/93009
Catalogue: http://www.snsb.info/dwb_biocase.html
Redirect to machine-readable representation: yes

Zoologisches Forschungsmuseum Alexander Koenig, Bonn
Example: http://id.zfmk.de/collection_ZFMK/2003261
Catalogue: https://www.collections.zfmk.de/
Redirect to machine-readable representation: yes (http://herbal.rbge.info/?uri=https://id.zfmk.de/collection_ZFMK/2003261)

Botanic Garden Meise
Example: http://www.botanicalcollections.be/specimen/BR0000008422330
Catalogue: http://www.botanicalcollections.be/#/en/home
Redirect to machine-readable representation: yes

Royal Museum for Central Africa
Example: http://darwinweb.africamuseum.be/object/RMCA_Vert_2011.003.P.1885-1898
Catalogue: http://darwinweb.africamuseum.be/search_specimens
Redirect to machine-readable representation: no
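
As a rough illustration of how the example identifiers listed above could be turned into matching patterns, here is a minimal Python sketch. The patterns and the `identify_institution` helper are hypothetical: each pattern is generalized from the single example shown for that institution, so real identifiers may well vary more than this. Only four of the 14 institutions are included.

```python
import re

# Hypothetical patterns, each generalized from the one example identifier
# listed for the institution; they are not taken from an official registry.
CETAF_PATTERNS = {
    "Botanischer Garten und Botanisches Museum Berlin":
        r"https?://herbarium\.bgbm\.org/object/[A-Za-z0-9]+",
    "Finnish Museum of Natural History, Helsinki":
        r"https?://id\.luomus\.fi/[A-Za-z]+\.[0-9]+",
    "Natural History Museum, London":
        r"https?://data\.nhm\.ac\.uk/object/[0-9a-f-]{36}",
    "Botanic Garden Meise":
        r"https?://www\.botanicalcollections\.be/specimen/[A-Za-z]+[0-9]+",
}


def identify_institution(uri):
    """Return the first institution whose pattern matches the whole URI, else None."""
    for institution, pattern in CETAF_PATTERNS.items():
        if re.fullmatch(pattern, uri):
            return institution
    return None


print(identify_institution("http://id.luomus.fi/GL.749"))
print(identify_institution("http://example.org/not-a-specimen"))
```

Using `re.fullmatch` keeps the check strict; scanning free text for embedded identifiers would call for `re.findall` or `re.finditer` instead.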

qgroom commented 3 years ago

This Google Sheet also has some information, including regular expressions for the identifiers (on the 2nd tab) https://docs.google.com/spreadsheets/d/1vHl2xDghffm6HfQhVeruHV6ZAWAnrc-2LPasq0fOyF4/edit?usp=sharing

jhpoelen commented 3 years ago

@qgroom thanks for sharing the CETAF google sheet link related to the stable identifier initiative. What is the status of the initiative? How often is the spreadsheet maintained?

I've started a Preston Identifier Registry to help capture the various identifier schemes used in our biodiversity informatics communities with the goal to make it easier to track them across the various datasets. Hoping to get your continued input on how to best collect and keep track of these identifier schemes.

jhpoelen commented 3 years ago

btw - I took a first pass at adding the CETAF identifiers to the registry at https://github.com/bio-guoda/preston-identifier-registry .

jhpoelen commented 3 years ago

@qgroom just in case you are interested - I was able to index about 1.8M Meise specimen ids using regex:

http://www.botanicalcollections.be/specimen/[a-zA-Z]+[0-9]+

Here are the 10 most frequently referenced Meise specimens across unique datasets published in the GBIF / iDigBio / BioCASe networks (one or more occurrences within a single dataset count as 1):

frequency of occurrence id
17 http://www.botanicalcollections.be/specimen/BR0000005434497
10 http://www.botanicalcollections.be/specimen/BR0000006953164
9 http://www.botanicalcollections.be/specimen/BR5020151980780
9 http://www.botanicalcollections.be/specimen/BR5020151979777
9 http://www.botanicalcollections.be/specimen/BR5020109189647
9 http://www.botanicalcollections.be/specimen/BR5020107295135
9 http://www.botanicalcollections.be/specimen/BR5020107240555
9 http://www.botanicalcollections.be/specimen/BR5020107238538
9 http://www.botanicalcollections.be/specimen/BR5020107236510
9 http://www.botanicalcollections.be/specimen/BR5020107234493

See the attached file (meise-id-freq.txt.gz) for the complete list with frequency of occurrence. Please let me know if I missed any.
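
The counting rule described above (several mentions within one dataset count as 1) can be sketched in Python as follows. This is not the indexing pipeline actually used for the numbers above; `dataset_frequency` and the miniature `example` datasets are made up for illustration, and the dots in the pattern are escaped so that they only match literal dots.

```python
import re
from collections import Counter

# The Meise pattern from this comment, with literal dots escaped.
MEISE_ID = re.compile(
    r"http://www\.botanicalcollections\.be/specimen/[a-zA-Z]+[0-9]+"
)


def dataset_frequency(datasets):
    """Count, per identifier, how many datasets mention it at least once.

    `datasets` maps a dataset name (or content hash) to its text content.
    """
    counts = Counter()
    for text in datasets.values():
        # set(): repeated mentions inside one dataset count only once
        counts.update(set(MEISE_ID.findall(text)))
    return counts


# Made-up miniature datasets, for illustration only.
example = {
    "dataset-a": (
        "see http://www.botanicalcollections.be/specimen/BR0000005434497 and "
        "again http://www.botanicalcollections.be/specimen/BR0000005434497"
    ),
    "dataset-b": (
        "http://www.botanicalcollections.be/specimen/BR0000005434497\t"
        "http://www.botanicalcollections.be/specimen/BR0000006953164"
    ),
}

for identifier, n in dataset_frequency(example).most_common():
    print(n, identifier)
```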

jhpoelen commented 3 years ago

I couldn't resist tracing the origin of the occurrences of specimen http://www.botanicalcollections.be/specimen/BR0000005434497 and found that the specimen is referenced in two collections:

Vascular Plant Collection, University of Washington Herbarium

and

Meise Botanic Garden Herbarium (BR)

It looks like your specimen are getting referenced . . . at least once ; ) via

$ ./consume-link-registry.sh location | grep hash://sha256/2aac81caac5e82dc8a533edae38c3e1d1ad477ab8d19f7680b9da75ad54ba6b5
<http://www.pnwherbaria.org/data/getdataset.php?File=WTU_Vascular_DwCA.zip> <http://purl.org/pav/hasVersion> <hash://sha256/2aac81caac5e82dc8a533edae38c3e1d1ad477ab8d19f7680b9da75ad54ba6b5> .

in which the resolved content has associated email addresses:

$ ./consume-link-registry.sh email | grep hash://sha256/2aac81caac5e82dc8a533edae38c3e1d1ad477ab8d19f7680b9da75ad54ba6b5
tmatheso@artsci.wustl.edu   hash://sha256/2aac81caac5e82dc8a533edae38c3e1d1ad477ab8d19f7680b9da75ad54ba6b5
hamid.razifard@gmail.com    hash://sha256/2aac81caac5e82dc8a533edae38c3e1d1ad477ab8d19f7680b9da75ad54ba6b5
wtu@u.washington.edu    hash://sha256/2aac81caac5e82dc8a533edae38c3e1d1ad477ab8d19f7680b9da75ad54ba6b5
blegler@u.washington.edu    hash://sha256/2aac81caac5e82dc8a533edae38c3e1d1ad477ab8d19f7680b9da75ad54ba6b5

with the exact match found in the following line:

$ ./consume-link-registry.sh location | grep "hash://sha256/2aac81caac5e82dc8a533edae38c3e1d1ad477ab8d19f7680b9da75ad54ba6b5" | preston match "[\n].*BR0000005434497"
% Reached end of topic location [0] at offset 3295469: exiting
<urn:uuid:04fb5740-c0d3-49ad-a745-c5df846dbd9b> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> <urn:uuid:04fb5740-c0d3-49ad-a745-c5df846dbd9b> .
<urn:uuid:04fb5740-c0d3-49ad-a745-c5df846dbd9b> <http://www.w3.org/ns/prov#used> <hash://sha256/2aac81caac5e82dc8a533edae38c3e1d1ad477ab8d19f7680b9da75ad54ba6b5> <urn:uuid:04fb5740-c0d3-49ad-a745-c5df846dbd9b> .
<urn:uuid:04fb5740-c0d3-49ad-a745-c5df846dbd9b> <http://purl.org/dc/terms/description> "An activity that finds the locations of text matching the regular expression '[\\n].*BR0000005434497' inside any encountered content (e.g., hash://sha256/... identifiers)."@en <urn:uuid:04fb5740-c0d3-49ad-a745-c5df846dbd9b> .
<cut:zip:hash://sha256/2aac81caac5e82dc8a533edae38c3e1d1ad477ab8d19f7680b9da75ad54ba6b5!/occurrence.txt!/b155490316-155491199> <http://www.w3.org/ns/prov#value> "\n3264094 31de1563-5d6c-422e-a07b-30033a5b4488    9666DDFC-9683-4448-819D-2687084AF012    PhysicalObject  2019-08-08T14:14:05-0800    en  https://creativecommons.org/publicdomain/zero/1.0/University of Washington  PreservedSpecimen           WTU Vascular    WTU         137205  Accession: 137205   Gleichenia linearis Dicranopteris linearis  Gleicheniaceae                  Gleicheniaceae  Gleichenia  linearis        species (Burm. f.) C.B. Clarke          ICBN        R. Bonaparte    1923-10-10      le mercredi [Wednessday].           [Illegible handwriting follows collector's name on label.]      H. Vanderyst    12.600      1921-10 1921    10      274 274     Democratic Republic of the Congo    ZZ  Democratic Republic of the Congo            Region de Bamfunuka.        -4.35   19.633333       10000   John Haskins        \"Bamfunuka\" seems to refer to a people rather than a specific place. Coordinates taken from collection with same locality at http://www.botanicalcollections.be/specimen/BR0000005434497" <urn:uuid:04fb5740-c0d3-49ad-a745-c5df846dbd9b> .
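
The location identifier in the last statement above packs several pieces of information into one string. The Python sketch below splits it into its parts. The layout (`cut:zip:<content hash>!/<entry in archive>!/b<start>-<end>`) is an assumption read off this single example rather than taken from a specification, and `parse_cut_uri` is a hypothetical helper; in particular, whether the two numbers are byte offsets and whether the range is inclusive is not confirmed here.

```python
import re

# Assumed layout, inferred from the example output above:
#   cut:zip:<content hash>!/<entry in archive>!/b<start>-<end>
CUT_ZIP = re.compile(
    r"cut:zip:(?P<content>hash://sha256/[0-9a-f]{64})"
    r"!/(?P<entry>[^!]+)"
    r"!/b(?P<start>[0-9]+)-(?P<end>[0-9]+)"
)


def parse_cut_uri(uri):
    """Split a cut:zip: location into content hash, archive entry and range."""
    match = CUT_ZIP.fullmatch(uri)
    if match is None:
        raise ValueError(f"unrecognized location: {uri}")
    return {
        "content": match["content"],
        "entry": match["entry"],
        "start": int(match["start"]),
        "end": int(match["end"]),
    }


location = (
    "cut:zip:hash://sha256/"
    "2aac81caac5e82dc8a533edae38c3e1d1ad477ab8d19f7680b9da75ad54ba6b5"
    "!/occurrence.txt!/b155490316-155491199"
)
print(parse_cut_uri(location))
```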

with this, I was also able to find the University of Washington specimen via https://www.pnwherbaria.org/data/results.php?DisplayAs=WebPage&ExcludeCultivated=Y&GroupBy=ungrouped&SortBy=Year&SortOrder=DESC&Herbaria=WTU&QueryCount=1&Accession1=137205&Zoom=4&Lat=55&Lng=-135&PolygonCount=0 (see screenshot)

[Screenshot from 2020-12-16 16-20-36]

Turns out the Meise specimen http://www.botanicalcollections.be/specimen/BR0000005434497 was used to enrich another record's locality information.

Good to know, right?

qgroom commented 3 years ago

Very cool! Our stable identifiers are not that old, so it is great to find even one usage.

The barcodes can have a terminal 'V', so the expression should be something like this: http://www.botanicalcollections.be/specimen/[a-zA-Z]+[0-9]+V?
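
To illustrate the difference the optional trailing 'V' makes, here is a small Python comparison of the two expressions. The identifier in `text` is hypothetical (an existing barcode with a 'V' appended purely for illustration), and the dots are escaped here, whereas the expressions as written in this thread leave them unescaped, which would let them match any character.

```python
import re

WITHOUT_V = re.compile(
    r"http://www\.botanicalcollections\.be/specimen/[a-zA-Z]+[0-9]+"
)
WITH_V = re.compile(
    r"http://www\.botanicalcollections\.be/specimen/[a-zA-Z]+[0-9]+V?"
)

# Hypothetical identifier carrying a terminal 'V'.
text = (
    "cited as http://www.botanicalcollections.be/specimen/BR0000005434497V "
    "on the label"
)

# The first pattern stops before the 'V'; the second one includes it.
print(WITHOUT_V.search(text).group(0))
print(WITH_V.search(text).group(0))
```

Without the `V?`, such an identifier would be indexed in truncated form and could be conflated with a barcode that lacks the suffix.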

We've been working on various ways of linking collections, though more from the angle of finding duplicate specimens. I'm tagging Anton Güntsch @aguentsch, because I know he will be interested.

You should really write this up as there is really a lot of interest in tracking usage of specimens.

aguentsch commented 3 years ago

I try to keep the google sheet up to date. We are using it ourselves to identify and harvest specimen data from collections which participate in our ID initiative (https://cetafidentifiers.biowikifarm.net/wiki/CETAF_Specimen_Catalogue).

jhpoelen commented 3 years ago

> Very cool! Our stable identifiers are not that old, so it is great to find even one usage.

@qgroom Glad to hear that you find this useful!

> The barcodes can have a terminal 'V', so the expression should be something like this: http://www.botanicalcollections.be/specimen/[a-zA-Z]+[0-9]+V?

@qgroom I've updated the regular expression associated with Meise's specimen. Hoping to re-index all the data in a bit.

> You should really write this up as there is really a lot of interest in tracking usage of specimens.

I wish a github issue would be sufficient to communicate our finding to a wider audience. However, if you'd like I can explore ways to put this into a paper of sorts. Did you have a journal in mind that would be suitable for this kind of stuff?

Also note that with the current tracking methods in place, we should be able to systematically measure the adoption of Meise's stable identifiers within the collections community over time. If we extend this scheme to index publications that are likely to reference specimens, we could measure the adoption of these identifiers in the published literature as well.

> I try to keep the google sheet up to date. We are using it ourselves to identify and harvest specimen data from collections which participate in our ID initiative (https://cetafidentifiers.biowikifarm.net/wiki/CETAF_Specimen_Catalogue).

@aguentsch great to hear that you are actively working on the CETAF stable identifier project.

jhpoelen commented 3 years ago

Oh, and @aguentsch - I was wondering whether you are planning to publish your valuable list of institutions and their associated stable identifiers as a machine readable, citable data product. This way, it would be easier for me to attribute and re-use your work.

qgroom commented 3 years ago

> Oh, and @aguentsch - I was wondering whether you are planning to publish your valuable list of institutions and their associated stable identifiers as a machine readable, citable data product. This way, it would be easier for me to attribute and re-use your work.

Good idea!

qgroom commented 3 years ago

> I wish a github issue would be sufficient to communicate our finding to a wider audience. However, if you'd like I can explore ways to put this into a paper of sorts. Did you have a journal in mind that would be suitable for this kind of stuff?

"If wishes were horses, beggars would ride" ;-)

I guess a short communication in the Biodiversity Data Journal would be an idea.

I think the key to getting people interested is both having a tool that people can use straight out of the box and having some cool results that other people will want to explore for their own datasets.

aguentsch commented 3 years ago

> Oh, and @aguentsch - I was wondering whether you are planning to publish your valuable list of institutions and their associated stable identifiers as a machine readable, citable data product. This way, it would be easier for me to attribute and re-use your work.

@jhpoelen Machine readability is already (somehow) covered via the Google API. The sheet is certainly not citable. Apart from needing a proper publication, we would have to revise and clean the spreadsheet first. In its present state, it is for "internal" use and would need too much explanation for a wider audience, I guess. I will raise the topic in the next SYNTHESYS NA4 meeting, if you agree (@qgroom).

qgroom commented 3 years ago

> I will raise the topic in the next SYNTHESYS NA4 meeting, if you agree (@qgroom).

Yes, good idea! It's not far away now.

jhpoelen commented 3 years ago

I was just revisiting the idea to leverage CETAF's stable identifier registry.

@qgroom @aguentsch please share any updates / publications that may be re-usable, specifically as they relate to occurrence id patterns associated with specific collections.

jhpoelen commented 3 years ago

Please feel free to re-open / comment on the issue to re-start this re-use initiative.