bio-guoda / preston

a biodiversity dataset tracker
MIT License

Add an index command #199

Open mielliott opened 2 years ago

mielliott commented 2 years ago

#196 suggests allowing preston to look up URLs associated with a hash. Doing this quickly requires building an index. I can imagine two ways this could work:

1) Feed nquads into preston index for indexing

$ echo <nquads> | preston index
# to index everything:
$ preston ls | preston index
# to just index aliases:
$ preston alias | preston index
# to index whatever you want:
$ preston ls | grep <whatever> | preston index

2) Feed hashes (or nquads containing hashes) into preston index and index their content

$ echo hash://sha256/abc | preston index
# or
$ echo '<blah> <...hasVersion> <hash://sha256/abc>' | preston index
# to index everything:
$ preston history | preston index
# to index the latest log:
$ preston head | preston index

Option 1 is simpler and makes it easier for the user to pick and choose what goes into the index.

Option 2 has the advantage of being able to record where statements came from, which is a big part of what "indexes" do, and also keeps the provenance chain going, which is great. e.g. in a Lucene index where "documents" represent RDF statements, we could record each statement's origin as a line in a provenance log (line:hash://sha256/abc!/L52)
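
To make the origin-recording idea in Option 2 concrete, here is a minimal Python sketch. It is not preston's implementation: it assumes provenance statements arrive as N-Quads lines whose first three terms are IRIs, it uses a plain dictionary where a Lucene index would sit, and `index_versions` and `log_id` are hypothetical names. Each hasVersion statement is stored with a line reference in the `line:hash://sha256/abc!/L52` style mentioned above.

```python
import re
from collections import defaultdict

HAS_VERSION = "http://purl.org/pav/hasVersion"
# Captures subject, predicate and object when all three are IRIs.
STATEMENT = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>")


def index_versions(lines, log_id):
    """Map each content hash to (url, origin) pairs found in a provenance log.

    log_id is the content id of the log being indexed (hypothetical parameter);
    origin points back at the line that made the statement.
    """
    index = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        match = STATEMENT.match(line)
        if not match:
            continue  # skip blank lines and statements with literals or blank nodes
        subject, predicate, obj = match.groups()
        if predicate == HAS_VERSION and obj.startswith("hash://"):
            index[obj].append((subject, f"line:{log_id}!/L{number}"))
    return dict(index)
```

Looking up a hash in the returned dictionary then gives both the URLs it was seen at and where in the provenance each claim was made.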

(There's also an option 3: do both option 1 and option 2)

Option 1 is tempting but I think I favor option 2. @jhpoelen thoughts? Or better ideas?

mielliott commented 2 years ago

Could also limit this to

$ preston index

to index all provenance, not allowing any flexibility. But this is significantly less fun

jhpoelen commented 2 years ago

I like the piping of things for sure!

And, I was wondering . . . building an index is just another transformation of some provenance logs . . . and has a specific result (the index), so did you have in mind being able to do things like:

preston history | preston index | preston process

where preston process takes the nquads generated by the indexing and adds them to the provenance log.

the index would generate some dataset containing a bunch of lucene index files (or insert your favorite indexing method).

Neat thing about this would be that a provenance log would be securely linked to a specific version of an index. With this, you can ask questions like:

Ok Google, can you find me an alias index derived from hash://sha256/abc123 ?

or

Hey Siri, can you ask Google to find me a taxonomic name index derived from hash://sha256/abc123 ?

Would be fun to say out loud right? And, no need to spin those CPUs unnecessarily to regenerate an index that has already been baked somewhere.
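
One way to picture the link between a provenance log and an index built from it is a single statement relating their content ids. The Python sketch below is only an illustration: using prov:wasDerivedFrom for this purpose is an assumption, not something preston is known to emit, and both function names are hypothetical.

```python
PROV_DERIVED = "http://www.w3.org/ns/prov#wasDerivedFrom"


def derivation_statement(index_id, provenance_id):
    """Return an N-Triples style statement saying the index derives from the log."""
    return f"<{index_id}> <{PROV_DERIVED}> <{provenance_id}> ."


def indexes_derived_from(statements, provenance_id):
    """List the content ids of indexes claimed to derive from provenance_id."""
    found = []
    for statement in statements:
        terms = statement.split()
        if (
            len(terms) >= 3
            and terms[1] == f"<{PROV_DERIVED}>"
            and terms[2] == f"<{provenance_id}>"
        ):
            found.append(terms[0].strip("<>"))
    return found
```

With statements like these in the log, finding an already-built index for a given provenance version becomes a lookup rather than a rebuild.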

mielliott commented 2 years ago

Piping is great! I figured preston index should automatically append the index generation info to the provenance log because it’s saving to the blobstore, not just pumping results to stdout

Would be fun to say out loud right?

“hash colon slash slash sha 2 5 6 slash alpha beta 1 2 3 …” sounds great. Everyone loves convenient voice commands

mielliott commented 2 years ago

So, to clarify, I imagined building the index in tmp/, then zipping everything and tossing it into data/ automatically. Then commands that make use of it (thinking of server commands like preston s --registry --index hash://sha256/abc) would unzip it back into tmp/ before using it. But we could keep it ready in an index/ folder so it’s better for one-off commands like alias
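
The zip-then-store flow described here could look roughly like the Python sketch below. It rests on assumptions: the archive is saved in a flat directory under its sha256 hex digest (preston's real data/ layout may differ), and `zip_directory`, `store_index` and `restore_index` are hypothetical helpers. Note that zip archives embed file timestamps, so re-zipping the same index later will usually produce a different hash.

```python
import hashlib
import io
import zipfile
from pathlib import Path


def zip_directory(index_dir):
    """Return the bytes of a zip archive holding every file under index_dir."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(Path(index_dir).rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(index_dir).as_posix())
    return buffer.getvalue()


def store_index(index_dir, data_dir):
    """Zip index_dir and save the archive in data_dir, named by its sha256 digest."""
    payload = zip_directory(index_dir)
    digest = hashlib.sha256(payload).hexdigest()
    target = Path(data_dir) / digest
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return f"hash://sha256/{digest}"


def restore_index(content_id, data_dir, tmp_dir):
    """Unzip a stored index back into tmp_dir before a command uses it."""
    digest = content_id.rsplit("/", 1)[-1]
    with zipfile.ZipFile(Path(data_dir) / digest) as archive:
        archive.extractall(tmp_dir)
```

The content id returned by `store_index` is what a command would then receive through an option such as --index.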

jhpoelen commented 2 years ago

Nice! I want it!

jhpoelen commented 11 months ago

using

#!/bin/bash
#
# index a patched version of provenance graph associated with an anchor
# into oxigraph  
#

preston ls\
 --anchor hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd\
 --remote https://linker.bio\
 | sed -E 's/(<)([a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12})([^ ]*)(>)/<urn:uuid:\2>/g'\
 | pv -l\
 | ./oxigraph_server_v0.3.22_x86_64_linux_gnu load --lenient --format nq --location preston-gib
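
For readers less used to sed, the patching step in the script above can be sketched in Python. The regular expression mirrors the sed expression: any IRI that starts with a bare UUID is rewritten to a urn:uuid IRI, dropping whatever follows the UUID up to the closing bracket. `patch_uuids` is a hypothetical name.

```python
import re

# "<" immediately followed by a lowercase hex UUID, then any non-space suffix, then ">".
BARE_UUID = re.compile(
    r"<([a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12})[^ ]*>"
)


def patch_uuids(line):
    """Rewrite bare-UUID IRIs in one N-Quads line to <urn:uuid:...> IRIs."""
    return BARE_UUID.sub(r"<urn:uuid:\1>", line)
```

IRIs that merely contain a UUID somewhere after their scheme, such as a GBIF dataset page URL, are left untouched because the match must start right after the opening bracket.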

I was able to load:

82159896 triples loaded in 1604s (51214 t/s)

with

$ du -d1 -h preston-gib/
33G preston-gib/

jhpoelen commented 11 months ago

and then, with another-query.sparql

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?o
WHERE {
  <http://collections.mnhn.fr/ipt/archive.do?r=mnhn-ar> <http://purl.org/pav/hasVersion> ?o .
} limit 10

yielding

$ time cat another-query.sparql | ./oxigraph_server_v0.3.22_x86_64_linux_gnu query --location preston-gib/ --results-format tsv
?o
<hash://sha256/08b0b7ff634cf02132f4dc3b41df5f6d3ca3c6d4beb1e6c80fd5245c817d1849>
<hash://sha256/402990fc84d667d2dc68cf760725e43c63f4039ddbc9e59076fa287462cc3273>
<hash://sha256/c5e69606006c807ab7b886df38fdb9709b913c070ecc36a8d0eeb08f6b61887b>
<hash://sha256/8ac3636cfe810ae0f029f08ddd0cfed7e1d6ee0c63f3d7fb0dc809bc06231d62>
<hash://sha256/91f2b225f75f53914d0528a87500249b1f43357f7a27f230ce2f0e1e0e9526e8>
<hash://sha256/c4b005122e24f9385bce87195e7b37ba1bd4790b33470bff2a7d4ad0831e2cc0>
<hash://sha256/d495d171d8c7098a764a49f6721169c1d4aee1a02d69b5ff02831962e7564404>
<hash://sha256/f2fe90d0c11f4990de095d97439a3856d786e487c49b2926f5734d10caf93174>
<hash://sha256/800855c2b73c3fcd5f63340a4e22c90568d13c326c4ee8b70c3f487b38e1bb97>
<hash://sha256/3ce3de0b4038274f2ce3670f69a8f63122706d6d68b987b2436bdb957bab43a5>

real    0m0.060s
user    0m0.045s
sys 0m0.017s

jhpoelen commented 11 months ago

@mielliott perhaps we have found our indexer in oxigraph . . .

jhpoelen commented 11 months ago

Looking up content associated with a GBIF dataset id https://gbif.org/dataset/4fa7b334-ce0d-4e88-aaae-2e0c138d049e

urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e

see also

SELECT ?archiveUrl ?seenAt ?contentId
WHERE {
  graph ?g1 {
    <urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e> <http://www.w3.org/ns/prov#hadMember> ?archiveUrl .
    ?archiveUrl <http://purl.org/dc/elements/1.1/format> "application/dwca" .
  }
  graph ?activity {
    ?activity <http://www.w3.org/ns/prov#used> ?archiveUrl .
    ?activity <http://www.w3.org/ns/prov#generatedAtTime> ?seenAt .
    ?contentId <http://www.w3.org/ns/prov#qualifiedGeneration> ?activity .
  }
} limit 10

yielding

?archiveUrl ?seenAt ?contentId
https://hosted-datasets.gbif.org/eBird/2020-eBird-dwca-1.0.zip "2021-11-02T16:25:42.407Z"^^http://www.w3.org/2001/XMLSchema#dateTime hash://sha256/b99c3f70f8571cd5bb1d6af84f1dccd5332736e8ac7a96f39e192fe9a7590d1c
https://hosted-datasets.gbif.org/eBird/2020-eBird-dwca-1.0.zip "2022-02-02T06:55:11.184Z"^^http://www.w3.org/2001/XMLSchema#dateTime hash://sha256/b99c3f70f8571cd5bb1d6af84f1dccd5332736e8ac7a96f39e192fe9a7590d1c
https://hosted-datasets.gbif.org/eBird/2020-eBird-dwca-1.0.zip "2021-12-02T09:24:32.779Z"^^http://www.w3.org/2001/XMLSchema#dateTime hash://sha256/b99c3f70f8571cd5bb1d6af84f1dccd5332736e8ac7a96f39e192fe9a7590d1c
https://hosted-datasets.gbif.org/eBird/2020-eBird-dwca-1.0.zip "2021-08-03T01:33:39.136Z"^^http://www.w3.org/2001/XMLSchema#dateTime hash://sha256/b99c3f70f8571cd5bb1d6af84f1dccd5332736e8ac7a96f39e192fe9a7590d1c
https://hosted-datasets.gbif.org/eBird/2020-eBird-dwca-1.0.zip "2021-07-02T12:10:05.604Z"^^http://www.w3.org/2001/XMLSchema#dateTime hash://sha256/b99c3f70f8571cd5bb1d6af84f1dccd5332736e8ac7a96f39e192fe9a7590d1c
https://hosted-datasets.gbif.org/eBird/2020-eBird-dwca-1.0.zip "2021-09-02T01:07:50.41Z"^^http://www.w3.org/2001/XMLSchema#dateTime hash://sha256/b99c3f70f8571cd5bb1d6af84f1dccd5332736e8ac7a96f39e192fe9a7590d1c
https://hosted-datasets.gbif.org/eBird/2020-eBird-dwca-1.0.zip "2021-10-01T23:02:46.359Z"^^http://www.w3.org/2001/XMLSchema#dateTime hash://sha256/b99c3f70f8571cd5bb1d6af84f1dccd5332736e8ac7a96f39e192fe9a7590d1c
https://hosted-datasets.gbif.org/eBird/2020-eBird-dwca-1.0.zip "2022-03-02T02:36:59.419Z"^^http://www.w3.org/2001/XMLSchema#dateTime hash://sha256/b99c3f70f8571cd5bb1d6af84f1dccd5332736e8ac7a96f39e192fe9a7590d1c
https://hosted-datasets.gbif.org/eBird/2020-eBird-dwca-1.0.zip "2022-01-02T04:40:54.857Z"^^http://www.w3.org/2001/XMLSchema#dateTime hash://sha256/b99c3f70f8571cd5bb1d6af84f1dccd5332736e8ac7a96f39e192fe9a7590d1c
https://hosted-datasets.gbif.org/eBird/2020-eBird-dwca-1.0.zip "2021-11-02T16:25:42.407Z"^^http://www.w3.org/2001/XMLSchema#dateTime hash://sha256/b99c3f70f8571cd5bb1d6af84f1dccd5332736e8ac7a96f39e192fe9a7590d1c

jhpoelen commented 11 months ago

here's a query for, and resulting list of, contentIds associated with our eBird friends. Note that this accounts for the introduction of activity namespaces in 2020 https://github.com/bio-guoda/preston/issues/41 .

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?contentId ?seenAt ?archiveUrl WHERE
{
  {
    SELECT ?contentId ?seenAt ?archiveUrl
    WHERE {
      graph ?g1 {
        <urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e> <http://www.w3.org/ns/prov#hadMember> ?archiveUrl .
        ?archiveUrl <http://purl.org/dc/elements/1.1/format> "application/dwca" .
      }
      graph ?activity {
        ?activity <http://www.w3.org/ns/prov#used> ?archiveUrl .
        ?activity <http://www.w3.org/ns/prov#generatedAtTime> ?seenAt .
        ?contentId <http://www.w3.org/ns/prov#qualifiedGeneration> ?activity .
      }
    }
  }
  UNION
  {
    SELECT ?contentId ?seenAt ?archiveUrl
    WHERE {
      <urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e> <http://www.w3.org/ns/prov#hadMember> ?archiveUrl .
      ?archiveUrl <http://purl.org/dc/elements/1.1/format> "application/dwca" .
      ?activity <http://www.w3.org/ns/prov#used> ?archiveUrl .
      ?activity <http://www.w3.org/ns/prov#generatedAtTime> ?seenAt .
      ?contentId <http://www.w3.org/ns/prov#qualifiedGeneration> ?activity .
    }
  }
} ORDER BY ?seenAt

with

cat ebird.sparql\
 | ./oxigraph_server_v0.3.22_x86_64_linux_gnu query --results-format tsv --location preston-gib/\
 | tee ebird.tsv

with first 10 yielding

?contentId ?seenAt ?archiveUrl
hash://sha256/ec3ff57cb48d5c41b77b5d1075738b40f598a900e8be56e7645e5a24013dffc4 "2019-12-02T09:51:47.923Z"^^http://www.w3.org/2001/XMLSchema#dateTime http://ebirddata.ornith.cornell.edu/downloads/gbiff/dwca-1.0.zip
hash://sha256/ee7134043e02f845643b6a655e1c3ffe6d406d0002f8089c5399f0df418b80d6 "2019-12-02T10:00:10.182Z"^^http://www.w3.org/2001/XMLSchema#dateTime http://download.gbif.org/2019/03/2019-eBird-dwca-1.0.zip
https://deeplinker.bio/.well-known/genid/5a12240f-58fe-37ab-be2f-deeca35653c0 "2020-01-01T22:15:58.082Z"^^http://www.w3.org/2001/XMLSchema#dateTime http://ebirddata.ornith.cornell.edu/downloads/gbiff/dwca-1.0.zip
hash://sha256/ee7134043e02f845643b6a655e1c3ffe6d406d0002f8089c5399f0df418b80d6 "2020-01-01T22:24:47.774Z"^^http://www.w3.org/2001/XMLSchema#dateTime http://download.gbif.org/2019/03/2019-eBird-dwca-1.0.zip
https://deeplinker.bio/.well-known/genid/1d102839-ace4-3379-8ff3-2204ebfcae69 "2020-02-02T10:02:32.966Z"^^http://www.w3.org/2001/XMLSchema#dateTime http://ebirddata.ornith.cornell.edu/downloads/gbiff/dwca-1.0.zip
hash://sha256/ee7134043e02f845643b6a655e1c3ffe6d406d0002f8089c5399f0df418b80d6 "2020-02-02T10:09:05.545Z"^^http://www.w3.org/2001/XMLSchema#dateTime http://download.gbif.org/2019/03/2019-eBird-dwca-1.0.zip
https://deeplinker.bio/.well-known/genid/4373758e-5de6-3ebc-bce4-726a03dc8f12 "2020-02-11T11:19:58.521Z"^^http://www.w3.org/2001/XMLSchema#dateTime http://ebirddata.ornith.cornell.edu/downloads/gbiff/dwca-1.0.zip
hash://sha256/ee7134043e02f845643b6a655e1c3ffe6d406d0002f8089c5399f0df418b80d6 "2020-02-11T11:28:25.313Z"^^http://www.w3.org/2001/XMLSchema#dateTime http://download.gbif.org/2019/03/2019-eBird-dwca-1.0.zip
https://deeplinker.bio/.well-known/genid/302b40fb-7610-3402-a711-ccba64635489 "2020-03-02T16:20:05.905Z"^^http://www.w3.org/2001/XMLSchema#dateTime http://ebirddata.ornith.cornell.edu/downloads/gbiff/dwca-1.0.zip

and last 10,

?contentId ?seenAt ?archiveUrl
hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d "2023-10-01T17:02:42.558Z"^^http://www.w3.org/2001/XMLSchema#dateTime https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d "2023-10-01T17:02:42.558Z"^^http://www.w3.org/2001/XMLSchema#dateTime https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d "2023-11-02T04:13:34.257Z"^^http://www.w3.org/2001/XMLSchema#dateTime https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d "2023-11-02T04:13:34.257Z"^^http://www.w3.org/2001/XMLSchema#dateTime https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d "2023-11-02T04:13:34.257Z"^^http://www.w3.org/2001/XMLSchema#dateTime https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d "2023-11-02T04:13:34.257Z"^^http://www.w3.org/2001/XMLSchema#dateTime https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d "2023-12-02T16:05:25.261Z"^^http://www.w3.org/2001/XMLSchema#dateTime https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d "2023-12-02T16:05:25.261Z"^^http://www.w3.org/2001/XMLSchema#dateTime https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d "2023-12-02T16:05:25.261Z"^^http://www.w3.org/2001/XMLSchema#dateTime https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d "2023-12-02T16:05:25.261Z"^^http://www.w3.org/2001/XMLSchema#dateTime https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
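
Since the result rows repeat, a small post-processing step can condense them. The Python sketch below is an illustration only: it assumes the raw TSV is tab-separated, that IRIs may be wrapped in angle brackets, and that timestamps are typed literals of the form "..."^^<...dateTime>; `parse_row` and `summarize` are hypothetical names.

```python
from datetime import datetime


def parse_row(line):
    """Split one TSV row into (contentId, timestamp string, archiveUrl)."""
    content_id, seen_at, archive_url = line.rstrip("\n").split("\t")
    timestamp = seen_at.split("^^", 1)[0].strip('"')
    return content_id.strip("<>"), timestamp, archive_url.strip("<>")


def summarize(lines):
    """Map each contentId to (first seen, last seen, number of distinct sightings)."""
    sightings = {}
    for line in lines:
        if not line.strip() or line.startswith("?"):
            continue  # skip blank lines and the header row of variable names
        content_id, timestamp, _ = parse_row(line)
        seen = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
        sightings.setdefault(content_id, set()).add(seen)
    return {
        cid: (min(times), max(times), len(times))
        for cid, times in sightings.items()
    }
```

Run over the rows of ebird.tsv, this would give one entry per distinct archive version with the span of time over which it was observed.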

attached ebird.tsv.txt

mielliott commented 11 months ago

@mielliott perhaps we have found our indexer in oxigraph . . .

At long last! Hopefully with no fun surprises like Jena's demanding urn:uuid prefixes, etc.

I should probably mention that I did implement the indexing functionality described in https://github.com/bio-guoda/preston/issues/199#issuecomment-1301489561 in the registry branch, using Lucene. It never made its way into main though. A big limitation with just using Lucene was the lack of a query language like SPARQL, so instead of writing a query.sparql to search the index, I could only do simple string matching.

Do you plan on packaging oxigraph with preston, or keeping it separate as in your examples?

jhpoelen commented 11 months ago

Do you plan on packaging oxigraph with preston, or keeping it separate as in your examples?

@mielliott great question! Not sure yet . . . am almost tempted to treat the oxigraph binaries as assets and add them to the content graph, along with functionality to execute workflows defined in that graph. But other than that, I do not see a compelling reason to merge preston with oxigraph and make it available in a single cli tool. But . . . we did add a preston s command to start a web interface, so why not allow something like preston sparql --anchor hash://sha256/.... to start a sparql endpoint for a specific biodiversity data graph . . .

Any ideas? What do you think, @mielliott ?

jhpoelen commented 11 months ago

I've added some configuration to query the indexed provenance graph of GIB (GBIF, iDigBio, BioCase). The syntax is a bit weird, but grlc was quite helpful to get a usable API in front of the sparql endpoint.
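
The API calls in this comment share one URL shape: a query name (uuid, doi, url or hash) used both as the CSV file name and as the parameter name, plus a percent-encoded endpoint parameter. A minimal Python sketch of building such a URL follows; `grlc_url` is a hypothetical helper, and the base URL and endpoint are copied from the curl commands in this comment.

```python
from urllib.parse import urlencode

GRLC_BASE = "https://grlc.io/api-git/bio-guoda/preston-service"
ENDPOINT = "https://lod.globalbioticinteractions.org/query"


def grlc_url(kind, value):
    """Build a CSV query URL; kind is one of 'uuid', 'doi', 'url' or 'hash'."""
    # urlencode percent-encodes ':' and '/' in the values by default.
    query = urlencode({kind: value, "endpoint": ENDPOINT})
    return f"{GRLC_BASE}/{kind}.csv?{query}"
```

Building the URL in code avoids hand-encoding identifiers that themselves contain colons and slashes.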

Example query by UUID

Using GBIF's uuid for the eBird dataset (most of GBIF's volume, https://www.gbif.org/dataset/4fa7b334-ce0d-4e88-aaae-2e0c138d049e reformatted to urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e) yields the 9 most recent eBird samples as retrieved from their reported origin.

curl "https://grlc.io/api-git/bio-guoda/preston-service/uuid.csv?uuid=urn%3Auuid%3A4fa7b334-ce0d-4e88-aaae-2e0c138d049e&endpoint=https%3A%2F%2Flod.globalbioticinteractions.org%2Fquery"\
 | head\
 | mlr --icsv --oxtab cat
doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl   https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt       2023-12-02T16:05:25.261Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl   https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt       2023-11-02T04:13:34.257Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl   https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt       2023-10-01T17:02:42.558Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl   https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt       2023-09-02T12:54:17.912Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl   https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt       2023-08-02T10:47:52.277Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl   https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt       2023-07-02T21:02:24.922Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl   https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt       2023-06-01T19:32:39.368Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl   https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt       2023-05-01T22:45:13.589Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl   https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt       2023-04-02T14:43:59.511Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

Example query by DOI

Using GBIF's assigned DOI https://doi.org/10.15468/aomfnb the following can be retrieved:

curl 'https://grlc.io/api-git/bio-guoda/preston-service/doi.csv?doi=https%3A%2F%2Fdoi.org%2F10.15468%2Faomfnb&endpoint=https%3A%2F%2Flod.globalbioticinteractions.org%2Fquery'\
 | head\
 | mlr --icsv --oxtab cat
doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl   https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt       2023-12-02T16:05:25.261Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl   https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt       2023-11-02T04:13:34.257Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl   https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt       2023-10-01T17:02:42.558Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl   https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt       2023-09-02T12:54:17.912Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl   https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt       2023-08-02T10:47:52.277Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl   https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt       2023-07-02T21:02:24.922Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl   https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt       2023-06-01T19:32:39.368Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl   https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt       2023-05-01T22:45:13.589Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl   https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt       2023-04-02T14:43:59.511Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

Query by URL

Query activity by the known location of a Darwin Core archive: https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip

curl 'https://grlc.io/api-git/bio-guoda/preston-service/url.csv?url=https%3A%2F%2Fhosted-datasets.gbif.org%2FeBird%2F2022-eBird-dwca-1.0.zip&endpoint=https%3A%2F%2Flod.globalbioticinteractions.org%2Fquery'\
 | head\
 | mlr --icsv --oxtab cat
doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl   https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt       2023-12-02T16:05:25.261Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl   https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt       2023-11-02T04:13:34.257Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl   https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt       2023-10-01T17:02:42.558Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl   https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt       2023-09-02T12:54:17.912Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

Query by ContentId (aka hash)

Querying for a known dwc archive hash hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d

curl 'https://grlc.io/api-git/bio-guoda/preston-service/hash.csv?hash=hash%3A%2F%2Fsha256%2F1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d&endpoint=https%3A%2F%2Flod.globalbioticinteractions.org%2Fquery'\
 | head\
 | mlr --icsv --oxtab cat

yields

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl   https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt       2023-12-02T16:05:25.261Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl   https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt       2023-11-02T04:13:34.257Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl   https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt       2023-10-01T17:02:42.558Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

doi          https://doi.org/10.15468/aomfnb
uuid         urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId    hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl   https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt       2023-09-02T12:54:17.912Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd

jhpoelen commented 11 months ago

After some tinkering, I ended up implementing a redirection service.

The idea is that the service uses a content registry of known provenance, then redirects resolved content ids to a repository. Currently, the resolver resolves identifiers to their associated darwin core archives.

You can resolve by:

  1. GBIF dataset uuid
  2. iDigBio recordset uuid
  3. doi
  4. content id
  5. url

For identifiers that are not uniquely tied to content (e.g., uuid, doi, url), the resolver picks the most recent darwin core archive associated with the identifier. So, this implements a kind of a wayback machine for darwin core archives registered in the GBIF/iDigBio universe. For now, you can find provenance information for the redirect in the 302 http redirect response headers.
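
The "pick the most recent" rule can be sketched in a few lines of Python. This is an illustration, not the resolver's code: it assumes candidate records are dictionaries with contentId and seenAt keys, as in the query results earlier in this thread, and `most_recent` is a hypothetical name.

```python
from datetime import datetime, timezone


def parse_seen_at(value):
    """Parse timestamps such as 2023-12-02T16:05:25.261Z into aware datetimes."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def most_recent(records):
    """Return the record with the latest seenAt value."""
    return max(records, key=lambda record: parse_seen_at(record["seenAt"]))
```

Parsing the timestamps, rather than comparing them as strings, keeps the ordering correct when the number of fractional-second digits varies.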

Example 1. resolve by eBird dataset uuid

curl -I https://linker.bio/urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e

yields

HTTP/1.1 302 Found
[...]
Location: https://linker.bio/hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
ETag: hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
Content-Location: https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
X-UUID: urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
X-DOI: https://doi.org/10.15468/aomfnb
X-PROV: hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
X-PROV-wasInfluencedBy: https://doi.org/10.15468/aomfnb , urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
X-PROV-wasGeneratedBy: urn:uuid:77f3faf7-acd2-4f14-9c0e-4e04ef5b63c7
X-PROV-generatedAtTime: 2023-12-02T16:05:25.261Z
X-PAV-hasVersion: hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
[...]

Where, X-PROV contains a reference to the specific corpus version . This version defines all the content and their known

X-PROV-wasGeneratedBy gives the uuid of the activity in the corpus version that found the requested content.

X-PROV-generatedAtTime gives the time at which the activity that found the requested content was started.

X-PAV-hasVersion contains the content id of the content that is redirected to.

X-PROV-wasInfluencedBy contains the entities that are associated with the redirected content. In this case, these are the GBIF eBird dataset UUID and DOI that are registered with the origin url of the content.

Content-Location is the original resource location.

Location is the location being redirected to (e.g., https://linker.bio/hash://sha256/....). The client can verify authenticity of the content by inspecting headers, or, perhaps better, the provenance graph itself.
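
A client-side check along these lines might look like the Python sketch below. It is a sketch under assumptions: the header text is whatever curl -I printed, the consistency rule (ETag, X-PAV-hasVersion and the Location target all naming the same content id) is inferred from the example response above, and all three function names are hypothetical. The last function performs the stronger check of hashing the downloaded bytes.

```python
import hashlib


def parse_headers(text):
    """Turn raw response header text into a dict with lowercase header names."""
    headers = {}
    for line in text.splitlines():
        name, separator, value = line.partition(":")
        if separator:  # the status line and "[...]" markers have no colon
            headers[name.strip().lower()] = value.strip()
    return headers


def headers_consistent(headers):
    """True when ETag, X-PAV-hasVersion and Location agree on one content id."""
    content_id = headers.get("etag", "")
    return (
        content_id.startswith("hash://sha256/")
        and headers.get("x-pav-hasversion") == content_id
        and headers.get("location", "").endswith(content_id)
    )


def content_matches(data, content_id):
    """True when the sha256 of the downloaded bytes equals the claimed content id."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest() == content_id
```

Header agreement only shows that the response is self-consistent; hashing the retrieved bytes is what ties the content to its id.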

Example 2. resolve by eBird dataset DOI

curl -I https://linker.bio/10.15468/aomfnb

resulting in the same redirection, as expected.

Example 3. resolve by eBird dataset original resource location

curl -I https://linker.bio/https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip

resulting in the same redirection as in examples 1 and 2, as expected.
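
The three examples suggest a simple mapping from an identifier to its resolver URL, sketched here in Python. The rules are inferred from the examples in this thread rather than taken from the service's code: urn:uuid identifiers and plain URLs are appended as-is, DOI URLs lose their https://doi.org/ prefix, and GBIF dataset page URLs are first rewritten to urn:uuid form. `redirect_url` is a hypothetical name.

```python
import re

RESOLVER = "https://linker.bio/"
DOI_PREFIX = "https://doi.org/"
# GBIF dataset page URL, with or without the "www." host prefix.
GBIF_DATASET = re.compile(r"^https://(?:www\.)?gbif\.org/dataset/([a-f0-9-]{36})$")


def redirect_url(identifier):
    """Return the linker.bio URL that should redirect to the identifier's content."""
    gbif = GBIF_DATASET.match(identifier)
    if gbif:
        identifier = "urn:uuid:" + gbif.group(1)
    elif identifier.startswith(DOI_PREFIX):
        identifier = identifier[len(DOI_PREFIX):]
    return RESOLVER + identifier
```

Content ids pass through unchanged, which matches the form of the Location header in Example 1.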

The index is built using oxigraph (see https://github.com/bio-guoda/preston-service/blob/9466c7ac601902b28ff64e7ac83ed6a9a74624a5/query/index-provenance-graph.sh ) and results in a ~30GiB index. This index is then run as a read-only service using https://github.com/bio-guoda/preston-service/blob/9466c7ac601902b28ff64e7ac83ed6a9a74624a5/systemd/system/preston-registry.service .

The redirect service is configured to query the index and redirect to a known content repository via the configuration defined at https://github.com/bio-guoda/preston-service/blob/main/systemd/system/preston-redirect.service .

With this, we have a service that uses a well-defined relation between identifiers and their associated content. We no longer have to rely on DNS or dynamic databases, because our redirection is anchored in a specific provenance graph (in this case, the provenance graph with version hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd).

@seltmann @mielliott @cboettig - Can you feel the excitement? Curious to hear your thoughts.

You should be able to resolve any url/uuid/doi associated with darwin core archives registered with idigbio and gbif. At least, as recorded monthly since late 2018 / early 2019.

jhpoelen commented 11 months ago

For a UCSB example . . . I am noticing how there are various ids / locations associated with a specific versioned piece of content - the DwC-A containing the digital collection records and their associated metadata.

| id | redirect | content id |
| --- | --- | --- |
| https://www.gbif.org/dataset/d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0 | https://linker.bio/urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0 | hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a |
| https://doi.org/10.15468/w6hvhv | https://linker.bio/10.15468/w6hvhv | hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a |
| https://ecdysis.org/content/dwca/UCSB-IZC_DwC-A.zip | https://linker.bio/https://ecdysis.org/content/dwca/UCSB-IZC_DwC-A.zip | hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a |
| https://www.idigbio.org/portal/recordsets/65007e62-740c-4302-ba20-260fe68da291 | https://linker.bio/urn:uuid:65007e62-740c-4302-ba20-260fe68da291 | hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a |

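
The id-to-redirect mapping in the table can be sketched as a small function. This is an illustration only: the landing-page patterns are read off the four rows above, the function name is made up, and other registries or URL shapes are not covered.

```python
import re

LINKER = "https://linker.bio/"
UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"

# GBIF dataset pages and iDigBio recordset pages end in a UUID.
UUID_PAGE = re.compile(
    r"^https://www\.(?:gbif\.org/dataset|idigbio\.org/portal/recordsets)/(" + UUID + r")$"
)
# DOI landing URLs carry the DOI after the doi.org host.
DOI_PAGE = re.compile(r"^https://doi\.org/(10\..+)$")


def redirect_url(identifier: str) -> str:
    """Turn a known id or location into the linker.bio redirect URL from the table."""
    match = UUID_PAGE.match(identifier)
    if match:
        return LINKER + "urn:uuid:" + match.group(1)
    match = DOI_PAGE.match(identifier)
    if match:
        return LINKER + match.group(1)
    # Anything else (e.g., a DwC-A endpoint URL) is appended as-is.
    return LINKER + identifier
```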
jhpoelen commented 11 months ago

So, to cite an exact version of a dataset, you can now say something like:

Cheadle Center for Biodiversity and Ecological Restoration (2023). University of California Santa Barbara Invertebrate Zoology Collection. Occurrence dataset https://doi.org/10.15468/w6hvhv as derived from the DwC-A defined in hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd as gathered through activity urn:uuid:603cb45b-c23e-4d3e-a0bf-604d8537296d at 2023-12-03T06:16:07.462Z

Quite the mouthful, and precise.

jhpoelen commented 11 months ago

Now, with added redirect badges for embedding on web pages . . .

with patterns being -

https://linker.bio/badge/[some known url / uuid / doi]

Example:

https://linker.bio/badge/urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0

which renders to a badge image,

which would redirect to the associated content via

https://linker.bio/urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0
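
For embedding on a web page, the badge and redirect patterns can be combined into a linked image. A minimal sketch, assuming plain HTML output; the function names and the default alt text are made up for illustration.

```python
from html import escape

LINKER = "https://linker.bio/"


def badge_url(identifier: str) -> str:
    """Badge pattern: https://linker.bio/badge/[some known url / uuid / doi]."""
    return LINKER + "badge/" + identifier


def badge_html(identifier: str, alt: str = "linker.bio badge") -> str:
    """Return an <a><img></a> snippet: the badge image linking to the redirect URL."""
    href = escape(LINKER + identifier, quote=True)
    src = escape(badge_url(identifier), quote=True)
    return '<a href="{}"><img src="{}" alt="{}"></a>'.format(
        href, src, escape(alt, quote=True)
    )
```

Calling badge_html("urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0") yields a snippet whose image source is the example badge URL and whose link target is the redirect URL shown above.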

| id | redirect url | redirect badge | content id |
| --- | --- | --- | --- |
| https://www.gbif.org/dataset/d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0 | https://linker.bio/urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0 | (badge image) | hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a |
| https://doi.org/10.15468/w6hvhv | https://linker.bio/10.15468/w6hvhv | (badge image) | hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a |
| https://ecdysis.org/content/dwca/UCSB-IZC_DwC-A.zip | https://linker.bio/https://ecdysis.org/content/dwca/UCSB-IZC_DwC-A.zip | (badge image) | hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a |
| https://www.idigbio.org/portal/recordsets/65007e62-740c-4302-ba20-260fe68da291 | https://linker.bio/urn:uuid:65007e62-740c-4302-ba20-260fe68da291 | (badge image) | hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a |

jhpoelen commented 11 months ago

@seltmann you can check whether your UCSB collection is tracked by Preston by embedding DwC-A and EML download buttons on your respective pages, using the GBIF dataset DOI, DwC-A endpoint URLs, GBIF dataset UUID, or iDigBio recordset UUIDs -

e.g., https://www.gbif.org/dataset/d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0

can be used as

urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0

to get the most recent archived/tracked related DwC-A content using

https://linker.bio/urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0

with badge uri

https://linker.bio/badge/urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0

image

jhpoelen commented 11 months ago

@seltmann please note that I've redesigned the badge to be a FAIR assessment badge.

So, without further ado:

drum roll. . . .

Congratulations to @seltmann and colleagues: UCSB-IZC is FAIR!

Accessed from https://linker.bio/#use-case-4-assessing-fairness-of-biodiversity-data on 2024-01-03 -

image

jhpoelen commented 11 months ago

See also https://discourse.gbif.org/t/assessing-fairness-of-biodiversity-data-through-badges-and-download-buttons/4246

mielliott commented 11 months ago

Amazing stuff @jhpoelen, very fun! I noticed the badges default to calling stuff a DwC-A if the content type is unknown or the content doesn't exist:

https://github.com/bio-guoda/preston/blob/8a912a4b1152f7d8c381cb322c4886d64ee3b687/preston-serve/src/main/java/bio/guoda/preston/server/RedirectingServlet.java#L81-L89

And this kinda confused me when toying around with the new badge feature, asking for badges of silly things like RSS feeds or fake IDs. I can see this causing some confusion if for example something goes wrong for someone's EML/etc. badge, causing linker.bio to instead make a "DwC-A" badge. May I suggest a more unassuming badge when content type can't be determined? A more general "Content", "Error", or just blank? Or maybe there's an "unknown" MimeType or similar.

jhpoelen commented 11 months ago

@mielliott thanks for sharing your thoughts. I can see how a badge with "DwC-A unknown" can be confusing, especially when plugging in any kind of stuff like https://linker.bio/badge/10.12/345 .

image

What the badge is trying to say is: I couldn't find any trace of a DwC archive associated with "10.12/345". So even if some content is associated with "10.12/345" but it wasn't intended to be DwC, it'll still show the "DwC-A unknown" badge.

So requesting:

https://linker.bio/badge/10.12/345

is equivalent to asking:

https://linker.bio/badge/10.12/345?type=application/dwca

With this information, would you have any suggestions on how to make the "DwC-A unknown" badge less confusing and more informative?
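
The equivalence above can be sketched as a URL builder with an optional type parameter. The function names are hypothetical; only the `?type=` parameter and the `application/dwca` default come from this thread.

```python
from typing import Optional
from urllib.parse import quote

BADGE = "https://linker.bio/badge/"
# Per the comment above, a request without ?type= behaves like this type.
DEFAULT_TYPE = "application/dwca"


def badge_url(identifier: str, content_type: Optional[str] = None) -> str:
    """Build a badge URL, adding ?type=... only when a content type is given."""
    url = BADGE + identifier
    if content_type is not None:
        # Keep "/" unescaped so the result matches the form used in this thread.
        url += "?type=" + quote(content_type, safe="/")
    return url


def effective_type(content_type: Optional[str] = None) -> str:
    """Return the content type the badge is assessed against."""
    return DEFAULT_TYPE if content_type is None else content_type
```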

mielliott commented 10 months ago

How about something like https://linker.bio/badge/10.12/345?type=cats ?

image

In this case I'd argue that being less informative is less confusing.

PS - I really like the feature to specify the content type 🙌