Open mielliott opened 2 years ago
Could also limit this to
$ preston index
to index all provenance, not allowing any flexibility. But this is significantly less fun
I like the piping of things for sure!
And, I was wondering . . . building an index is just another transformation of some provenance logs . . . and has a specific result (the index), so I was wondering whether you had in mind to be able to do things like:
preston history | preston index | preston process
where preston process takes the nquads generated by the indexing and adds them to the provenance log.
The indexing would generate some dataset containing a bunch of Lucene index files (or insert your favorite indexing method).
Neat thing about this would be that a provenance log would be securely linked to a specific version of an index. With this, you can ask questions like:
Ok Google, can you find me an alias index derived from hash://sha256/abc123 ?
or
Hey Siri, can you ask Google to find me a taxonomic name index derived from hash://sha256/abc123 ?
Would be fun to say out loud right? And, no need to spin those CPUs unnecessarily to regenerate an index that has already been baked somewhere.
Piping is great! I figured preston index should automatically append the index generation info to the provenance log because it's saving to the blobstore, not just pumping results to stdout.
> Would be fun to say out loud right?
“hash colon slash slash sha 2 5 6 slash alpha beta 1 2 3 …” sounds great. Everyone loves convenient voice commands
So, to clarify, I imagined building the index in temp/, then zipping everything and tossing it into data/ automatically. Then commands that make use of it (thinking of server commands like preston s --registry --index hash://sha256/abc) would unzip it back into tmp/ before using it. But we could keep it ready in an index/ folder so it's better for one-off commands like alias.
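The build-then-archive flow described above could look roughly like the sketch below. This is not Preston code: the function names, the flat data/<sha256 hex> layout, and the choice of a zip archive are assumptions made purely for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def archive_index(index_dir: Path, data_dir: Path) -> str:
    """Zip index_dir, store the zip in data_dir under its sha256, return a content id."""
    with tempfile.TemporaryDirectory() as tmp:
        base_name = str(Path(tmp) / "index")
        zip_path = Path(shutil.make_archive(base_name, "zip", root_dir=str(index_dir)))
        # Hash in chunks so that large indexes do not have to fit in memory.
        sha256 = hashlib.sha256()
        with open(zip_path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                sha256.update(chunk)
        digest = sha256.hexdigest()
        data_dir.mkdir(parents=True, exist_ok=True)
        # Hypothetical flat layout; a real blobstore may shard by hash prefix.
        shutil.copyfile(zip_path, data_dir / digest)
    return f"hash://sha256/{digest}"


def restore_index(content_id: str, data_dir: Path, target_dir: Path) -> None:
    """Unzip a previously archived index into target_dir before using it."""
    digest = content_id.rsplit("/", 1)[-1]
    shutil.unpack_archive(str(data_dir / digest), str(target_dir), format="zip")
```

Returning the content id of the archive is what would let the index generation be recorded in the provenance log and looked up again later instead of being rebuilt.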
Nice! I want it!
using
#!/bin/bash
#
# index a patched version of provenance graph associated with an anchor
# into oxigraph
#
preston ls \
  --anchor hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd \
  --remote https://linker.bio \
  | sed -E 's/(<)([a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12})([^ ]*)(>)/<urn:uuid:\2>/g' \
  | pv -l \
  | ./oxigraph_server_v0.3.22_x86_64_linux_gnu load --lenient --format nq --location preston-gib
I was able to load:
82159896 triples loaded in 1604s (51214 t/s)
with
$ du -d1 -h preston-gib/
33G preston-gib/
and then, with another-query.sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?o
WHERE {
<http://collections.mnhn.fr/ipt/archive.do?r=mnhn-ar> <http://purl.org/pav/hasVersion> ?o .
} limit 10
yielding
$ time cat another-query.sparql | ./oxigraph_server_v0.3.22_x86_64_linux_gnu query --location preston-gib/ --results-format tsv
?o
<hash://sha256/08b0b7ff634cf02132f4dc3b41df5f6d3ca3c6d4beb1e6c80fd5245c817d1849>
<hash://sha256/402990fc84d667d2dc68cf760725e43c63f4039ddbc9e59076fa287462cc3273>
<hash://sha256/c5e69606006c807ab7b886df38fdb9709b913c070ecc36a8d0eeb08f6b61887b>
<hash://sha256/8ac3636cfe810ae0f029f08ddd0cfed7e1d6ee0c63f3d7fb0dc809bc06231d62>
<hash://sha256/91f2b225f75f53914d0528a87500249b1f43357f7a27f230ce2f0e1e0e9526e8>
<hash://sha256/c4b005122e24f9385bce87195e7b37ba1bd4790b33470bff2a7d4ad0831e2cc0>
<hash://sha256/d495d171d8c7098a764a49f6721169c1d4aee1a02d69b5ff02831962e7564404>
<hash://sha256/f2fe90d0c11f4990de095d97439a3856d786e487c49b2926f5734d10caf93174>
<hash://sha256/800855c2b73c3fcd5f63340a4e22c90568d13c326c4ee8b70c3f487b38e1bb97>
<hash://sha256/3ce3de0b4038274f2ce3670f69a8f63122706d6d68b987b2436bdb957bab43a5>
real 0m0.060s
user 0m0.045s
sys 0m0.017s
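For readers wondering what the sed patch in the loading script does to a single line: it rewrites IRIs that start with a bare UUID into urn:uuid: IRIs before oxigraph loads them. Below is a minimal Python sketch of the same substitution. The function name and the sample statement are made up for illustration, and like the original expression it only matches lowercase UUIDs.

```python
import re

# Same pattern as the sed expression: "<", a lowercase UUID, an optional
# suffix without spaces, then ">".
BARE_UUID_IRI = re.compile(
    r"(<)([a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12})([^ ]*)(>)"
)


def patch_line(line: str) -> str:
    """Rewrite <uuid...> to <urn:uuid:uuid>, dropping any suffix after the UUID."""
    return BARE_UUID_IRI.sub(r"<urn:uuid:\2>", line)


# Illustrative statement, not taken from the actual provenance log.
sample = (
    "<4fa7b334-ce0d-4e88-aaae-2e0c138d049e> "
    "<http://www.w3.org/ns/prov#hadMember> "
    "<https://example.org/archive.zip> ."
)
print(patch_line(sample))
```

IRIs that already carry the urn:uuid: prefix are left alone, because the pattern requires the UUID to follow the opening bracket directly.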
@mielliott perhaps we have found our indexer in oxigraph . . .
Looking up content associated with a GBIF dataset id https://gbif.org/dataset/4fa7b334-ce0d-4e88-aaae-2e0c138d049e
urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
see also
SELECT ?archiveUrl ?seenAt ?contentId
WHERE {
graph ?g1 {
<urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e> <http://www.w3.org/ns/prov#hadMember> ?archiveUrl .
?archiveUrl <http://purl.org/dc/elements/1.1/format> "application/dwca" .
}
graph ?activity {
?activity <http://www.w3.org/ns/prov#used> ?archiveUrl .
?activity <http://www.w3.org/ns/prov#generatedAtTime> ?seenAt .
?contentId <http://www.w3.org/ns/prov#qualifiedGeneration> ?activity .
}
} limit 10
yielding
here's a query for, and resulting list of, contentIds associated with our eBird friends. Note that this accounts for the introduction of activity namespaces in 2020 https://github.com/bio-guoda/preston/issues/41 .
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?contentId ?seenAt ?archiveUrl WHERE
{
{
SELECT ?contentId ?seenAt ?archiveUrl
WHERE {
graph ?g1 {
<urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e> <http://www.w3.org/ns/prov#hadMember> ?archiveUrl .
?archiveUrl <http://purl.org/dc/elements/1.1/format> "application/dwca" .
}
graph ?activity {
?activity <http://www.w3.org/ns/prov#used> ?archiveUrl .
?activity <http://www.w3.org/ns/prov#generatedAtTime> ?seenAt .
?contentId <http://www.w3.org/ns/prov#qualifiedGeneration> ?activity .
}
}
}
UNION
{
SELECT ?contentId ?seenAt ?archiveUrl
WHERE {
<urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e> <http://www.w3.org/ns/prov#hadMember> ?archiveUrl .
?archiveUrl <http://purl.org/dc/elements/1.1/format> "application/dwca" .
?activity <http://www.w3.org/ns/prov#used> ?archiveUrl .
?activity <http://www.w3.org/ns/prov#generatedAtTime> ?seenAt .
?contentId <http://www.w3.org/ns/prov#qualifiedGeneration> ?activity .
}
}
} ORDER BY ?seenAt
with
cat ebird.sparql\
| ./oxigraph_server_v0.3.22_x86_64_linux_gnu query --results-format tsv --location preston-gib/\
| tee ebird.tsv
with first 10 yielding
and last 10,
attached ebird.tsv.txt
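To work with a result file like ebird.tsv downstream, the SPARQL TSV format needs a little unwrapping: the header carries ?variable names, IRIs are wrapped in angle brackets, and literals are quoted and may carry a datatype. The sketch below shows one way to read it. The sample row is assembled from values that appear elsewhere in this thread, and the xsd:dateTime datatype on seenAt is an assumption; escape sequences inside literals are not handled.

```python
import csv
import io

# Illustrative input; the datatype suffix on the timestamp is an assumption.
SAMPLE = (
    "?contentId\t?seenAt\t?archiveUrl\n"
    "<hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d>\t"
    '"2023-12-02T16:05:25.261Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>\t'
    "<https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip>\n"
)


def term_value(term: str) -> str:
    """Strip the Turtle-style wrapping from an IRI or a (typed) literal."""
    if term.startswith("<") and term.endswith(">"):
        return term[1:-1]
    if term.startswith('"'):
        # Keep the lexical form, drop any ^^datatype or @lang suffix.
        return term[1:term.rindex('"')]
    return term


def read_tsv(text: str) -> list:
    """Parse SPARQL TSV results into a list of dicts keyed by variable name."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [name.lstrip("?") for name in next(reader)]
    return [dict(zip(header, map(term_value, row))) for row in reader if row]


for row in read_tsv(SAMPLE):
    print(row["seenAt"], row["contentId"])
```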
> @mielliott perhaps we have found our indexer in oxigraph . . .
At long last! Hopefully with no fun surprises like Jena's demanding urn:uuid prefixes, etc.
I should probably mention that I did implement the indexing functionality described in https://github.com/bio-guoda/preston/issues/199#issuecomment-1301489561 in the registry branch, using Lucene. It never made its way into main though. A big limitation with just using Lucene was the lack of a query language like SPARQL, so instead of writing a query.sparql to search the index, I could only do simple string matching.
Do you plan on packaging oxigraph with preston, or keeping it separate as in your examples?
> Do you plan on packaging oxigraph with preston, or keeping it separate as in your examples?
@mielliott great question! Not sure yet . . . am almost tempted to treat the oxigraph binaries as assets and add them to the content graph, along with functionality to execute workflows defined in that graph. But other than that, I do not see a compelling reason to merge preston with oxigraph and make it available in a single cli tool. But . . . we did add a preston s command to start a web interface, so why not allow something like preston sparql --anchor hash://sha256/.... to start a sparql endpoint for a specific biodiversity data graph. . .
Any ideas? What do you think, @mielliott ?
I've added some configuration to query the indexed provenance graph of GIB (GBIF, iDigBio, BioCase). The syntax is a bit weird, but grlc was quite helpful to get a usable API in front of the sparql endpoint.
Using GBIF's uuid for the eBird dataset (most of GBIF's volume; https://www.gbif.org/dataset/4fa7b334-ce0d-4e88-aaae-2e0c138d049e reformatted to urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e) yields the following 9 most recent eBird samples as retrieved from their reported origin.
curl "https://grlc.io/api-git/bio-guoda/preston-service/uuid.csv?uuid=urn%3Auuid%3A4fa7b334-ce0d-4e88-aaae-2e0c138d049e&endpoint=https%3A%2F%2Flod.globalbioticinteractions.org%2Fquery"\
| head\
| mlr --icsv --oxtab cat
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt 2023-12-02T16:05:25.261Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt 2023-11-02T04:13:34.257Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt 2023-10-01T17:02:42.558Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt 2023-09-02T12:54:17.912Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt 2023-08-02T10:47:52.277Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt 2023-07-02T21:02:24.922Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt 2023-06-01T19:32:39.368Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt 2023-05-01T22:45:13.589Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt 2023-04-02T14:43:59.511Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
Using GBIF's assigned DOI https://doi.org/10.15468/aomfnb the following can be retrieved:
curl 'https://grlc.io/api-git/bio-guoda/preston-service/doi.csv?doi=https%3A%2F%2Fdoi.org%2F10.15468%2Faomfnb&endpoint=https%3A%2F%2Flod.globalbioticinteractions.org%2Fquery'\
| head\
| mlr --icsv --oxtab cat
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt 2023-12-02T16:05:25.261Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt 2023-11-02T04:13:34.257Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt 2023-10-01T17:02:42.558Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt 2023-09-02T12:54:17.912Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt 2023-08-02T10:47:52.277Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt 2023-07-02T21:02:24.922Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt 2023-06-01T19:32:39.368Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt 2023-05-01T22:45:13.589Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299
archiveUrl https://hosted-datasets.gbif.org/eBird/2021-eBird-dwca-1.0.zip
seenAt 2023-04-02T14:43:59.511Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
query activity by known location of a darwin core archive https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip .
curl 'https://grlc.io/api-git/bio-guoda/preston-service/url.csv?url=https%3A%2F%2Fhosted-datasets.gbif.org%2FeBird%2F2022-eBird-dwca-1.0.zip&endpoint=https%3A%2F%2Flod.globalbioticinteractions.org%2Fquery'\
| head\
| mlr --icsv --oxtab cat
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt 2023-12-02T16:05:25.261Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt 2023-11-02T04:13:34.257Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt 2023-10-01T17:02:42.558Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt 2023-09-02T12:54:17.912Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
Querying for a known dwc archive hash hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
curl 'https://grlc.io/api-git/bio-guoda/preston-service/hash.csv?hash=hash%3A%2F%2Fsha256%2F1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d&endpoint=https%3A%2F%2Flod.globalbioticinteractions.org%2Fquery'\
| head\
| mlr --icsv --oxtab cat
yields
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt 2023-12-02T16:05:25.261Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt 2023-11-02T04:13:34.257Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt 2023-10-01T17:02:42.558Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
doi https://doi.org/10.15468/aomfnb
uuid urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
contentId hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
archiveUrl https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
seenAt 2023-09-02T12:54:17.912Z
provenanceId hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
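The four lookups above (uuid, doi, url, hash) share one URL shape, with the identifier and the SPARQL endpoint passed as percent-encoded query parameters. Below is a small sketch that builds those URLs. The base URL, the endpoint, and the four .csv route names are copied from the curl calls; the function name is made up.

```python
from urllib.parse import urlencode

GRLC_BASE = "https://grlc.io/api-git/bio-guoda/preston-service"
SPARQL_ENDPOINT = "https://lod.globalbioticinteractions.org/query"


def grlc_url(kind: str, value: str) -> str:
    """Build the CSV query URL for kind in {"uuid", "doi", "url", "hash"}."""
    if kind not in {"uuid", "doi", "url", "hash"}:
        raise ValueError(f"unsupported lookup kind: {kind}")
    # urlencode percent-encodes ":" and "/" in the values, as in the curl calls.
    query = urlencode({kind: value, "endpoint": SPARQL_ENDPOINT})
    return f"{GRLC_BASE}/{kind}.csv?{query}"


print(grlc_url("uuid", "urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e"))
```

Building the query string this way avoids hand-encoding identifiers such as hash://sha256/... that contain characters with special meaning in URLs.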
After some tinkering, I ended up implementing a redirection service.
The idea is that the service uses a content registry of known provenance, then redirects resolved content ids to a repository. Currently, the resolver resolves identifiers to their associated darwin core archives.
You can resolve by:
For identifiers that are not uniquely tied to content (e.g., uuid, doi, url), the resolver picks the most recent darwin core archive associated with the identifier. So, this implements a kind of a wayback machine for darwin core archives registered in the GBIF/iDigBio universe. For now, you can find provenance information for the redirect in the 302 http redirect response headers.
curl -I https://linker.bio/urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
yields
HTTP/1.1 302 Found
[...]
Location: https://linker.bio/hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
ETag: hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
Content-Location: https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
X-UUID: urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
X-DOI: https://doi.org/10.15468/aomfnb
X-PROV: hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd
X-PROV-wasInfluencedBy: https://doi.org/10.15468/aomfnb , urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e
X-PROV-wasGeneratedBy: urn:uuid:77f3faf7-acd2-4f14-9c0e-4e04ef5b63c7
X-PROV-generatedAtTime: 2023-12-02T16:05:25.261Z
X-PAV-hasVersion: hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
[...]
Where,
- X-PROV contains a reference to the specific corpus version. This version defines all the content and their known
- X-PROV-wasGeneratedBy details the activity uuid in the corpus version that found the requested content.
- X-PROV-generatedAtTime details the time at which the activity that found the requested content was started.
- X-PAV-hasVersion contains the content id of the content that is redirected to.
- X-PROV-wasInfluencedBy contains the entities that are associated with the redirected content. In this case, these are the GBIF eBird dataset UUID and DOI registered with the origin url of the content.
- Content-Location is the original resource location.
- Location is the location being redirected to (e.g., https://linker.bio/hash://sha256/....). The client can verify authenticity of the content by inspecting headers, or, perhaps better, the provenance graph itself.
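One way a client might do the header inspection mentioned above is sketched here: check that Location, ETag and X-PAV-hasVersion all name the same content id, then compare the sha256 of the downloaded bytes against it. The header names and values are copied from the response shown earlier. That ETag always equals X-PAV-hasVersion is an assumption based on that single response, and the function names are made up.

```python
import hashlib

RAW_HEADERS = """\
Location: https://linker.bio/hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
ETag: hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
X-PAV-hasVersion: hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d
"""


def parse_headers(raw: str) -> dict:
    """Parse 'Name: value' lines into a dict keyed by lowercased header name."""
    headers = {}
    for line in raw.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def headers_agree(headers: dict) -> bool:
    """True if Location, ETag and X-PAV-hasVersion name the same content id."""
    content_id = headers.get("x-pav-hasversion", "")
    return (
        content_id.startswith("hash://sha256/")
        and headers.get("etag") == content_id
        and headers.get("location", "").endswith(content_id)
    )


def matches_content_id(data: bytes, content_id: str) -> bool:
    """True if the sha256 of the downloaded bytes equals the content id."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest() == content_id


print(headers_agree(parse_headers(RAW_HEADERS)))
```

The header check only shows that the response is self-consistent; recomputing the hash over the downloaded bytes is what actually ties the content to its id.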
curl -I https://linker.bio/10.15468/aomfnb
resulting in the same redirection, as expected.
curl -I https://linker.bio/https://hosted-datasets.gbif.org/eBird/2022-eBird-dwca-1.0.zip
resulting in the same redirection as in examples 1 and 2, as expected.
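The "pick the most recent archive" rule for identifiers that are not uniquely tied to content can be sketched as below. This is not the resolver's actual implementation (the real service queries the oxigraph index); the type and function names are made up, and the two observations are taken from the eBird results listed earlier in this thread.

```python
from typing import NamedTuple, Optional


class Observation(NamedTuple):
    identifier: str  # uuid, doi or url as recorded in the provenance graph
    content_id: str
    seen_at: str     # ISO 8601 UTC timestamp


OBSERVATIONS = [
    Observation(
        "urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e",
        "hash://sha256/89704be7d158045b1e615c9d0349baed4f7bd2fe908f33875df09b8c60cff299",
        "2023-08-02T10:47:52.277Z",
    ),
    Observation(
        "urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e",
        "hash://sha256/1e2b7436fce1848f41698e5a9c193f311abaf0ee051bec1a2e48b5106d29524d",
        "2023-12-02T16:05:25.261Z",
    ),
]


def most_recent(observations, identifier: str) -> Optional[Observation]:
    """Return the latest observation for identifier, or None if it is unknown."""
    matches = [o for o in observations if o.identifier == identifier]
    # Same-format ISO 8601 UTC timestamps sort chronologically as strings.
    return max(matches, key=lambda o: o.seen_at, default=None)


latest = most_recent(OBSERVATIONS, "urn:uuid:4fa7b334-ce0d-4e88-aaae-2e0c138d049e")
print(latest.content_id if latest else "unknown identifier")
```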
The index is built using oxigraph (see https://github.com/bio-guoda/preston-service/blob/9466c7ac601902b28ff64e7ac83ed6a9a74624a5/query/index-provenance-graph.sh ) and results in a ~30GiB index. This index is then run as a read-only service using https://github.com/bio-guoda/preston-service/blob/9466c7ac601902b28ff64e7ac83ed6a9a74624a5/systemd/system/preston-registry.service .
The redirect service is configured to query the index and redirect to a known content repository via the configuration defined at https://github.com/bio-guoda/preston-service/blob/main/systemd/system/preston-redirect.service .
With this, we have a service that uses a well-defined relation between identifiers and their associated content. We no longer have to rely on DNS or dynamic databases, because our redirection is anchored in a specific provenance graph (in this case, the provenance graph with version hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd).
@seltmann @mielliott @cboettig - Can you feel the excitement? Curious to hear your thoughts.
You should be able to resolve any url/uuid/doi associated with darwin core archives registered with idigbio and gbif. At least, as recorded monthly since late 2018 / early 2019.
For a UCSB example . . . I am noticing how there are various ids / locations associated with a specific versioned piece of content - the DwC-A containing the digital collection records and their associated metadata.
id | redirect | content id |
---|---|---|
https://www.gbif.org/dataset/d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0 | https://linker.bio/urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0 | hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a |
https://doi.org/10.15468/w6hvhv | https://linker.bio/10.15468/w6hvhv | hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a |
https://ecdysis.org/content/dwca/UCSB-IZC_DwC-A.zip | https://linker.bio/https://ecdysis.org/content/dwca/UCSB-IZC_DwC-A.zip | hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a |
https://www.idigbio.org/portal/recordsets/65007e62-740c-4302-ba20-260fe68da291 | https://linker.bio/urn:uuid:65007e62-740c-4302-ba20-260fe68da291 | hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a |
So, to cite an exact version of a dataset, you can now say something like:
Cheadle Center for Biodiversity and Ecological Restoration (2023). University of California Santa Barbara Invertebrate Zoology Collection. Occurrence dataset https://doi.org/10.15468/w6hvhv as derived from the DwC-A defined in hash://sha256/5b7fa37bf8b64e7c935c4ff3389e36f8dd162f0705410dd719fd089e1ea253cd as gathered through activity urn:uuid:603cb45b-c23e-4d3e-a0bf-604d8537296d at 2023-12-03T06:16:07.462Z
Quite the mouthful, and precise.
Now, with added redirect badges for embedding on web pages . . .
with patterns being -
https://linker.bio/badge/[some known url / uuid / doi]
Example:
https://linker.bio/badge/urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0
which renders to:
which would redirect to the associated content via
https://linker.bio/urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0
id | redirect url | redirect badge | content id |
---|---|---|---|
https://www.gbif.org/dataset/d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0 | https://linker.bio/urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0 | | hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a |
https://doi.org/10.15468/w6hvhv | https://linker.bio/10.15468/w6hvhv | | hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a |
https://ecdysis.org/content/dwca/UCSB-IZC_DwC-A.zip | https://linker.bio/https://ecdysis.org/content/dwca/UCSB-IZC_DwC-A.zip | | hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a |
https://www.idigbio.org/portal/recordsets/65007e62-740c-4302-ba20-260fe68da291 | https://linker.bio/urn:uuid:65007e62-740c-4302-ba20-260fe68da291 | | hash://sha256/f5d8f67c1eca34cbba1abac12f353585c78bb053bc8ce7ee7e7a78846e1bfc4a |
@seltmann you can check whether your UCSB collection is tracked by Preston by embedding DwC-A and EML download buttons on your respective pages using GBIF Dataset DOI, DwC-A endpoint urls, GBIF Dataset UUID, iDigBio recordset UUIDs -
e.g., https://www.gbif.org/dataset/d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0
can be used as
urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0
to get most recent archived/tracked related DwC-A content using
https://linker.bio/urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0
with badge uri
https://linker.bio/badge/urn:uuid:d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0
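The identifier juggling above (GBIF dataset page to urn:uuid, then to a redirect link and a badge link) is mechanical enough to sketch in a few lines. The URL patterns are the ones shown in this thread; the function names are made up, and only GBIF dataset page URLs with lowercase UUIDs are handled.

```python
import re

LINKER = "https://linker.bio"
GBIF_DATASET = re.compile(
    r"^https://(?:www\.)?gbif\.org/dataset/"
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$"
)


def gbif_url_to_urn(url: str) -> str:
    """Turn a GBIF dataset page URL into the matching urn:uuid identifier."""
    match = GBIF_DATASET.match(url)
    if match is None:
        raise ValueError(f"not a GBIF dataset URL: {url}")
    return f"urn:uuid:{match.group(1)}"


def redirect_url(identifier: str) -> str:
    """Link that redirects to the most recent tracked content for identifier."""
    return f"{LINKER}/{identifier}"


def badge_url(identifier: str) -> str:
    """Link to the badge image for identifier."""
    return f"{LINKER}/badge/{identifier}"


urn = gbif_url_to_urn("https://www.gbif.org/dataset/d6097f75-f99e-4c2a-b8a5-b0fc213ecbd0")
print(redirect_url(urn))
print(badge_url(urn))
```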
@seltmann please note that I've redesigned the badge to be a FAIR assessment badge.
So, without further ado:
drum roll. . . .
Congratulations to @seltmann and colleagues: UCSB-IZC is FAIR!
Accessed from https://linker.bio/#use-case-4-assessing-fairness-of-biodiversity-data on 2024-01-03 -
Amazing stuff @jhpoelen, very fun! I noticed the badges default to calling stuff a DwC-A if the content type is unknown or the content doesn't exist:
And this kinda confused me when toying around with the new badge feature, asking for badges of silly things like RSS feeds or fake IDs. I can see this causing some confusion if for example something goes wrong for someone's EML/etc. badge, causing linker.bio to instead make a "DwC-A" badge. May I suggest a more unassuming badge when content type can't be determined? A more general "Content", "Error", or just blank? Or maybe there's an "unknown" MimeType or similar.
@mielliott thanks for sharing your thoughts. I can see how a badge with "DwC-A unknown" can be confusing, especially when plugging in any kind of stuff like https://linker.bio/badge/10.12/345 . What the badge is trying to say is: I couldn't find any trace of a DwC archive associated with "10.12/345". So even if some content is associated with "10.12/345" but it wasn't intended to be DwC, it'll still show the "DwC-A unknown" badge.
So requesting:
https://linker.bio/badge/10.12/345
is equivalent to asking:
https://linker.bio/badge/10.12/345?type=application/dwca
With this information, would you have any suggestions on how to make the "DwC-A unknown" badge less confusing and more informative?
How about like https://linker.bio/badge/10.12/345?type=cats?
In this case I'd argue that being less informative is less confusing
PS - I really like the feature to specify the content type 🙌
#196 suggests allowing preston to look up URLs associated with a hash. Doing this quickly requires building an index. I can imagine two ways this could work:
1) Feed nquads into preston index for indexing
2) Feed hashes (or nquads containing hashes) into preston index and index their content

Option 1 is simpler and makes it easier for the user to pick and choose what goes into the index.
Option 2 has the advantage of being able to record where statements came from, which is a big part of what "indexes" do, and also keeps the provenance chain going, which is great. E.g., in a Lucene index where "documents" represent RDF statements, we could record each statement's origin as a line in a provenance log (line:hash://sha256/abc!/L52).
(There's also an option 3: do both option 1 and option 2)
Option 1 is tempting but I think I favor option 2. @jhpoelen thoughts? Or better ideas?
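To make option 2 a bit more concrete, here is a minimal sketch of pairing each statement in a provenance log with an origin reference of the form line:hash://sha256/...!/L52. It assumes line numbers are 1-based and that the log is UTF-8 text with one statement per line; the function name is made up, and an indexer would store the pairs rather than print them.

```python
import hashlib


def index_statements(log_bytes: bytes):
    """Yield (statement, origin) pairs for each non-empty line of a provenance log."""
    content_id = "hash://sha256/" + hashlib.sha256(log_bytes).hexdigest()
    lines = log_bytes.decode("utf-8").splitlines()
    for number, line in enumerate(lines, start=1):
        if line.strip():
            # Origin points back at the exact line of the exact log version.
            yield line, f"line:{content_id}!/L{number}"


# Illustrative two-statement log, not real provenance.
log = b"<a> <b> <c> .\n\n<d> <e> <f> .\n"
for statement, origin in index_statements(log):
    print(origin, statement)
```

Because the origin embeds the content id of the log itself, every indexed statement stays linked to the specific version of the provenance it came from.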