bio-guoda / preston

a biodiversity dataset tracker
MIT License

How do I find linker.bio's monthly preston crawls? #245

Closed: mielliott closed 4 months ago

mielliott commented 1 year ago
$ preston history --remote https://linker.bio
<hash://sha256/9b895c0c7db3ea32c99a2bab89476251e7ada77c9a2167f00b7106d438f8c06e> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/f33efc4c0c79f47acddd92527a854513c6bb726c67c7c9d92c69e1ff532aaf2e> .
<hash://sha256/f33efc4c0c79f47acddd92527a854513c6bb726c67c7c9d92c69e1ff532aaf2e> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/3e4eb728b49a38799a2d64de8a04a171e23aceb2e46889b2988128f91499c2a1> .
<hash://sha256/3e4eb728b49a38799a2d64de8a04a171e23aceb2e46889b2988128f91499c2a1> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/8058dcbcd5a6cc07d98749c3446560b8626afd6e133d762c1e1c0f69d7af786e> .
<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/pav/hasVersion> <hash://sha256/8058dcbcd5a6cc07d98749c3446560b8626afd6e133d762c1e1c0f69d7af786e> .
$ preston get --remote https://linker.bio hash://sha256/8058dcbcd5a6cc07d98749c3446560b8626afd6e133d762c1e1c0f69d7af786e
<https://preston.guoda.bio> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#SoftwareAgent> <urn:uuid:374b07ed-3a31-4dad-bb68-2f34c2225a45> .
<https://preston.guoda.bio> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> <urn:uuid:374b07ed-3a31-4dad-bb68-2f34c2225a45> .
<https://preston.guoda.bio> <http://purl.org/dc/terms/description> "Preston is a software program that finds, archives and provides access to biodiversity datasets."@en <urn:uuid:374b07ed-3a31-4dad-bb68-2f34c2225a45> .
<urn:uuid:374b07ed-3a31-4dad-bb68-2f34c2225a45> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> <urn:uuid:374b07ed-3a31-4dad-bb68-2f34c2225a45> .
<urn:uuid:374b07ed-3a31-4dad-bb68-2f34c2225a45> <http://purl.org/dc/terms/description> "A crawl event that discovers biodiversity archives."@en <urn:uuid:374b07ed-3a31-4dad-bb68-2f34c2225a45> .
<urn:uuid:374b07ed-3a31-4dad-bb68-2f34c2225a45> <http://www.w3.org/ns/prov#startedAtTime> "2023-02-15T23:51:23.481Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:uuid:374b07ed-3a31-4dad-bb68-2f34c2225a45> .
<urn:uuid:374b07ed-3a31-4dad-bb68-2f34c2225a45> <http://www.w3.org/ns/prov#wasStartedBy> <https://preston.guoda.bio> <urn:uuid:374b07ed-3a31-4dad-bb68-2f34c2225a45> .
<https://doi.org/10.5281/zenodo.1410543> <http://www.w3.org/ns/prov#usedBy> <urn:uuid:374b07ed-3a31-4dad-bb68-2f34c2225a45> <urn:uuid:374b07ed-3a31-4dad-bb68-2f34c2225a45> .
<https://doi.org/10.5281/zenodo.1410543> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/dc/dcmitype/Software> <urn:uuid:374b07ed-3a31-4dad-bb68-2f34c2225a45> .
<https://doi.org/10.5281/zenodo.1410543> <http://purl.org/dc/terms/bibliographicCitation> "Jorrit Poelen, Icaro Alzuru, & Michael Elliott. 2021. Preston: a biodiversity dataset tracker (Version 0.5.2) [Software]. Zenodo. http://doi.org/10.5281/zenodo.1410543"@en <urn:uuid:374b07ed-3a31-4dad-bb68-2f34c2225a45> .
<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> <urn:uuid:374b07ed-3a31-4dad-bb68-2f34c2225a45> .
<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/dc/terms/description> "A biodiversity dataset graph archive."@en <urn:uuid:374b07ed-3a31-4dad-bb68-2f34c2225a45> .
<hash://sha256/eb557c6d8fee0a9177143987ab0f0146b8fa9f6c40ec373056a8a8fa01366836> <http://www.w3.org/ns/prov#wasGeneratedBy> <urn:uuid:a89831b4-6690-4d05-adfe-09412bd57bc4> <urn:uuid:a89831b4-6690-4d05-adfe-09412bd57bc4> .
<hash://sha256/eb557c6d8fee0a9177143987ab0f0146b8fa9f6c40ec373056a8a8fa01366836> <http://www.w3.org/ns/prov#qualifiedGeneration> <urn:uuid:a89831b4-6690-4d05-adfe-09412bd57bc4> <urn:uuid:a89831b4-6690-4d05-adfe-09412bd57bc4> .
<urn:uuid:a89831b4-6690-4d05-adfe-09412bd57bc4> <http://www.w3.org/ns/prov#generatedAtTime> "2023-02-15T23:51:23.642Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:uuid:a89831b4-6690-4d05-adfe-09412bd57bc4> .
<urn:uuid:a89831b4-6690-4d05-adfe-09412bd57bc4> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> <urn:uuid:a89831b4-6690-4d05-adfe-09412bd57bc4> .
<urn:uuid:a89831b4-6690-4d05-adfe-09412bd57bc4> <http://www.w3.org/ns/prov#wasInformedBy> <urn:uuid:374b07ed-3a31-4dad-bb68-2f34c2225a45> <urn:uuid:a89831b4-6690-4d05-adfe-09412bd57bc4> .
<urn:uuid:a89831b4-6690-4d05-adfe-09412bd57bc4> <http://www.w3.org/ns/prov#used> <file:///home/jorrit/proj/aja-alignment/name-alignment-bats/input/Complete.tsv> <urn:uuid:a89831b4-6690-4d05-adfe-09412bd57bc4> .
<file:///home/jorrit/proj/aja-alignment/name-alignment-bats/input/Complete.tsv> <http://purl.org/pav/hasVersion> <hash://sha256/eb557c6d8fee0a9177143987ab0f0146b8fa9f6c40ec373056a8a8fa01366836> <urn:uuid:a89831b4-6690-4d05-adfe-09412bd57bc4> .

The history served by https://linker.bio describes someone named jorrit playing with a bats dataset. I was expecting to find the prov logs of linker.bio's monthly crawls of iDigBio & friends. How do I find the May 2023 crawl log?

mielliott commented 1 year ago

Setting the provenance anchor to the hash of the graph published in https://zenodo.org/record/3852671, I think I'm finding my way to the May log:

$ preston history --remote https://linker.bio -r hash://sha256/8aacce08462b87a345d271081783bdd999663ef90099212c8831db399fc0831b
[https://linker.bio/hash:...099212c8831db399fc0831b] 133 MB at 17.08 MB/s completed in < 1 minute
<hash://sha256/8aacce08462b87a345d271081783bdd999663ef90099212c8831db399fc0831b> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/f13b15a20e4fe70b4a111e67ac20ef676404b8456dfc39694f2cb3a4c62a2b2d> .
[https://linker.bio/hash:...dfc39694f2cb3a4c62a2b2d] 132 MB at 18.27 MB/s completed in < 1 minute
<hash://sha256/f13b15a20e4fe70b4a111e67ac20ef676404b8456dfc39694f2cb3a4c62a2b2d> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/3b39831bcc286c1db44787e21b736378f5847a16b7c39bdac3dd2011e9189dc1> .
[https://linker.bio/hash:...7c39bdac3dd2011e9189dc1] 300 MB at 20.29 MB/s completed in < 1 minute
<hash://sha256/3b39831bcc286c1db44787e21b736378f5847a16b7c39bdac3dd2011e9189dc1> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/916255b2b73680595dcb22b30991a757dd223208473fb4fbe90405757bc07953> .
[https://linker.bio/hash:...73fb4fbe90405757bc07953] 101 MB at 17.20 MB/s completed in < 1 minute
<hash://sha256/916255b2b73680595dcb22b30991a757dd223208473fb4fbe90405757bc07953> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/39f83f5805f32f765003c5e9ee8c69adb3889d9f26dd61bf4aa3a829ac744e2c> .
[https://linker.bio/hash:...6dd61bf4aa3a829ac744e2c] 101 MB at 16.56 MB/s completed in < 1 minute
<hash://sha256/39f83f5805f32f765003c5e9ee8c69adb3889d9f26dd61bf4aa3a829ac744e2c> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/5dcf876c6cb0c5b15197acf1ea6989d41c1a1333c6a7e0437f035aa9d22a3790> .
[https://linker.bio/hash:...6a7e0437f035aa9d22a3790] 93 MB at 16.78 MB/s completed in < 1 minute
<hash://sha256/5dcf876c6cb0c5b15197acf1ea6989d41c1a1333c6a7e0437f035aa9d22a3790> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/9c17ce013b33c3c9e6bc513cb49a14660fad9bd6f87a4f21568cc871b10ba39b> .
[https://linker.bio/hash:...87a4f21568cc871b10ba39b] 93 MB at 18.91 MB/s completed in < 1 minute
<hash://sha256/9c17ce013b33c3c9e6bc513cb49a14660fad9bd6f87a4f21568cc871b10ba39b> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/6c4c94cdb224d39e7c655b1a1a6afbba8daf3c9ac64c42ba72dfd346d5d3a547> .
[https://linker.bio/hash:...64c42ba72dfd346d5d3a547] 87 MB at 15.74 MB/s completed in < 1 minute
<hash://sha256/6c4c94cdb224d39e7c655b1a1a6afbba8daf3c9ac64c42ba72dfd346d5d3a547> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/ff74959ec6e5e98e7db674afcb915f50725f049b968e9a9f10de169aa0a3dcb5> .
[https://linker.bio/hash:...68e9a9f10de169aa0a3dcb5] 89 MB at 13.77 MB/s completed in < 1 minute
<hash://sha256/ff74959ec6e5e98e7db674afcb915f50725f049b968e9a9f10de169aa0a3dcb5> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/ab62f4a9601f30d23353a479830f9d2dfc7898e15d2cc2d81977e898d885c908> .
[https://linker.bio/hash:...d2cc2d81977e898d885c908] 249 kB at 0.41 MB/s completed in < 1 minute
<hash://sha256/ab62f4a9601f30d23353a479830f9d2dfc7898e15d2cc2d81977e898d885c908> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/6fb7271a2da1543036e39bcdb4c415a46b5437569eaaf0ffdef3e907a2f4309f> .
[https://linker.bio/hash:...eaaf0ffdef3e907a2f4309f] 554 kB at 0.76 MB/s completed in < 1 minute
<hash://sha256/6fb7271a2da1543036e39bcdb4c415a46b5437569eaaf0ffdef3e907a2f4309f> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/d79fb9207329a2813b60713cf0968fda10721d576dcb7a36038faf18027eebc1> .
[https://linker.bio/hash:...dcb7a36038faf18027eebc1] 940 MB at 14.39 MB/s
mielliott commented 1 year ago

Some of these logs are really big! ~I wonder why?~ These are old (pre-2019) logs; things were wonkier back then.

[https://linker.bio/hash:...dcb7a36038faf18027eebc1] 1102 MB at 15.10 MB/s completed in 1 minute(s)

Ranging from a measly 15MB to a whopping 1.1GB(!)

mielliott commented 1 year ago

Ahhhhhhhhhh shucks....... preston history goes backward in time now. How do I go forward?
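
For what it's worth, each statement printed by `preston history` above reads `<newer> wasDerivedFrom <older>`, so one local workaround is to invert those statements and look up the successor of a known hash. Below is a rough Python sketch of that idea, not a preston feature: the function names are hypothetical, and it can only walk forward through versions already present in the history output it is given, so it will not discover anything newer than the anchor.

```python
import re

# Matches one provenance statement as printed by `preston history`:
# <subject> <predicate> <object> .
STATEMENT = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s+\.\s*$")
DERIVED_FROM = "http://www.w3.org/ns/prov#wasDerivedFrom"


def successors(history_lines):
    """Map each older version hash to the newer version derived from it.

    Hypothetical helper: lines that are not wasDerivedFrom statements
    (progress output, hasVersion statements) are ignored.
    """
    forward = {}
    for line in history_lines:
        match = STATEMENT.match(line.strip())
        if match and match.group(2) == DERIVED_FROM:
            newer, _, older = match.groups()
            forward[older] = newer
    return forward


def walk_forward(forward, start):
    """Follow successor links from `start` until no newer version is known."""
    chain = [start]
    seen = {start}
    while chain[-1] in forward and forward[chain[-1]] not in seen:
        chain.append(forward[chain[-1]])
        seen.add(chain[-1])
    return chain
```

Feeding it the saved output of `preston history` (one line per list element) and a starting hash returns the chain from that hash up to the newest version mentioned in the output.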

mielliott commented 1 year ago

I thought maybe preston head with the root set to the 2020-05-01 crawl would climb its way to a 2023-05-01 crawl, but no luck:

$ preston head --remote https://linker.bio -r hash://sha256/8aacce08462b87a345d271081783bdd999663ef90099212c8831db399fc0831b
hash://sha256/8aacce08462b87a345d271081783bdd999663ef90099212c8831db399fc0831b
mielliott commented 1 year ago

I did something horrible:

$ echo -n "hash://sha256/8aacce08462b87a345d271081783bdd999663ef90099212c8831db399fc0831b" > data/2a/5d/2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a
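
The path in that command follows a pattern that can be read off the command itself: the file is named by a 64-character hex digest and sits under directories named after the digest's first two and next two characters. Here is a small Python sketch of that mapping, inferred only from the single path shown here. The function name is hypothetical, and how preston derives the digest used as the index key is not shown in this thread, so the digest has to be supplied.

```python
from pathlib import PurePosixPath


def local_path(hex_digest, data_dir="data"):
    """Build data/<aa>/<bb>/<digest> from a 64-character hex digest.

    Hypothetical helper: the layout is assumed from the one path in
    this thread, not taken from preston's documentation.
    """
    is_hex = all(c in "0123456789abcdef" for c in hex_digest)
    if len(hex_digest) != 64 or not is_hex:
        raise ValueError("expected a lowercase 64-character hex digest")
    return PurePosixPath(data_dir, hex_digest[:2], hex_digest[2:4], hex_digest)
```

Given the digest from the command above, this returns the same `data/2a/5d/...` path that was overwritten by hand.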

And now:

$ preston head --remote https://linker.bio
[https://linker.bio/hash:...5f8487e0d6a5233e3cd3146] 100.0% of 78 bytes at 0.02 MB/s completed in < 1 minute
[https://linker.bio/hash:...43a8035d8efcfcb403ec547] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...61377ec6a8fdc64dd1ba0d4] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...de2f5916da27000f1efe004] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...900ac41515ad4f2ae52b8b0] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...8666373d21539de8ef7b4b4] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...56e63fa0992ba19f0ca5c85] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...2ac88992b527564d6563064] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...16f6eaa105d246016f32110] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...6befdd6b3929f332ddd4f26] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...22589d0e2318b3ea44ef779] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...59020b2cd1e235c8c05bd41] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...8c010576282d4833ae096be] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...77f0ceaa0dd220c04012fd2] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...3a81bcfabd4abc385d4b8bc] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...d7fb061187253298435f876] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...abc2dd163b4b161149490ed] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...6d9065b0f72c71434b4794e] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...0c7251e4e6884f768835fbe] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...e7e7833a9cdb2fcb664a5bb] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...9e5e84542c872875edf55f0] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...b138ebc478104e3c115d57f] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...f04b1a56f5e2e29b961b88d] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...63fc758414b6b4a2d68ffc9] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...604f212012225bd01ca99d5] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...26b98e4186d22fb6779c24a] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...171b7b12af3a23b9f00a6b6] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...32a8e3ae31d4daa0fdcc0ed] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...3306edad4b9ae23d7742597] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...f47160214ff1dd998d3c66a] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...8ad9a100307c0dad80381bb] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...bad48714d9219a8f0d9d86d] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...f30ad87f93d058e88390579] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...08552f24897548f28ab8001] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...ad7d2a7b108b9f2b1fa541c] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...54a4931de64ab2d2a22da04] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...da2295dd565f3fc8e0a9ced] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...d0e3044e770c36cc85627a2] 100.0% of 78 bytes at ? MB/s completed in < 1 minute
[https://linker.bio/hash:...d639306a912da088c4e1d3f] 432 MB at 20.76 MB/s completed in < 1 minute
hash://sha256/c5989d88250fd6c92f312dd01afa52126b7f02f29d639306a912da088c4e1d3f

Success! hash://sha256/c5989d88250fd6c92f312dd01afa52126b7f02f29d639306a912da088c4e1d3f is the June 2023 log, which is even better than getting the May log.

$ preston get --remote https://linker.bio 'line:hash://sha256/c5989d88250fd6c92f312dd01afa52126b7f02f29d639306a912da088c4e1d3f!/L25'
<urn:uuid:91772302-544d-4385-a6bd-b2db2bebc6ed> <http://www.w3.org/ns/prov#generatedAtTime> "2023-06-01T03:49:40.361Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:uuid:91772302-544d-4385-a6bd-b2db2bebc6e
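
To check which month a log belongs to without reading the timestamp by eye, the `generatedAtTime` literal can be pulled out of a statement like the one above. A minimal Python sketch, assuming the timestamp always has the millisecond `...Z` form seen in this thread; the function name is hypothetical.

```python
import re
from datetime import datetime, timezone

# Captures the quoted literal that is typed as xsd:dateTime.
DATETIME_LITERAL = re.compile(
    r'"([^"]+)"\^\^<http://www\.w3\.org/2001/XMLSchema#dateTime>'
)


def generated_at(statement):
    """Return the xsd:dateTime literal of a statement as an aware datetime.

    Returns None when the line carries no dateTime literal. Assumes the
    'YYYY-MM-DDTHH:MM:SS.fffZ' form (UTC) used in the logs quoted here.
    """
    match = DATETIME_LITERAL.search(statement)
    if match is None:
        return None
    parsed = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)
```

Because it only looks for the quoted literal, it also works on a line whose trailing graph label has been cut off.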
jhpoelen commented 1 year ago

@mielliott thanks for sharing the inconsistent behavior re: https://linker.bio . The root cause was the configuration of the linker.bio server, which has been fixed via https://github.com/bio-guoda/preston-service/commit/1ab55c1d1ecef96a56214610fe76e6ef8962a27b . I'll deploy the new configuration momentarily. Please do note that nginx (the webserver) has a cache, so the old results may persist for a while. That is probably something to think about when swimming upstream using the funky query hashes that we've introduced to make it easy to look into the future.

Curious to hear your thoughts . . .

mielliott commented 1 year ago

Thanks for the fix. At least for the funky query hashes, I can use my local index instead when needed.

I feel like preston head -r [anchor] should swim up the index starting at [anchor], but that's not what I'm seeing (https://github.com/bio-guoda/preston/issues/245#issuecomment-1579024404). Is this a bug, or is there a better way to do it, apart from manually editing the index files?

jhpoelen commented 1 year ago

Bug or a feature. Up to you!