bio-guoda / preston

a biodiversity dataset tracker
MIT License

suggest to track wikidata dump https://dumps.wikimedia.org/wikidatawiki/latest/ #218

Closed: jhpoelen closed this issue 1 year ago

jhpoelen commented 1 year ago

@Daniel-Mietchen suggests tracking the Wikidata dump at https://dumps.wikimedia.org/wikidatawiki/latest/

jhpoelen commented 1 year ago

October 28, 2022: 10 years of Wikidata.

jhpoelen commented 1 year ago

https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2

jhpoelen commented 1 year ago

A first tracked version of the Wikidata dump was created:

preston history --remote https://linker.bio --anchor hash://sha256/6ba7c0a9efd1e7dee323ea6140df260ac59eafc54ccc41035e345b7c0bd8e35e
<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/pav/hasVersion> <hash://sha256/6ba7c0a9efd1e7dee323ea6140df260ac59eafc54ccc41035e345b7c0bd8e35e> .

with

preston head --remote https://linker.bio --anchor hash://sha256/6ba7c0a9efd1e7dee323ea6140df260ac59eafc54ccc41035e345b7c0bd8e35e\
 | preston cat --remote https://linker.bio

yielding

<https://preston.guoda.bio> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#SoftwareAgent> <urn:uuid:e9931eba-306d-4a69-8d54-cc6f37306488> .
<https://preston.guoda.bio> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> <urn:uuid:e9931eba-306d-4a69-8d54-cc6f37306488> .
<https://preston.guoda.bio> <http://purl.org/dc/terms/description> "Preston is a software program that finds, archives and provides access to biodiversity datasets."@en <urn:uuid:e9931eba-306d-4a69-8d54-cc6f37306488> .
<urn:uuid:e9931eba-306d-4a69-8d54-cc6f37306488> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> <urn:uuid:e9931eba-306d-4a69-8d54-cc6f37306488> .
<urn:uuid:e9931eba-306d-4a69-8d54-cc6f37306488> <http://purl.org/dc/terms/description> "A crawl event that discovers biodiversity archives."@en <urn:uuid:e9931eba-306d-4a69-8d54-cc6f37306488> .
<urn:uuid:e9931eba-306d-4a69-8d54-cc6f37306488> <http://www.w3.org/ns/prov#startedAtTime> "2023-02-09T16:13:45.157Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:uuid:e9931eba-306d-4a69-8d54-cc6f37306488> .
<urn:uuid:e9931eba-306d-4a69-8d54-cc6f37306488> <http://www.w3.org/ns/prov#wasStartedBy> <https://preston.guoda.bio> <urn:uuid:e9931eba-306d-4a69-8d54-cc6f37306488> .
<https://doi.org/10.5281/zenodo.1410543> <http://www.w3.org/ns/prov#usedBy> <urn:uuid:e9931eba-306d-4a69-8d54-cc6f37306488> <urn:uuid:e9931eba-306d-4a69-8d54-cc6f37306488> .
<https://doi.org/10.5281/zenodo.1410543> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/dc/dcmitype/Software> <urn:uuid:e9931eba-306d-4a69-8d54-cc6f37306488> .
<https://doi.org/10.5281/zenodo.1410543> <http://purl.org/dc/terms/bibliographicCitation> "Jorrit Poelen, Icaro Alzuru, & Michael Elliott. 2021. Preston: a biodiversity dataset tracker (Version 0.5.2) [Software]. Zenodo. http://doi.org/10.5281/zenodo.1410543"@en <urn:uuid:e9931eba-306d-4a69-8d54-cc6f37306488> .
<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> <urn:uuid:e9931eba-306d-4a69-8d54-cc6f37306488> .
<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/dc/terms/description> "A biodiversity dataset graph archive."@en <urn:uuid:e9931eba-306d-4a69-8d54-cc6f37306488> .
<hash://sha256/a0f5e6e655596d719131d173eaa68aec0c0a3d2a59e4bbff58486713a8a26184> <http://www.w3.org/ns/prov#wasGeneratedBy> <urn:uuid:5cd2c875-8f3f-46e1-b04a-4454054f5d44> <urn:uuid:5cd2c875-8f3f-46e1-b04a-4454054f5d44> .
<hash://sha256/a0f5e6e655596d719131d173eaa68aec0c0a3d2a59e4bbff58486713a8a26184> <http://www.w3.org/ns/prov#qualifiedGeneration> <urn:uuid:5cd2c875-8f3f-46e1-b04a-4454054f5d44> <urn:uuid:5cd2c875-8f3f-46e1-b04a-4454054f5d44> .
<urn:uuid:5cd2c875-8f3f-46e1-b04a-4454054f5d44> <http://www.w3.org/ns/prov#generatedAtTime> "2023-02-09T21:44:17.461Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:uuid:5cd2c875-8f3f-46e1-b04a-4454054f5d44> .
<urn:uuid:5cd2c875-8f3f-46e1-b04a-4454054f5d44> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> <urn:uuid:5cd2c875-8f3f-46e1-b04a-4454054f5d44> .
<urn:uuid:5cd2c875-8f3f-46e1-b04a-4454054f5d44> <http://www.w3.org/ns/prov#wasInformedBy> <urn:uuid:e9931eba-306d-4a69-8d54-cc6f37306488> <urn:uuid:5cd2c875-8f3f-46e1-b04a-4454054f5d44> .
<urn:uuid:5cd2c875-8f3f-46e1-b04a-4454054f5d44> <http://www.w3.org/ns/prov#used> <https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2> <urn:uuid:5cd2c875-8f3f-46e1-b04a-4454054f5d44> .
<https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2> <http://purl.org/pav/hasVersion> <hash://sha256/a0f5e6e655596d719131d173eaa68aec0c0a3d2a59e4bbff58486713a8a26184> <urn:uuid:5cd2c875-8f3f-46e1-b04a-4454054f5d44> .

only one content alias (aka URL) is tracked:

preston alias --remote https://linker.bio --anchor hash://sha256/6ba7c0a9efd1e7dee323ea6140df260ac59eafc54ccc41035e345b7c0bd8e35e
<https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2> <http://purl.org/pav/hasVersion> <hash://sha256/a0f5e6e655596d719131d173eaa68aec0c0a3d2a59e4bbff58486713a8a26184> <urn:uuid:5cd2c875-8f3f-46e1-b04a-4454054f5d44> .
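The statement above links a URL to a content id through the hasVersion predicate. As a rough sketch only (content_ids is a hypothetical helper, not a preston command; it assumes awk is available and that each statement sits on one line with space-separated terms, as in the output shown here), the content ids could be pulled out of such output like this:

```shell
# Hypothetical helper, not a preston command: read N-Quads on stdin and
# print the content id (hash://sha256/...) of every pav:hasVersion statement.
content_ids() {
  awk '$2 == "<http://purl.org/pav/hasVersion>" && $3 ~ /^<hash:\/\/sha256\// {
         id = $3
         gsub(/^<|>$/, "", id)   # strip the surrounding angle brackets
         print id
       }'
}
```

Piping the `preston alias ...` command above into `content_ids` should then print just the hash://sha256/... identifier of the tracked dump. Note that the filter matches any hasVersion statement with a sha256 object, so it would also print provenance log versions if those appear in the input.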

The associated content can be streamed using:

preston cat --no-cache --remote https://linker.bio hash://sha256/a0f5e6e655596d719131d173eaa68aec0c0a3d2a59e4bbff58486713a8a26184\
 > latest-all.json.bz2

jhpoelen commented 1 year ago

To get a verified copy, omit the --no-cache option:

preston cat --remote https://linker.bio hash://sha256/a0f5e6e655596d719131d173eaa68aec0c0a3d2a59e4bbff58486713a8a26184\
 > latest-all.json.bz2

Preston then downloads the file first, checks its content id, and only then streams the result.
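The content id check can also be reproduced by hand on a file that was fetched some other way. The following is a minimal sketch, not preston's own implementation: verify_content_id is a hypothetical helper, and it assumes sha256sum (GNU coreutils) is installed.

```shell
# Hypothetical helper, not part of preston: succeed only if the SHA-256
# of FILE matches the hex digest in a hash://sha256/... content id.
# Assumes sha256sum (GNU coreutils) is on the PATH.
verify_content_id() {
  file=$1
  expected=${2#hash://sha256/}    # strip the URI prefix, keep the hex digest
  actual=$(sha256sum < "$file" | cut -d ' ' -f 1)
  [ "$actual" = "$expected" ]
}
```

For example, `verify_content_id latest-all.json.bz2 hash://sha256/a0f5e6e655596d719131d173eaa68aec0c0a3d2a59e4bbff58486713a8a26184 && echo verified` would print "verified" only when the local file hashes to the tracked content id.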

jhpoelen commented 1 year ago

You can also download the content directly, using the naming convention:

https://linker.bio/hash://sha256/a0f5e6e655596d719131d173eaa68aec0c0a3d2a59e4bbff58486713a8a26184

like:

curl "https://linker.bio/hash://sha256/a0f5e6e655596d719131d173eaa68aec0c0a3d2a59e4bbff58486713a8a26184"
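The naming convention above amounts to appending the content id to the remote's base URL. A small sketch of that, where content_url is a hypothetical helper and the default remote is simply the one used in this thread:

```shell
# Hypothetical helper: build a download URL by prefixing a content id
# with the remote, following the naming convention shown above.
content_url() {
  remote=${2:-https://linker.bio}          # default to the remote used in this thread
  printf '%s/%s\n' "${remote%/}" "$1"      # drop one trailing slash, then join
}
```

With that in place, `curl "$(content_url hash://sha256/a0f5e6e655596d719131d173eaa68aec0c0a3d2a59e4bbff58486713a8a26184)" > latest-all.json.bz2` would save the dump to a local file. curl itself does not check the content id of what it downloads.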