bio-guoda / preston

a biodiversity dataset tracker
MIT License

append provenance index file on anchored preston track cmd #256

Open jhpoelen opened 1 year ago

jhpoelen commented 1 year ago

Given an existing Preston package (like [1]), I'd like to be able to append a new version without having to clone the entire package with its history from the beginning of time.

However, while,

preston track\
 --remote https://zenodo.org/record/8125362/files\
 --anchor hash://sha256/0e9bc57bc082b58a2c7a509bb73362b258ec8ddfc6664898e25c639786413fda\
 "https://www.discoverlife.org/mp/20q/?act=x_checklist&guide=Apoidea_species&flags=HAS"

produces a linked provenance log (see preston ls | grep "0e9bc57bc082b58a2c7a509bb73362b258ec8ddfc6664898e25c639786413fda") with the statement:

<hash://sha256/0e9bc57bc082b58a2c7a509bb73362b258ec8ddfc6664898e25c639786413fda> <http://www.w3.org/ns/prov#usedBy> <urn:uuid:c93732f1-58dc-4bd2-9fca-db6ad9301748> <urn:uuid:c93732f1-58dc-4bd2-9fca-db6ad9301748> .

it does not produce a provenance query index for the anchor. Instead, a "root" query index is generated, with query hash hash://sha256/2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a.

e.g.,

find .

produced:

.
./data
./data/2a
./data/2a/5d
./data/2a/5d/2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a
./data/c4
./data/c4/fc
./data/c4/fc/c4fc072c4977b8a55fc386402b6b2b3128f9de27349e61b369c887ce88e525e8
./data/37
./data/37/48
./data/37/48/37489fc6e5bcfb53e996ddfbc28ef0bb8c00470891f31bb91370953f05235a1d

This means that when the provenance query index file is added to an existing publication, the corpus would not allow for traversing from the first version (the provenance "root") to the most recently added version.
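The paths in the listing above appear to follow a two-level layout: the first two hex characters of the hash, then the next two, then the full hash. As a minimal sketch, assuming that layout holds (it is inferred from the listing only, and hash_to_path is a hypothetical helper, not a Preston command), the expected location of a given query index or provenance log can be computed like this:

```shell
# Hypothetical helper: map a sha256 hash URI to the local path seen in the
# `find .` listing, assuming data/<hex chars 1-2>/<hex chars 3-4>/<full hex>.
hash_to_path() {
  # Strip the hash://sha256/ prefix to get the bare hex digest.
  hex="${1#hash://sha256/}"
  first="$(printf '%s' "$hex" | cut -c1-2)"
  second="$(printf '%s' "$hex" | cut -c3-4)"
  printf './data/%s/%s/%s\n' "$first" "$second" "$hex"
}

# Where the "root" query index from this issue would be stored:
hash_to_path hash://sha256/2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a
```

This makes it easy to check which of the three files under ./data is the "root" query index.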

I suggest instead generating a provenance query index that answers the question "what is the provenance log that came after 0e9bc57bc082b58a2c7a509bb73362b258ec8ddfc6664898e25c639786413fda?", rather than the current provenance query index, which answers the question "what is the first provenance log?"

References

[1] Poelen, Jorrit H. (2023). Nomer Corpus of Taxonomic Resources hash://sha256/0e9bc57bc082b58a2c7a509bb73362b258ec8ddfc6664898e25c639786413fda hash://md5/91dd844e787ffae8f0a2bbb8c1f29192 (0.16) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8125362

jhpoelen commented 1 year ago

On a related note, preston head --anchor hash://sha256/123... is expected to return the most recent provenance log, even if an anchor has been provided and a local pointer to a newly appended provenance log exists:

with

preston history
<hash://sha256/30845fefa4a854fc67da113a06759f86902b591bf0708bd625e611680aa1c9c4> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/b1937f9fb1d84b02f2e0cd6e11018688fd009280394a7c1fd264c10de9b14998> .
<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/pav/hasVersion> <hash://sha256/b1937f9fb1d84b02f2e0cd6e11018688fd009280394a7c1fd264c10de9b14998> .

we'd expect:

preston head --anchor hash://sha256/b1937f9fb1d84b02f2e0cd6e11018688fd009280394a7c1fd264c10de9b14998

to yield:

hash://sha256/30845fefa4a854fc67da113a06759f86902b591bf0708bd625e611680aa1c9c4
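The expected behavior can be sketched outside of Preston by following the prov#wasDerivedFrom statements forward from the anchor. The snippet below is only an illustration of that traversal, not how preston head is implemented: head_from_anchor is a hypothetical helper, and it assumes the statements arrive on stdin in the format printed by preston history above, with whitespace-separated terms.

```shell
# Hypothetical helper: read `preston history` statements on stdin and print the
# newest version reachable from the given anchor via prov:wasDerivedFrom links.
head_from_anchor() {
  awk -v anchor="<$1>" '
    # "<newer> wasDerivedFrom <older> ." : remember which version follows <older>.
    $2 == "<http://www.w3.org/ns/prov#wasDerivedFrom>" { next_of[$3] = $1 }
    END {
      current = anchor
      # Walk forward until no newer version is known; guard against cycles.
      while ((current in next_of) && !(current in seen)) {
        seen[current] = 1
        current = next_of[current]
      }
      gsub(/^<|>$/, "", current)
      print current
    }'
}

# Usage with the history from this comment:
#   preston history | head_from_anchor hash://sha256/b1937f9fb1d84b02f2e0cd6e11018688fd009280394a7c1fd264c10de9b14998
```

If the anchor does not appear as the source of any wasDerivedFrom statement, the sketch prints the anchor itself, which treats the anchor as the most recent version.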