bio-guoda / preston

a biodiversity dataset tracker
MIT License

facilitate anchor/provenance id management #201

Open jhpoelen opened 2 years ago

jhpoelen commented 2 years ago

preston is a content-addressable graph with a version control system. Similar to how git is a content-addressable filesystem with a version control system user interface written on top of it:

from https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain

Now that you’re here, let’s get started. First, if it isn’t yet clear, Git is fundamentally a content-addressable filesystem with a VCS user interface written on top of it. You’ll learn more about what this means in a bit.

Currently, Preston helps keep track of versions via a forward index implemented as a simple hexastore.

Also, with recent changes, you can list the provenance of a provided version using:

preston origins --anchor hash://sha256/05a877bdb8617144fe166a13bf51828d4ad1bc11631c360b9e648a9f7df2bbcd 

yielding

<hash://sha256/05a877bdb8617144fe166a13bf51828d4ad1bc11631c360b9e648a9f7df2bbcd> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/7efdea9263e57605d2d2d8b79ccd26a55743123d0c974140c72c8c1cfc679b93> .
<hash://sha256/7efdea9263e57605d2d2d8b79ccd26a55743123d0c974140c72c8c1cfc679b93> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/b83cf099449dae3f633af618b19d05013953e7a1d7d97bc5ac01afd7bd9abe5d> .
<hash://sha256/b83cf099449dae3f633af618b19d05013953e7a1d7d97bc5ac01afd7bd9abe5d> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/c253a5311a20c2fc082bf9bac87a1ec5eb6e4e51ff936e7be20c29c8e77dee55> .
<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/pav/hasVersion> <hash://sha256/c253a5311a20c2fc082bf9bac87a1ec5eb6e4e51ff936e7be20c29c8e77dee55> .

where hash://sha256/05a877bdb8617144fe166a13bf51828d4ad1bc11631c360b9e648a9f7df2bbcd is the content id of the Preston archive.

The origins of the archive version (or graph version) are computed from the content id pointing to the version by first resolving the version content, and then resolving the content ids related to the "used" relations.

The forward index is generated when new versions are appended to existing ones. So, whereas the provenance of a known graph version can be computed, the forward index is used to explore which other versions exist (now or in the future) that use the known graph version. Here the latest known graph version is stored in the forward index, whereas the back index (or known provenance) of a graph version is embedded in the content itself. Obviously, you can only reference things that exist, and can only guess which as-yet-unknown (or future) graph versions may use our current graph version. The forward index was introduced as a way to define a common starting point across all graphs to help facilitate traversal/discovery of (as-yet-unknown) graph versions.
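To make the backward direction concrete, here is a minimal Python sketch that follows wasDerivedFrom statements (like the ones listed above) from an anchor back to the oldest known version. The helper names parse and walk_origins and the shortened ids are assumptions for illustration, and the sketch reads all statements from one string, whereas Preston resolves each link from the version content itself.

```python
import re

# Hypothetical, shortened ids standing in for full sha256 content ids.
STATEMENTS = """\
<hash://sha256/ddd> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/ccc> .
<hash://sha256/ccc> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/bbb> .
<hash://sha256/bbb> <http://www.w3.org/ns/prov#wasDerivedFrom> <hash://sha256/aaa> .
"""

DERIVED_FROM = "http://www.w3.org/ns/prov#wasDerivedFrom"
TRIPLE = re.compile(r"<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s+\.")


def parse(text):
    """Return (subject, predicate, object) tuples for each well-formed line."""
    return [m.groups() for m in map(TRIPLE.match, text.splitlines()) if m]


def walk_origins(anchor, triples):
    """Follow wasDerivedFrom links backward from anchor; stop at a dead end."""
    previous = {s: o for s, p, o in triples if p == DERIVED_FROM}
    chain, seen = [anchor], {anchor}
    while chain[-1] in previous and previous[chain[-1]] not in seen:
        chain.append(previous[chain[-1]])
        seen.add(chain[-1])
    return chain


print(walk_origins("hash://sha256/ddd", parse(STATEMENTS)))
```

The backward walk needs nothing but the anchor and the content it links to. Going the other way (which versions use a given version) cannot be answered from that content, which is the gap the forward index fills.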

jhpoelen commented 2 years ago

Also, note that many different forward indexes may lead to a specific graph version. But, only a single provenance graph of a specific graph version exists.

jhpoelen commented 2 years ago

And, for convenience, it might be handy for the command-line tool to remember the last versions that were referenced.

For instance, let's say I cloned a previously published Preston archive using:

preston clone --anchor hash://sha256/123abc . . . 

and wanted to append a new version to this, I'd have to say:

preston track "https://example.org" --anchor hash://sha256/abc123...  

but, really, I'd want to say:

preston track https://example.org

where the preston tool knows, from its context (previously cloned version hash://sha256/abc123... ) that I'd probably want to append to a recently used version.

jhpoelen commented 2 years ago

Currently, when starting to track something on an uninitialized preston archive, Preston will automatically generate a forward-pointing index starting at urn:uuid:0659a54f-b713-4f86-a917-5be166a14110 (aka "the" biodiversity graph). And, currently, when cloning an archive from a specific version, this forward-pointing index is not generated, but the provenance can be computed from the securely linked content. So, after cloning from a specific version, you'd have to explicitly set the version to append to when adding (tracked) content to the biodiversity graph.

jhpoelen commented 2 years ago

Internally, Git keeps track of equivalent pointers by stashing them into the .git/refs folder, and using a .git/HEAD file to point to the currently selected pointer.

For instance, for a recent GloBI source code git repo, we have:

$ find .git/refs/
.git/refs/
.git/refs/remotes
.git/refs/remotes/origin
.git/refs/remotes/origin/main
.git/refs/remotes/origin/HEAD
.git/refs/heads
.git/refs/heads/main
.git/refs/tags
.git/refs/tags/v0.22.0
.git/refs/tags/v0.24.0
...

with

$ cat .git/HEAD 
ref: refs/heads/main

with

$ cat .git/refs/heads/main 
b44b98c9dcaf4de4c02bcd2d70af02c97a3df793

@mielliott what do you think about keeping provenance log ids around in some .preston folder using the git approach: create some HEAD file that points to a named something (e.g., branch, tag), where that named something contains the hash identifying the version?

Alternatively, we can reuse the forward-pointing index, recreating it when needed on cloning from a specific Preston version.
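A rough Python sketch of the git-style layout asked about above, assuming a hypothetical .preston folder with a HEAD file and refs/heads/<name> files. None of these paths or helper names exist in Preston today; they only mirror the .git/HEAD and .git/refs listing shown earlier.

```python
import tempfile
from pathlib import Path


def write_head(root, name, version):
    """Store version under refs/heads/<name> and point HEAD at that name."""
    ref = root / "refs" / "heads" / name
    ref.parent.mkdir(parents=True, exist_ok=True)
    ref.write_text(version + "\n")
    (root / "HEAD").write_text(f"ref: refs/heads/{name}\n")


def read_head(root):
    """Resolve HEAD to a version id, or None when nothing was recorded."""
    head = root / "HEAD"
    if not head.exists():
        return None
    target = head.read_text().strip()
    if target.startswith("ref: "):
        ref = root / target[len("ref: "):]
        return ref.read_text().strip() if ref.exists() else None
    return target  # HEAD holds a version id directly


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / ".preston"  # hypothetical folder name
    print(read_head(root))  # nothing recorded yet
    write_head(root, "main", "hash://sha256/abc123...")
    print(read_head(root))
```

With something like this in place, a command such as preston track could fall back to read_head when no --anchor is given.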

mielliott commented 2 years ago

I like the HEAD idea, as long as preston keeps a list of, at the minimum, all known heads of "forward indexes" (i.e., leaf nodes in the global provenance graph), so that stuff in data/ stays discoverable by telling preston to switch heads/branches.

We may also need to keep forward indexes around, not just the heads. Suppose we want to regenerate a forward index using preston origins - if a provenance log "used" two different previous logs (i.e. a "merge" happened), how does preston decide which one is used to build the forward index?

Note that, if we start using "heads" as entry points into the provenance graph, I think a forward index is only needed to preserve current behavior of commands like preston ls and preston history. This is a big ol' can of worms though. I vote for whatever requires the least amount of work without risking loss of content discoverability.

jhpoelen commented 2 years ago

@mielliott thanks for sharing. I am pacing around the room trying to figure out how to implement some intuitive way to keep track of prov log versions.

In my mind, the forward indexes are designed to find some leaf node in the global (universal?) provenance graph.

The leaf nodes are then used to:

(1) append a new provenance log to (e.g., preston track) or,

(2) discover the origins of the provenance log (e.g., preston origins, preston ls). These origins include related provenance logs and other content via their securely embedded references.

In other words, (1) moves forward and updates the HEAD (a read/write operation), and (2) looks backward given some HEAD (a read-only operation).

jhpoelen commented 2 years ago

Historically, we had preston history and preston ls list provenance in chronological order, starting at the "big bang", also known as urn:uuid:0659a54f-b713-4f86-a917-5be166a14110 .

However, it may make more sense (and be more secure) to make preston history display in reverse chronological order, starting with the most recent provenance log, and ending with the "big bang", or some other dead end (e.g., some unresolved provenance log).

so instead of having preston history do:

from big bang, list all linked provenance logs

you'd have:

P1. find HEADs - If not explicitly provided, resolve one (or more) most recent heads relevant in the current context (using forward indexes or other mechanism)

P2. list provenance of HEAD, starting at HEAD

and for appending a new provenance log via preston track or preston append, you'd have:

A1. find HEADs (same as P1) (read-only)

A2. use HEAD reference(s) in new provenance log (write-append)

A3. update HEAD after closing provenance log (write-append)
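The P and A steps above can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions: the Archive class, its fields, and the log-N ids are hypothetical, and a single HEAD is assumed even though the steps allow for more than one.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Archive:
    """Toy archive: each provenance log id maps to the log it was derived from."""
    previous: dict = field(default_factory=dict)
    head: Optional[str] = None

    def history(self, head=None):
        """P1 + P2: resolve a HEAD, then list logs newest first."""
        current = head or self.head          # P1: explicit head wins
        listed = []
        while current is not None and current not in listed:
            listed.append(current)           # P2: start at HEAD, walk back
            current = self.previous.get(current)  # None marks a dead end
        return listed

    def append(self, new_log):
        """A1-A3: find HEAD, reference it from the new log, then move HEAD."""
        self.previous[new_log] = self.head   # A1 + A2 (None for the first log)
        self.head = new_log                  # A3
        return new_log


archive = Archive()
for log_id in ("log-1", "log-2", "log-3"):
    archive.append(log_id)

print(archive.history())         # newest first
print(archive.history("log-2"))  # explicitly provided head
```

Only append touches the head; history is read-only, matching the read/write split described earlier in the thread.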

mielliott commented 2 years ago

However, it may make more sense (and be more secure) to make preston history display in reverse chronological order, starting with the most recent provenance log, and ending with the "big bang", or some other dead end (e.g., some unresolved provenance log).

I'm all for this, and I think this is the future. And I think it's more intuitive for new folks. More often than not, users are more concerned with the most recent stuff rather than retracing the world from the big bang. Change is scary though... and induces a lot of pacing around the room

jhpoelen commented 2 years ago

And, for keeping track of HEAD - we'd have one or more implementations: the current forward index and a more simplistic HEAD file with a content id in it. Whenever a HEAD is updated on an uninitialized read-write preston archive, a new index is created (either a new version following big bang, or populating a HEAD). Whenever a HEAD is updated that does not connect to the pre-existing HEADs (e.g., trying to explicitly add to a provided provenance log id which is not the same as the locally known HEAD), an error is thrown: "merge conflict: local head [hash://sha256/abc123...] is different than provided head [hash://sha256/def456...]" .
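A minimal Python sketch of the merge conflict rule described above. The update_head function and MergeConflict exception are hypothetical names, and "does not connect to the pre-existing HEADs" is simplified here to a plain equality check against one local head.

```python
class MergeConflict(Exception):
    """Raised when a provided head does not match the locally known head."""


def update_head(local_head, provided_head, new_head):
    """Return the new head, or raise when provided and local heads differ."""
    if local_head is None:
        # Uninitialized archive: the first update simply populates the head.
        return new_head
    if provided_head is not None and provided_head != local_head:
        raise MergeConflict(
            f"merge conflict: local head [{local_head}] "
            f"is different than provided head [{provided_head}]"
        )
    return new_head


print(update_head(None, None, "hash://sha256/abc123..."))
try:
    update_head("hash://sha256/abc123...", "hash://sha256/def456...", "hash://sha256/new")
except MergeConflict as error:
    print(error)
```

A fuller check would walk the provenance of the provided head to see whether it connects to a local head, rather than requiring the two to be identical.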

@mielliott I'd start with the most simplistic implementation (re-using the existing forward index), and we can add other methods to help keep or improve the discoverability of the universal knowledge graph. (need a better word for this now that the name Teddy-verse has been rejected by both Ted N. and Jose F.).

jhpoelen commented 2 years ago

@mielliott curious to hear any additional thoughts you may have, including, "hey, why not do this or that?"

jhpoelen commented 2 years ago

btw - an argument against HEAD is that the approach does not support an append-only approach... and needs to overwrite existing content instead of appending to that hexastore index as separate, and uniquely named files.

jhpoelen commented 2 years ago

can't erase the Teddy-verse right?

mielliott commented 2 years ago

needs to overwrite existing content instead of appending to that hexastore index as separate, and uniquely named files.

I have no beef with editing hexastore index files -- they aren't identified by their content, so they aren't really representing a point in the Teddy-verse. They answer a question in a specific context, and if that context (e.g. head) changes over time, I might argue that the answer should too.

Just a minute, need to think more about what you've proposed for HEAD

mielliott commented 2 years ago

if that context (e.g. head) changes over time

or space, e.g. different preston archives can store different content for the first index file, hash://sha256/2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a

if question-identified (hexastore) content can vary in space, maybe it can vary in time too

Maybe the argument here is that the hexaverse =/= the teddyverse. Although snapshots of the hexaverse could be stored in the teddyverse, we currently don't do that. So, the hexaverse and teddyverse share the same identifier space (all possible hashes and hash algorithms), but are otherwise different things

jhpoelen commented 2 years ago

Yes, I agree that the hexaverse (or totle-verse? after Aristotle) is apples compared to the Teddy-verse as pears.

Perhaps this idea of rewriting the totle-verse across space and time is intuitive. And, the totle-verse keeps valuable context on what questions were asked, or what topics were carved out of the Teddy-verse. For instance, for a publication, I'd say that you'd want to fix the totle-verse over space, and append over time. And perhaps for a local workspace, you'd want to nuke the totle-verse explicitly if your interests switch from topic A to topic B.

Perhaps a totle-verse reset can be expressed as:

preston reset --> deletes totle-verse

or

preston reset --anchor hash://sha256/abc123... to create a totle-verse initially populated with hash://sha256/abc123... as the answer to the question: what is all knowledge? Or, where did knowledge begin here?

mielliott commented 2 years ago

or totle-verse? after Aristotle

Should we ask Aristotle what he thinks of this name? Do we dare?

preston reset --anchor hash://sha256/abc123... to create a totle-verse initially populated with hash://sha256/abc123... as the answer to the question: what is all knowledge? Or, where did knowledge begin here?

Does this look like setting the content of the first index entry, hash://sha256/2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a? Then

$ preston reset hash://sha256/abc123...
$ preston get hash://sha256/2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a
hash://sha256/abc123...

For example, for this amazon archive:

$ preston history
<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/pav/hasVersion> <hash://sha256/d8f72bd865686e99eac413b36d198fd15f305966d2864091320f1868279451ff> .
<hash://sha256/d7b73e3472d5a1989598f2a46116a4fc11dfb9ceacdf0a2b2f7f69737883c951> <http://purl.org/pav/previousVersion> <hash://sha256/d8f72bd865686e99eac413b36d198fd15f305966d2864091320f1868279451ff> .
<hash://sha256/1aa34112ade084ccc8707388fbc329dcb8fae5f895cb266e3ad943f7495740b3> <http://purl.org/pav/previousVersion> <hash://sha256/d7b73e3472d5a1989598f2a46116a4fc11dfb9ceacdf0a2b2f7f69737883c951> .
<hash://sha256/d20deca846391aec439ca5dd04dc8e996229921d21e11ef2ca6666d8798b160d> <http://purl.org/pav/previousVersion> <hash://sha256/1aa34112ade084ccc8707388fbc329dcb8fae5f895cb266e3ad943f7495740b3> .
<hash://sha256/49c7a66a6b3c5507f9c1519791ef17b795a6e6e52a5e0cd6dd10866dcd1c51d7> <http://purl.org/pav/previousVersion> <hash://sha256/d20deca846391aec439ca5dd04dc8e996229921d21e11ef2ca6666d8798b160d> .
<hash://sha256/6d924b3cc007cdb2fd78eab535dd9102563ebdddf4e0e30b00b50bde555f5e68> <http://purl.org/pav/previousVersion> <hash://sha256/49c7a66a6b3c5507f9c1519791ef17b795a6e6e52a5e0cd6dd10866dcd1c51d7> .
<hash://sha256/e9ede0e9f18b4b13694cf9efa373a1903a4af7e74004c0e95b42ed200a95db0a> <http://purl.org/pav/previousVersion> <hash://sha256/6d924b3cc007cdb2fd78eab535dd9102563ebdddf4e0e30b00b50bde555f5e68> .
<hash://sha256/0a2035a61176d7bcafbf0e500152d25ba45a15ebbfdfb03c655657a94c798681> <http://purl.org/pav/previousVersion> <hash://sha256/e9ede0e9f18b4b13694cf9efa373a1903a4af7e74004c0e95b42ed200a95db0a> .

$ preston reset hash://sha256/49c7a66a6b3c5507f9c1519791ef17b795a6e6e52a5e0cd6dd10866dcd1c51d7

$ preston history
<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/pav/hasVersion> <hash://sha256/49c7a66a6b3c5507f9c1519791ef17b795a6e6e52a5e0cd6dd10866dcd1c51d7> .
<hash://sha256/6d924b3cc007cdb2fd78eab535dd9102563ebdddf4e0e30b00b50bde555f5e68> <http://purl.org/pav/previousVersion> <hash://sha256/49c7a66a6b3c5507f9c1519791ef17b795a6e6e52a5e0cd6dd10866dcd1c51d7> .
<hash://sha256/e9ede0e9f18b4b13694cf9efa373a1903a4af7e74004c0e95b42ed200a95db0a> <http://purl.org/pav/previousVersion> <hash://sha256/6d924b3cc007cdb2fd78eab535dd9102563ebdddf4e0e30b00b50bde555f5e68> .
<hash://sha256/0a2035a61176d7bcafbf0e500152d25ba45a15ebbfdfb03c655657a94c798681> <http://purl.org/pav/previousVersion> <hash://sha256/e9ede0e9f18b4b13694cf9efa373a1903a4af7e74004c0e95b42ed200a95db0a> .

Or does it also delete the entire old index, so

$ preston reset hash://sha256/49c7a66a6b3c5507f9c1519791ef17b795a6e6e52a5e0cd6dd10866dcd1c51d7

$ preston history
<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/pav/hasVersion> <hash://sha256/49c7a66a6b3c5507f9c1519791ef17b795a6e6e52a5e0cd6dd10866dcd1c51d7> .

This second option seems more useful. But then there's a third option that is more git-like, where preston reset would rollback history instead of changing the big bang, i.e. delete everything in the index that comes after the specified hash. In this case we'd get

$ preston reset hash://sha256/49c7a66a6b3c5507f9c1519791ef17b795a6e6e52a5e0cd6dd10866dcd1c51d7

$ preston history
<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/pav/hasVersion> <hash://sha256/d8f72bd865686e99eac413b36d198fd15f305966d2864091320f1868279451ff> .
<hash://sha256/d7b73e3472d5a1989598f2a46116a4fc11dfb9ceacdf0a2b2f7f69737883c951> <http://purl.org/pav/previousVersion> <hash://sha256/d8f72bd865686e99eac413b36d198fd15f305966d2864091320f1868279451ff> .
<hash://sha256/1aa34112ade084ccc8707388fbc329dcb8fae5f895cb266e3ad943f7495740b3> <http://purl.org/pav/previousVersion> <hash://sha256/d7b73e3472d5a1989598f2a46116a4fc11dfb9ceacdf0a2b2f7f69737883c951> .
<hash://sha256/d20deca846391aec439ca5dd04dc8e996229921d21e11ef2ca6666d8798b160d> <http://purl.org/pav/previousVersion> <hash://sha256/1aa34112ade084ccc8707388fbc329dcb8fae5f895cb266e3ad943f7495740b3> .
<hash://sha256/49c7a66a6b3c5507f9c1519791ef17b795a6e6e52a5e0cd6dd10866dcd1c51d7> <http://purl.org/pav/previousVersion> <hash://sha256/d20deca846391aec439ca5dd04dc8e996229921d21e11ef2ca6666d8798b160d> .

I mention the third option because that's how git reset x operates, and keeping some symmetry between git behavior and preston behavior might make preston easier to use for newcomers. It's solving a whole different problem though, e.g. it gives users an undo button to fix mistakes. If I do preston track https://google.co, I might want to undo that and do preston track https://google.com instead. Deletion is scary, but at the same time, preston commands that append to the prov log can be even scarier because eventually I'm gonna make mistakes and build up garbage in the log.

Sorry for the tangent. I guess I'm proposing to reword the command for your use case to something like preston restart etc.
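The three candidate behaviors for preston reset discussed above differ only in which part of the version list survives. A small Python sketch, treating the index as an ordered list of version ids (oldest first); the function names and the v1..v8 ids are hypothetical stand-ins, not Preston commands or real hashes.

```python
def reset_first_entry(history, target):
    """Option 1: the first index entry points at target; later entries stay."""
    return history[history.index(target):]


def reset_drop_index(history, target):
    """Option 2: the old index is deleted; only target remains."""
    if target not in history:
        raise ValueError(f"unknown version: {target}")
    return [target]


def reset_rollback(history, target):
    """Option 3 (git-like): delete everything that comes after target."""
    return history[: history.index(target) + 1]


versions = ["v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8"]

print(reset_first_entry(versions, "v5"))
print(reset_drop_index(versions, "v5"))
print(reset_rollback(versions, "v5"))
```

Options 1 and 2 change where the index starts (the "big bang"), while option 3 keeps the start and changes where it ends, which is why it works as an undo.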

mielliott commented 2 years ago

In any case, when the index is revised, I think it would be good to keep the old index around somehow for the sake of keeping stuff in data/ discoverable.

And as for the attractiveness of append-only stores, I think it's also worth noting that as preston has grown up, it has enabled a lot of non-archival-related use cases, where clean-up of cached data might be desirable/necessary. Especially cleaning up stuff in data/ that is not discoverable, e.g. when doing preston cat --remote and storing stuff that isn't mentioned in any *locally discoverable* provenance logs. Not trying to go on a tangent about a preston clean command, I just think this is very relevant to the issue of "anchor/provenance id management" and its use cases

mielliott commented 2 years ago

One more comment, then I gotta switch gears to other stuff.

In my mind, the forward indexes are designed to find some leaf node in the global (universal?) provenance graph.

The leaf nodes are then used to:

(1) append a new provenance log to (e.g., preston track) or,

(2) discover the origins of the provenance log (e.g., preston origins, preston ls). These origins include related provenance logs and other content via their securely embedded references.

In other words, (1) moves forward and updates the HEAD (a read/write operation), and (2) looks backward given some HEAD (a read-only operation).

I fully agree with this. The one unique use case for forward indexes is to discover new stuff, i.e. to ask another archive "do you have a newer node in the biodiversity graph?" Other than that, keeping a head pointer and tracing the graph in reverse (preston origins) can do everything else that the forward indexes can/currently do. The one hairy issue I can think of is ambiguity in recreating a forward index, e.g. when preston origins runs into a fork, which version should be used to create the index? So, I propose a solution: include all of them. The forward index then becomes a tree. If preston history and preston ls are changed to operate in reverse, then the forward index no longer needs to represent a single path in the biodiversity graph.

Maybe we have a preston fetch --remote (or whatever) command. Its job is to update the forward index (starting from HEAD? Maybe also update HEAD) by asking remote repositories for their index entries.

Repository A's first index entry looks like

hash://sha256/aaa

and repository B's first index entry looks like

hash://sha256/bbb

then running preston fetch --remote https://repo-b.com updates repo A's first index entry to

hash://sha256/aaa
hash://sha256/bbb

the HEAD file can point to any hash in the tree, and the hexastore can be used to find newer nodes in the tree, or update existing entries
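A Python sketch of the index merge described above, assuming each index entry is a list of version ids keyed by the question it answers. The fetch function and the "first-entry" key are hypothetical; in Preston the key would be the hash that identifies the first index entry.

```python
def fetch(local_index, remote_index):
    """Merge remote index entries into a copy of the local index."""
    merged = {key: list(values) for key, values in local_index.items()}
    for key, values in remote_index.items():
        entry = merged.setdefault(key, [])
        for value in values:
            if value not in entry:  # keep existing answers, add new ones once
                entry.append(value)
    return merged


repo_a = {"first-entry": ["hash://sha256/aaa"]}
repo_b = {"first-entry": ["hash://sha256/bbb"]}

print(fetch(repo_a, repo_b))
```

Because entries are only ever added, fetching from several remotes in any order yields the same set of answers per entry, so the index grows into a tree without conflicts.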

jhpoelen commented 2 years ago

@mielliott thanks for sharing your notes.

I like your idea to try and reuse existing git slang/behavior by implementing the 3rd proposed preston reset hash://sha256/abc123... .

I also like your "preston clean" idea. Much like doing a copy using preston cp some/path and then nuking the old location, leaving the unlinked/ unreferenced data behind (for better or worse).

I also see parallels with preston verify in that preston clean would try and visit the provenance logs and their linked content.

One use case I am not clear on yet -

What would you expect the local index to be after doing:

preston clone https://linker.bio --anchor hash://sha256/abc123...

?

I am currently leaning towards:

<big bang> <hasVersion> <hash://sha256/abc123...> .

where you can still do:

preston origins hash://sha256/abc123...

to uncover the origins of the archive.

mielliott commented 2 years ago

One use case I am not clear on yet -

What would you expect the local index to be after doing:

I am currently leaning towards: ...

I think that's great! Then, preston clone https://linker.bio, without a specified anchor, will clone the whole thing, index and all

where you can still do:

preston origins hash://sha256/abc123...

to uncover the origins of the archive.

Love it!

jhpoelen commented 2 years ago

good stuff. I'll try and motivate myself to take a stab at implementing all this.

mielliott commented 2 years ago

We commented right after each other, did you see https://github.com/bio-guoda/preston/issues/201#issuecomment-1312167525? I think changing the hexastore index this way paves the way for easy branch management, and a clearer idea of how the HEAD pointer can work

jhpoelen commented 2 years ago

yes, I did see https://github.com/bio-guoda/preston/issues/201#issuecomment-1312167525 .

And, I agree that recreating a forward index can be a bit dicey. I propose to not try and recreate a forward index when none is available. We can always change this behavior, because we wouldn't lose any provenance / linked content by omitting this index for now.

Are you ok with that?

mielliott commented 2 years ago

I propose to not try and recreate a forward index when none is available.

Agreed. We don't need to recreate the forward index from provenance, but it's good to remember that all the information needed to do so is there. I suppose the exciting idea, for me, was more about updating and merging indexes without causing conflicts. It's fun to imagine running preston clone --remote [url1] and then preston clone --remote [url2] and getting all the data from both places without running into any index conflicts. Then HEAD indicates which branch to append to etc.

jhpoelen commented 2 years ago

Ok, sounds like we'll leave the fixing of merge conflicts in the Totle-verse for some other time. I'll keep you posted on my progress.

mielliott commented 2 years ago

Sounds good