Open emmacasolin opened 2 years ago
Comparing to the NodeGraph
, where buckets have TTLs, the bucket TTLs are eagerly refreshed. Whereas most cache TTLs are lazily refreshed.
But the nodes themselves do not have TTLs, they just a last visited tag.
Therefore gestalt vertexes are similar to nodes in this regard. They will have last visited tags, not TTLs. This is because when the TTL expires, we don't actually want to eagerly "discover" the gestalt vertex.
The discovering protocol is not driven by TTL expiry. It should be driven by kinetic priority determined by proximity and trust and interactivity.
A bucket which is a group of nodes is therefore similar to a gestalt which is group of gestalt vertices.
Note that, since child vertices are added into the queue during discovery, only one vertex (either a node or identity) per Gestalt needs to be added back into the queue for rediscovery.
This is an interesting comment, because that would mean each gestalt is consider uniformly.
However unlike buckets in the node graph. Gestalts themselves have a priority and this is based on proximity.
So imagine your gestalt as the highest priority.
Then directly trusted gestalts as the second priority.
Then transitively trusted gestalts as subsequent priorities.
Therefore your the discovery loop would prefer discovering your own gestalt first. But not only the initial discovery order. But also preferring updates from your own gestalt first.
This would ensure that we get updates from our own gestalt quickly, but not really care about updates from gestalts completely unrelated to us.
Discovery is about discovery cryptolinks between nodes and identities within a gestalt and across gestalts (based on the gestalts that we trust).
At the same time, the user will drive this by triggering a discovery on another identity.
However gestalts themselves would not have a TTL. Instead it would be the gestalt vertices that would have a last updated tag.
We will need to index by the last updated tag. So there will need to be a DB index for the last update.
When we decide to discover a gestalt. We may prefer to to a vertices that have the oldest update tag.
Maybe we can use this: https://github.com/MatrixAI/Polykey/issues/329#issuecomment-1223590432
Where each gestalt can have a "default priority" and some gestalts have higher default priorities.
Let's imagine all the gestalts are structured like a tree.
At the root of the tree, we have our own gestalt.
Then our immediate children are the gestalts that we trust... and so on. Actually this is a DAG, not at tree because 2 of our child gestalts may trust each other, or trust a common third party.
Now the order should then go from us to our immediate children (breadth-first). Topological sort may have some relevance here: https://en.wikipedia.org/wiki/Topological_sorting
The key point though, is that our gestalts are given a default priority. That default priority is a "weight" that decrements as the gestalt gets farther away from us. So suppose a degree N gestalt is N hops away from us, then it would have weight Max - N. Actually let's invert this number, so that 0
is the highest priority and N
is some finite number away from 0
. So then its weight would be N
.
We should factor in fairness here too... so at the very least, we will eventually get to discovering other gestalts.
But there will be a limit, and that is a computational limit.
Alternatively we do something simpler.
We only care about our own gestalt, and also the gestalt of degree 1. Which is the gestalts that we are connected to.
At the same time the user may wish to crawl some random other gestalt. Then we add that into our discovery queue. At that point, we would only crawl itself, and not its children unless prompted to crawl that child gestalts.
Therefore we don't actually auto-crawl beyond one gestalt away from us.
This means that we have a "loop" for deciding what gestalts to focus on. And a "loop" deciding what vertexes to focus on within a gestalt.
If that's the case, we can just use the static priority. Highest priority goes to any user-driven gestalt. Then to our own gestalt. Then finally to all gestalts that we trust.
To decide which vertex to visit, we would choose the last updated vertex in a gestalt.
Then if we get a list of child vertices of that vertex, we would then decide based on what was the last updated.
We would prefer vertexes that have never been updated/visited.
So then it is new vertices in the order we find them, then by age of existing vertices that we already know.
This means TaskPriority
only is used when working between gestalts. For individual vertex jobs, they would just inherit priority.
Finally, we can eventually a governor for power control, but we can do that later.
This would require a separate PR, it's an independent problem.
@tegefaulkes this issue currently should be used to tackle the fact that Discovery as merged with MatrixAI/Polykey#446 currently maintains a visited tag but this does not get refreshed, so we only ever crawl the GG once.
There are 2 ways to deal with this:
Derived state is important here because if the GG deletes state, this would not necessarily translate to deleting state in discovery.
Currently in our KV DB, we don't really have the concept of "foreign keys". Meaning that when we create a new domain to refer to state in another domain, that state is loose. If the upstream state is mutated or deleted, this does not affect the downstream state.
The usual way we deal with this is that the downstream domain ends up encapsulating the mutation operations of the upstream domain. Like we do with GG and ACL.
However this doesn't always make sense, such as with Discovery and GG.
I don't want to implement foreign keys in our DB, that gets us too close to trying to re-implement SQL databases... and we are using rocksdb key-value for a reason.
Since we are doing application-level indexing, we might as well do application-level foreign keys.
To do this, we will need push-based dataflow to allow the discovery state to be reactive to changes in the GG state. This only makes sense to do when the changes don't need to be atomic.
So if discovery state like last visited... etc can be eventually consistent, then we can use push-dataflow to make this work.
This means this design will depend on MatrixAI/Polykey#444.
This is now an epic encompassing alot of discovery related issues in terms of revisiting logic and error handling and testing.
Quoting https://github.com/MatrixAI/Polykey/issues/493#issuecomment-1319606178
I'm starting to see that having a separate
gestalts
subcommands would be useful.pk nodes add pk nodes find pk nodes ping pk nodes getall pk nodes connections pk nodes claim <- claim only node pk identities authenticate pk identities authenticated pk identities search pk identities claim <- claim only identity pk gestalts discover pk gestalts get pk gestalts list pk gestalts claim <- claim either identity/node pk gestalts invite <- invite node to a gestalt pk gestalts uninvite <- uninvite node to a gestalt pk gestalts permissions pk gestalts allow pk gestalts disallow pk gestalts trust <- short-hand for allow for trust pk gestalts untrust <- short-hand for disallow for trust
There are a few ambiguities here.
- The
claim
subcommands can be referring to node or identity or both.- The
invite
is likepk vaults share
anduninvite
is likepk vaults unshare
. It's combination of setting a permission AND sending a notification.- The
trust
anduntrust
is just a short-hand ofallow
anddisallow
We can sort this out in the future issue.
I think the
pk gestalts trust
andpk gestalts untrust
should be removed, it's a bit wishy washy and the genericallow
is sufficient for all the kinds of permissions.
Also cross posted to MatrixAI/Polykey-CLI#6.
As of recent changes, the visitiedVerticies
set has been replaced with a persistent tracking of the last time a vertex was visited. So Half of the tasks here have been addressed.
This is also an epic, so it's not technically completed until all the children are addressed as well. In any case, for this issue specifically we still require re-discovery to be implemented which is being worked on in a branch currently. Some small things still need to be solved for that.
I had an idea that we can adapt priority algorithms from the opposite of cache eviction algorithms.
In particular the queue of things to discover should be influenced by:
Now what does recent and frequent here mean? Well If I'm interacting with gestalt Y frequently, then gestalt Y vertices should be prioritised for discovery. If I'm interacting with gestalt Z recently, it should be prioritised for discovery. Balancing between the 2 is basically a sort of ARC algorithm but used for prioritisation.
Interactions in this sense relate to the other operations between PK gestalts. Like vault sharing and so on. This should align with the small-world network idea. Where in most cases you're mostly interested in first-order connections, and even then a subset by which you're actually interacting with. This seems similar to recommendation algorithms used by social media networks.
Specification
Unattended Discovery was implemented in MatrixAI/Polykey#320, however, nodes/identities that have been previously discovered are currently stored in a set of
visitedVertices
and cannot be re-added to the queue for discovery. While we don't want to add these vertices back into the queue immediately, we need to design some sort of policy for revisiting these vertices when the discovery queue is empty (and/or after a certain amount of time has passed), so that new claims are reflected in the Gestalt Graph for existing Gestalts. This could potentially be achieved by adding TTLs or alastVisited
tag to vertices.Note that, since child vertices are added into the queue during discovery, only one vertex (either a node or identity) per Gestalt needs to be added back into the queue for rediscovery.
Additional context
Tasks
lastVisited
tag to gestalt vertices