MatrixAI / Polykey

Polykey Core Library
https://polykey.com
GNU General Public License v3.0
31 stars 4 forks source link

Discovery - revisiting Gestalt Vertices and error handling #328

Open emmacasolin opened 2 years ago

emmacasolin commented 2 years ago

Specification

Unattended Discovery was implemented in MatrixAI/Polykey#320, however, nodes/identities that have been previously discovered are currently stored in a set of visitedVertices and cannot be re-added to the queue for discovery. While we don't want to add these vertices back into the queue immediately, we need to design some sort of policy for revisiting these vertices when the discovery queue is empty (and/or after a certain amount of time has passed), so that new claims are reflected in the Gestalt Graph for existing Gestalts. This could potentially be achieved by adding TTLs or a lastVisited tag to vertices.

Note that, since child vertices are added into the queue during discovery, only one vertex (either a node or identity) per Gestalt needs to be added back into the queue for rediscovery.

Additional context

Tasks

  1. Add TTLs or a lastVisited tag to gestalt vertices
  2. Design a policy for revisiting vertices - this should consider the current length of the queue, the vertices currently in the queue, and the amount of time since a given vertex was in the queue
CMCDragonkai commented 2 years ago

Comparing to the NodeGraph, where buckets have TTLs, the bucket TTLs are eagerly refreshed. Whereas most cache TTLs are lazily refreshed.

But the nodes themselves do not have TTLs, they just a last visited tag.

Therefore gestalt vertexes are similar to nodes in this regard. They will have last visited tags, not TTLs. This is because when the TTL expires, we don't actually want to eagerly "discover" the gestalt vertex.

The discovering protocol is not driven by TTL expiry. It should be driven by kinetic priority determined by proximity and trust and interactivity.

CMCDragonkai commented 2 years ago

A bucket which is a group of nodes is therefore similar to a gestalt which is group of gestalt vertices.

Note that, since child vertices are added into the queue during discovery, only one vertex (either a node or identity) per Gestalt needs to be added back into the queue for rediscovery.

This is an interesting comment, because that would mean each gestalt is consider uniformly.

However unlike buckets in the node graph. Gestalts themselves have a priority and this is based on proximity.

So imagine your gestalt as the highest priority.

Then directly trusted gestalts as the second priority.

Then transitively trusted gestalts as subsequent priorities.

Therefore your the discovery loop would prefer discovering your own gestalt first. But not only the initial discovery order. But also preferring updates from your own gestalt first.

This would ensure that we get updates from our own gestalt quickly, but not really care about updates from gestalts completely unrelated to us.

CMCDragonkai commented 2 years ago

Discovery is about discovery cryptolinks between nodes and identities within a gestalt and across gestalts (based on the gestalts that we trust).

At the same time, the user will drive this by triggering a discovery on another identity.

However gestalts themselves would not have a TTL. Instead it would be the gestalt vertices that would have a last updated tag.

We will need to index by the last updated tag. So there will need to be a DB index for the last update.

When we decide to discover a gestalt. We may prefer to to a vertices that have the oldest update tag.

Maybe we can use this: https://github.com/MatrixAI/Polykey/issues/329#issuecomment-1223590432

Where each gestalt can have a "default priority" and some gestalts have higher default priorities.

Let's imagine all the gestalts are structured like a tree.

At the root of the tree, we have our own gestalt.

Then our immediate children are the gestalts that we trust... and so on. Actually this is a DAG, not at tree because 2 of our child gestalts may trust each other, or trust a common third party.

Now the order should then go from us to our immediate children (breadth-first). Topological sort may have some relevance here: https://en.wikipedia.org/wiki/Topological_sorting

The key point though, is that our gestalts are given a default priority. That default priority is a "weight" that decrements as the gestalt gets farther away from us. So suppose a degree N gestalt is N hops away from us, then it would have weight Max - N. Actually let's invert this number, so that 0 is the highest priority and N is some finite number away from 0. So then its weight would be N.

We should factor in fairness here too... so at the very least, we will eventually get to discovering other gestalts.

But there will be a limit, and that is a computational limit.


Alternatively we do something simpler.

We only care about our own gestalt, and also the gestalt of degree 1. Which is the gestalts that we are connected to.

At the same time the user may wish to crawl some random other gestalt. Then we add that into our discovery queue. At that point, we would only crawl itself, and not its children unless prompted to crawl that child gestalts.

Therefore we don't actually auto-crawl beyond one gestalt away from us.


This means that we have a "loop" for deciding what gestalts to focus on. And a "loop" deciding what vertexes to focus on within a gestalt.

If that's the case, we can just use the static priority. Highest priority goes to any user-driven gestalt. Then to our own gestalt. Then finally to all gestalts that we trust.

CMCDragonkai commented 2 years ago

To decide which vertex to visit, we would choose the last updated vertex in a gestalt.

Then if we get a list of child vertices of that vertex, we would then decide based on what was the last updated.

We would prefer vertexes that have never been updated/visited.

So then it is new vertices in the order we find them, then by age of existing vertices that we already know.

CMCDragonkai commented 2 years ago

This means TaskPriority only is used when working between gestalts. For individual vertex jobs, they would just inherit priority.

CMCDragonkai commented 2 years ago

Finally, we can eventually a governor for power control, but we can do that later.

CMCDragonkai commented 2 years ago

This would require a separate PR, it's an independent problem.

CMCDragonkai commented 1 year ago

@tegefaulkes this issue currently should be used to tackle the fact that Discovery as merged with MatrixAI/Polykey#446 currently maintains a visited tag but this does not get refreshed, so we only ever crawl the GG once.

There are 2 ways to deal with this:

  1. Add additional state to the GG, so we add booleans, ttls, last visited.
  2. Create derived state in the discovery domain

Derived state is important here because if the GG deletes state, this would not necessarily translate to deleting state in discovery.

Currently in our KV DB, we don't really have the concept of "foreign keys". Meaning that when we create a new domain to refer to state in another domain, that state is loose. If the upstream state is mutated or deleted, this does not affect the downstream state.

The usual way we deal with this is that the downstream domain ends up encapsulating the mutation operations of the upstream domain. Like we do with GG and ACL.

However this doesn't always make sense, such as with Discovery and GG.

I don't want to implement foreign keys in our DB, that gets us too close to trying to re-implement SQL databases... and we are using rocksdb key-value for a reason.

Since we are doing application-level indexing, we might as well do application-level foreign keys.

To do this, we will need push-based dataflow to allow the discovery state to be reactive to changes in the GG state. This only makes sense to do when the changes don't need to be atomic.

So if discovery state like last visited... etc can be eventually consistent, then we can use push-dataflow to make this work.

This means this design will depend on MatrixAI/Polykey#444.

CMCDragonkai commented 1 year ago

This is now an epic encompassing alot of discovery related issues in terms of revisiting logic and error handling and testing.

CMCDragonkai commented 1 year ago

Quoting https://github.com/MatrixAI/Polykey/issues/493#issuecomment-1319606178

I'm starting to see that having a separate gestalts subcommands would be useful.

pk nodes add
pk nodes find
pk nodes ping
pk nodes getall
pk nodes connections
pk nodes claim         <- claim only node

pk identities authenticate
pk identities authenticated
pk identities search
pk identities claim    <- claim only identity

pk gestalts discover
pk gestalts get
pk gestalts list
pk gestalts claim      <- claim either identity/node
pk gestalts invite     <- invite node to a gestalt
pk gestalts uninvite   <- uninvite node to a gestalt
pk gestalts permissions
pk gestalts allow
pk gestalts disallow
pk gestalts trust      <- short-hand for allow for trust
pk gestalts untrust    <- short-hand for disallow for trust

There are a few ambiguities here.

  1. The claim subcommands can be referring to node or identity or both.
  2. The invite is like pk vaults share and uninvite is like pk vaults unshare. It's combination of setting a permission AND sending a notification.
  3. The trust and untrust is just a short-hand of allow and disallow

We can sort this out in the future issue.

I think the pk gestalts trust and pk gestalts untrust should be removed, it's a bit wishy washy and the generic allow is sufficient for all the kinds of permissions.

Also cross posted to MatrixAI/Polykey-CLI#6.

tegefaulkes commented 7 months ago

As of recent changes, the visitiedVerticies set has been replaced with a persistent tracking of the last time a vertex was visited. So Half of the tasks here have been addressed.

This is also an epic, so it's not technically completed until all the children are addressed as well. In any case, for this issue specifically we still require re-discovery to be implemented which is being worked on in a branch currently. Some small things still need to be solved for that.

CMCDragonkai commented 3 months ago

I had an idea that we can adapt priority algorithms from the opposite of cache eviction algorithms.

In particular the queue of things to discover should be influenced by:

  1. When the user overrides the queue to say I want to discover X.
  2. Most recently used nodes/identities
  3. Most frequently used nodes/identities

Now what does recent and frequent here mean? Well If I'm interacting with gestalt Y frequently, then gestalt Y vertices should be prioritised for discovery. If I'm interacting with gestalt Z recently, it should be prioritised for discovery. Balancing between the 2 is basically a sort of ARC algorithm but used for prioritisation.

CMCDragonkai commented 3 months ago

Interactions in this sense relate to the other operations between PK gestalts. Like vault sharing and so on. This should align with the small-world network idea. Where in most cases you're mostly interested in first-order connections, and even then a subset by which you're actually interacting with. This seems similar to recommendation algorithms used by social media networks.