TREEcg / specification

RDF vocabulary and hypermedia specification to publish your Linked Data using search trees
https://w3id.org/tree/specification
27 stars 12 forks source link

Introduce a subclass of tree:Relation e.g. tree:SubsetRelation to Identify the Minimum spanning tree inside a graph, which could facilitate the search from the perspective of a (tree) client. #85

Closed xdxxxdx closed 3 months ago

xdxxxdx commented 9 months ago

Hope this is a clear explanation of the idea from @sandervd

Introduce a subclass of tree:Relation e.g. tree:SubsetRelation to Identify the Minimum spanning tree inside a graph, which could facilitate the search from the perspective of a (tree) client.
e.g.

image

Diagram -1 As presented in diagram-1, when the Client starts fromtree:Node marked as 1, the client could only follow the tree:SubsetRelation to arrive at all tree:Node.

image

Diagram - 2 If the client starts from the tree:Node marked as 2 (as shown in Diagram - 2), this implies the client is only interested in the node under the subset condition of tree:SubsetRelation marked as 3 (Diagram - 2)

In practice, Current available tree:Relation(s): tree:PrefixRelation , tree:SubstringRelation , tree:SuffixRelation ,tree:GreaterThanRelation, tree:GreaterThanOrEqualToRelation, etc.. can be typed as tree:SubsetRelation when it is the case.

pietercolpaert commented 9 months ago

Doing my own translation of the problem and proposal first before diving deeper into possible other solutions at a later time:

When a client follows their own traversal approach, it is automatically going to prune relations to nodes it already visited. The relations it prunes could then be seen as client-decided non-subsetrelations or the gray relations.

The proposal here is that the server annotates the relations that will create one of the possible trees from the graph. The benefit of this proposal would be that the client can trust that, if it only decides to follow the relations annotated by the server, no parent or sibling relations (gray relations) will be provided, and that

  1. It thus doesn’t need to check whether the relation points at a node it already visited.
  2. The node IRI can be used as a bookmark of where it left off
sandervd commented 9 months ago

The importance here is that by defining a relation as a subset, we introduce transitivity of traversed relations:

For example, say we want to partition a geospatial dataset: We make a tree with 3 sublevels: tree:zoom, tree:longitudeTile and tree:latitudeTile A member is partitioned in zoomlevel 11, long 5 and lat 9. We define the relations as subset relations, so that: (Zoomlevel11_X5_Y9⊆Zoomlevel11_X5)∧(Zoomlevel11_X5⊆Zoomlevel11)∧(Zoomlevel11⊆RootFragment) ⟹Zoomlevel11_X5_Y9⊆Zoomlevel11 As the relations are transitive, we are sure that the semantics of upper levels (relations traversed going down the hierarchy) still apply, without having to restate them in each relation as we move down.

I think we would run into issues with the pruning algorithm of relations when either trees are crawler depth first or are not coherent in structure. For instance, if in the above example a depth first crawl had taken place, and a relation from the Zoomlevel11_X5_Y9 fragment would point to, say Zoomlevel12_X5_Y9, by using the tree:zoomlevel relation. As the tree is walked depth first, only a relation towards Zoomlevel12 has been discovered so far (the relation to Zoomlevel12 is observed on the root node), but neither are fragments Zoomlevel12_X5 or Zoomlevel12_X5_Y9. A client would as such traverse this relation.

If the client plans to walk the whole tree, this is probably not much of an issue. But I do think that it would be best to define which relations are transitive in relation to others. I think defining a relation type hasSubSet is an intuitive construct for doing so, as it allows reasoning about the tree structure independent of the used relations; and makes the implied transitivity of relations explicit. (Right now, following a tree:longitudeTile relation after traversing a tree:zoom relation only implies that the tree:zoom relation is transitive, but this isn't explicit)

pietercolpaert commented 8 months ago

I would argue the transitivity is something that at this moment the client must take into account; but it is not guaranteed by the server, because the server wants to document multiple ways of arriving at a similar node. It’s thus the client that must be smarter, and understand that following a link it already visited of course won’t result in new members to be discovered.

Why we decided to add this in the past:

  1. We thought servers could make mistakes and provide cycles, so we’d have to account for them anyway (I would dare to refute that today: we can solve this using validation)
  2. We wanted to allow search forms and multiple entry points. Depending on the entry point that then would be chosen, the client can make its own search tree from there on.
  3. We wanted to not restrict TREE hypermedia graphs to only search trees: maybe more interesting fragmentations could be thought of that were no strict search trees

However, the benefits of extending the spec somehow to allow for saying: “hey, I’m a server that won’t provide search forms, backlinks or sibling links, so you can just keep your state as the yet to be fetched nodes without keeping the full history of what you’ve already seen” is big. E.g., the issue on state in LDES could benefit a lot from it: https://github.com/SEMICeu/LinkedDataEventStreams/issues/31

While we didn’t want to close the door on other use cases, I thus do believe a server should have a way to indicate to server that it can guarantee a search tree without cycles, so you don’t need to keep a list of all visited pages and members as part of your state. This should lead to a lower memory footprint and potentially a higher performance.

Different options:

1. Annotating the relations, cfr. the SubsetRelations proposal ↑

As proposed by @xdxxxdx, we would somehow annotate the relations with a specific traversal strategy that will be guaranteed to be a search tree by the server. However, in this case I don’t see the benefit of adding sibling or parent relations at all, as you wouldn’t be able to use search forms with this proposal anymore either. I’m thus not a big fan of the proposed solution.

2. Annotating the tree:Node to say it’s a disjoint subgraph from here

When it’s an entry node, you don’t have history yet. If however you follow a link to a node that is again linked using tree:view, then you know you will have to keep a list of visited pages because you will be able to reach all members again from that node.

Instead of tree:view we now have void:subset or dcterms:isPartOf that can be used. They carry poor semantics: it just says it’s a part of the collection, but doesn’t mention anything else.

We could propose a new property, for example tree:disjointPartialView (better property name pending) to indicate the server guarantees that it will not provide links to parents or siblings.

xdxxxdx commented 8 months ago

Hello @pietercolpaert, Thanks so much for considering this GitHub issue. There are a few points I didn’t understand about the definition tree:disjointPartialView.

How to use tree:disjointPartialView?

image

In this graph, the subject of the tree:disjointPartialView is Node 2 or Node 3. If it is Node 2, does the client need to follow the relation defined in Node 2? If it is Node 3, does this mean the client ONLY needs to follow the subject of the tree:disjointPartialView instead of the relation defined in Node2?

How to gurantee search form with tree:disjointParitialView ?

With the entry point of the subject of the tree:disjointParitialView, a strict tree graph needs to be guaranteed by the server side right? How search forms be guaranteed in the Tree structure with tree:disjointPartialView inside? From my perspective, any entry point after the Subject oftree:disjointPartialView cannot provide a possibility to have whole member access anymore.

Conceptually, each deeper downlink means an AND with all previous relations

I am also very confused about this slide https://docs.google.com/presentation/d/1lIZWUaK-iYFlqTpGInXokwPbD-x-rnvjll96bj-9qkQ/edit#slide=id.g27ef07a4a56_0_42. If there is no clear indication saying "Hi, I am the hierarchical relations, how you can know if A ->B->C is a hierarchical continued relation, or it is sibling relations “. **_

Conceptually, each deeper downlink means an AND with all previous relations

_**. Please remind me where this is mentioned in the SPEC, or is it implicit? (posted also on Slack)

Clarification about tree:SubsetRelation

Also for the proposal: image To be clear, we are not proposing only use tree:SubsetRelation(s). The sibling, loopback relations are also open to being added to the graph (by sibling, loopback relations I mean, the relations are not tagged as a tree:SubsetRelation). Then we are open to letting the client choose only the following tree:SubsetRelation, or following all kinds of relations (if interested in a search form).

@sandervd , please feel free to add/modify. Thanks

pietercolpaert commented 8 months ago

@xdxxxdx I’m not entirely sure what you mean here. Let’s have a call about this before the next CG meeting, and present our common understanding there. Afterwards we can update and comment here a motivation for a potential design decision

pietercolpaert commented 6 months ago

During the 11th TREE CG meeting of 2023-12-20 we discussed a possible resolution for this discussion:

Quoting myself from the past:

  1. We thought servers could make mistakes and provide cycles, so we’d have to account for them anyway

I refute that today: we can solve this using validation.

  1. We wanted to allow search forms and multiple entry points. Depending on the entry point that then would be chosen, the client can make its own search tree from there on.

Based on our discussions, we said it would make sense that a search form would jump to a page from which not all members can be reached, but explicitly a subset searched for. I agree with that conclusion. This should be added to the spec on traversing relations.

  1. We wanted to not restrict TREE hypermedia graphs to only search trees: maybe more interesting fragmentations could be thought of that were no strict search trees

The only use case for not having a search tree we saw so far in implementations would be a double linked list (a single linked list is strictly speaking a search tree).

We therefore would make it explicit in the specification for view builders, that tree:Relations MUST only link to other nodes that are not linked from other nodes. Each tree:Node therefore has exactly one other tree:Node that links towards it, except for the tree:RootNode as proposed in #100.

In order to maintain backwards compatibility with existing implementations, we would however make it mandatory to still prune relations to the previous page.

Impacted issue is certainly how clients can keep state to resume later on, as for example is the case in LDES: https://github.com/SEMICeu/LinkedDataEventStreams/issues/31

kkostov commented 5 months ago

We therefore would make it explicit in the specification for view builders, that tree:Relations MUST only link to other nodes that are not linked from other nodes. Each tree:Node therefore has exactly one other tree:Node that links towards it, except for the tree:RootNode as proposed in https://github.com/TREEcg/specification/issues/100.

Isn't this a bit too restrictive? The way I interpret it, it also implies that two collections for example can't share a fragment?

graph TD;
    C1-->B;
    C2-->B;
pietercolpaert commented 5 months ago

Isn't this a bit too restrictive? The way I interpret it, it also implies that two collections for example can't share a fragment?

I think that’s left open

It’s indeed a bit restrictive as you won’t be able to have 2 nodes pointing at the same other node - but I haven’t seen use cases so far that would require that.

pietercolpaert commented 3 months ago

To be proposed in a PR: