Better data structure for the incomplete tree in the database

DCMLab / tree-annotation-code

source code for repository tree-annotation-app

GNU General Public License v3.0

0 stars 2 forks source link

Better data structure for the incomplete tree in the database #6

Closed chfin closed 4 years ago

chfin commented 4 years ago

The incomplete tree should be represented as a list of trees, starting with singleton trees for the leaves.

Database operations should be operations on this data structure. For example, merging takes a subsequence of trees from the list and replaces them with a single tree that has the old roots as direct children of the new root. The node components can either be defined recursively, or be obtained as they are by flattening the tree.

Indexing may be done by giving nodes ids or by a list of numbers corresponding to the index of the selected child at each node.

dharasim commented 4 years ago

It's a good point! I actually started with a representation like this, but it makes it more complicated and more expensive to update the label of an inner node. In the current representation, only the renamed node needs to change. When you have a direct forrest representation, all the parents and children of this inner node need to change, too. That's why I currently store all nodes in a map with coordinate keys and don't use direct pointers for children and parents. Does this make sense?

If we ever port this to Rust to use WebAssembly, we also cannot use doubly-linked data structures because it leads to unclear ownership of the Data 😃

chfin commented 4 years ago

Yes, doubly linked structure doesn't make sense, but that's why I suggested to use indices to nodes instead of pointers. Using these indices, it would be possible to separate the node labels and the forest structure (which is probably a good idea anyway), so you avoid the update problems. But I don't think it would be a problem to update larger parts of the forest on a label update, performance wise.

dharasim commented 4 years ago

Of cause, performance is not an issue here :)

So if we use the node coordinates as indices, then we would keep data such as labels and node states in the nodes map and remove storing the children and parents there. Instead, we add a new list to the db that we call trees in which each tree node is a map with keys :coord, :child-coords, and :parent-coord.

Is this what you mean?

chfin commented 4 years ago

More or less. Yes for the separate entries for labels and structure. But I would store the structure directly as a list of trees and implement combine and remove on that data structure. Since the indices of the nodes change, I would attach IDs to the nodes in the forest and use these IDs as the keys for the label map.

dharasim commented 4 years ago

Ok, I think we mean almost the same thing here. I just don't get why we should IDs that are different from the coordinates of a node. Can you give an example in which a node is not uniquely determined by its coordinates?

chfin commented 4 years ago

No, coordinates should be fine since they should be unique, and can serve as ids. I was talking about indices in the form of lists of numbers, where you pick out a tree from the list by the first number, the child of the tree's root by the second number and so on. This index changes when you change the trees above a node, so you cannot associate it with labels.

The other thing is that we shouldn't links (indices/coordinates/...) to the parents and link to children not by coordinates/indices/ids, but by nesting nodes themselves, i.e. creating an actual tree structure instead of an "adjacency list"-like representation. This prevents the construction of links that violate the forest structure.

dharasim commented 4 years ago

Yes, that makes sense. I was confused about IDs vs. indices.

chfin commented 4 years ago

solved by #11