Tree terminology - Githubissues

sattlerc commented 4 months ago

Here is a discussion about tree terminology. I'd love to hear your thoughts.

When I teach search trees, I start by discussing trees. In that context, that means rooted trees, sometimes also called "arborescences" when the edges are oriented away from the root. I typically use the following terminology:

A (rooted) tree is a directed graph with a designated root node and a unique path from that node to every other node.
The root node is the only node without an incoming edge. Every other node has a unique incoming edge from its parent. Nodes with the same parent are called siblings.
An ancestor of a node n is a node with a path to n.
A descendant of a node n is a node with a path from n.
A child of a node n is a node with an edge from n.
A leaf is a node without children.
A branch is a path from the root to a leaf.
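
The definitions above can be made concrete with a small sketch. The example tree, the dictionary representation and all function names are hypothetical, chosen only for illustration. The sketch assumes that a path has at least one edge, so a node is not counted as its own ancestor or descendant.

```python
from typing import Dict, List

# Hypothetical example tree: each node maps to the list of its children.
# Edges point from a parent to each of its children.
children: Dict[str, List[str]] = {
    "r": ["a", "b"],
    "a": ["c", "d"],
    "b": [],
    "c": [],
    "d": [],
}
root = "r"


def descendants(n: str) -> List[str]:
    """All nodes reachable from n by a path of at least one edge."""
    result: List[str] = []
    for c in children[n]:
        result.append(c)
        result.extend(descendants(c))
    return result


def ancestors(n: str) -> List[str]:
    """All nodes that have a path (of at least one edge) to n."""
    return [m for m in children if n in descendants(m)]


def leaves() -> List[str]:
    """All nodes without children."""
    return [n for n in children if not children[n]]


def branches(n: str = root) -> List[List[str]]:
    """All paths from n to a leaf; starting at the root, these are the branches."""
    if not children[n]:
        return [[n]]
    return [[n] + rest for c in children[n] for rest in branches(c)]
```

Here `leaves()` gives the nodes b, c and d, and `branches()` gives the three paths r-a-c, r-a-d and r-b.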

Then we come to binary trees. Are these trees? Yes, with two different kinds of nodes:

binary nodes with a left and right child,
nullary leaf nodes.
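
As a minimal sketch of this first convention, the two kinds of nodes can be written as two classes. The class names `Nullary` and `Binary`, the integer values and the counting functions are assumptions made for illustration, not taken from the course material.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Nullary:
    """A leaf node: no children and (here) no data."""


@dataclass(frozen=True)
class Binary:
    """A binary node: a value plus exactly two children."""
    left: "Tree"
    value: int
    right: "Tree"


Tree = Union[Nullary, Binary]


def count_leaves(t: Tree) -> int:
    """Number of nullary nodes, that is, leaves in the sense of rooted trees."""
    if isinstance(t, Nullary):
        return 1
    return count_leaves(t.left) + count_leaves(t.right)


def count_binary(t: Tree) -> int:
    """Number of binary nodes."""
    if isinstance(t, Nullary):
        return 0
    return 1 + count_binary(t.left) + count_binary(t.right)


# A tree with two binary nodes and three leaves.
example = Binary(Binary(Nullary(), 1, Nullary()), 2, Nullary())
```

In this representation the tree always has a root node: the empty tree is a single `Nullary` node, which is both root and leaf.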

However, very often, a conflicting terminology is used. If we ask Wikipedia what a binary tree is, it says: it is a tree data structure where every node has at most two children. So the nullary nodes from before are not counted as nodes; only the binary nodes are counted, and they are treated as at-most-binary. Note that it would be wrong to say that a binary tree is a tree where every node has no children, exactly one child, or two (ordered) children: in the case of exactly one child, one still has to distinguish whether it is a left or a right child. A more precise version would be: every node has an optional left child and an optional right child; additionally, the root node is optional.

Note that this terminology loses the connection to rooted trees. A rooted tree always has a root node, but in that terminology the root node is optional.
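
The more precise version of the second convention can be sketched as follows. The class name `Node` and the use of `None` for a missing child are assumptions for illustration; this mirrors the usual null-pointer representation rather than any particular textbook's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: int
    left: Optional["Node"] = None   # optional left child
    right: Optional["Node"] = None  # optional right child


def is_leaf(n: Node) -> bool:
    """Leaf in the second convention: a node with neither child."""
    return n.left is None and n.right is None


# Two different trees, although both roots have exactly one child.
only_left = Node(2, left=Node(1))
only_right = Node(2, right=Node(3))

# The root itself is optional: the empty tree has no node at all.
empty_tree: Optional[Node] = None
```

The two trees `only_left` and `only_right` compare as different, which is why "at most two children" alone does not pin down the structure.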

The reason for the second terminology may be that in binary search trees, data is only stored in binary nodes. The size of a binary search tree, as modelling a set or map, is the number of binary nodes. (But note that the same is not true for 2-3-trees; there, we have to count the binary nodes once and the ternary nodes twice.)
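
The counting rule for 2-3-trees can be sketched like this. The class and field names are hypothetical, and the example tree is built by hand rather than by a real 2-3-tree insertion algorithm.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Nullary:
    """A leaf node without data."""


@dataclass(frozen=True)
class Binary:
    """A node with one key and two children."""
    left: "Tree23"
    key: int
    right: "Tree23"


@dataclass(frozen=True)
class Ternary:
    """A node with two keys and three children."""
    left: "Tree23"
    key1: int
    middle: "Tree23"
    key2: int
    right: "Tree23"


Tree23 = Union[Nullary, Binary, Ternary]


def size(t: Tree23) -> int:
    """Number of stored keys: binary nodes count once, ternary nodes twice."""
    if isinstance(t, Nullary):
        return 0
    if isinstance(t, Binary):
        return 1 + size(t.left) + size(t.right)
    return 2 + size(t.left) + size(t.middle) + size(t.right)


N = Nullary()
# One ternary root with three binary children: 2 + 1 + 1 + 1 = 5 keys.
t = Ternary(Binary(N, 1, N), 2, Binary(N, 3, N), 4, Binary(N, 5, N))
```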

The biggest confusion comes with the term leaf. In the second convention, this does not refer to the empty nodes, but rather to binary nodes whose left and right child are empty (that is, nullary nodes). See here for the disconnect between binary trees and general trees in the course book.

Some thoughts:

In a functional language, one would define binary trees as an algebraic data type with two constructors. It is tempting to call the nullary constructor Leaf. There, the nullary case is more visible than in Java or Python, where one typically just uses the ever-present null value instead of defining a separate class for nullary nodes.
In the first convention, inserting a new value means replacing a leaf node with a binary node. In the second convention, it means putting a new leaf below a node (which may itself not be a leaf).
In the first convention, binary search always results in a node. It is a binary node if the value is found and a nullary node otherwise. The nullary case is useful because it shows the insertion location for that value.
The notion of leaf seems more fundamental/useful in the first convention.
The notion of branch is less obvious in the second terminology. Do we mean a path from the (optional) root to a leaf or additionally the information of whether to go left or right from the leaf? The second version is the more useful one.
It is not clear how to draw an empty binary search tree in the second convention. The memory representation is not empty, it still consists of the root pointer, but that is not reflected in an empty drawing.
Confusingly, there is also the notion of full binary tree, in which every node is a leaf node or has exactly two children according to the second convention. Read in the first convention, this would just be the definition of binary trees! According to the course book, they are used for Huffman coding, putting the data in the leaves. But with the first convention, we don't need this notion: we just store the data in the true leaf nodes of an arbitrary binary tree. This also reveals a bug in the notion of full binary tree: the root node should exist (be a binary node).
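
The points above about search and insertion in the first convention can be sketched as follows. All names are hypothetical. The sketch assumes an immutable tree and identifies a node by the string of left/right turns ("L"/"R") leading to it, which is one of several possible ways to describe the insertion location.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Nullary:
    """A leaf node without data."""


@dataclass(frozen=True)
class Binary:
    left: "Tree"
    value: int
    right: "Tree"


Tree = Union[Nullary, Binary]


def search(t: Tree, x: int) -> Tuple[str, Tree]:
    """Return (path, node): the turns taken and the node where the search ends.

    The node is binary if x was found and nullary otherwise.
    """
    path = ""
    while isinstance(t, Binary) and t.value != x:
        if x < t.value:
            path, t = path + "L", t.left
        else:
            path, t = path + "R", t.right
    return path, t


def replace(t: Tree, path: str, new: Tree) -> Tree:
    """Rebuild t with the node at the given path replaced by new."""
    if not path:
        return new
    assert isinstance(t, Binary)
    if path[0] == "L":
        return Binary(replace(t.left, path[1:], new), t.value, t.right)
    return Binary(t.left, t.value, replace(t.right, path[1:], new))


def insert(t: Tree, x: int) -> Tree:
    """Insert x by replacing the leaf found by the search with a binary node."""
    path, node = search(t, x)
    if isinstance(node, Binary):
        return t  # x is already present
    return replace(t, path, Binary(Nullary(), x, Nullary()))
```

Starting from the empty tree `Nullary()`, inserting 2, 1 and 3 in that order yields a root 2 with binary children 1 and 3. A subsequent search for 4 ends at the nullary node with path "RR", which is exactly where 4 would be inserted.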

Some suggestions:

Follow the first convention: leaf means the leaf node in the sense of trees.
Use another term for the other perspective. For example, data node and data leaf.
Clearly establish the drawings of binary search trees with omitted empty nodes as abuse of notation.
Eradicate the notion of full binary tree. I don't think it is useful.

heatherleaf commented 4 months ago

I think we should teach both, unfortunately.

Your suggestion is the logically cleanest. But then students will be confused when they read Wikipedia and everything else about BSTs in books and on the web. But at the same time, your suggestion maps very well with R/B trees, so I think there's a point in teaching that too.

I suggest not calling null nodes "leaves". That will just make students really, really confused. Instead, just call them "null nodes" (or "nullary"). Use "leaves" for binary nodes with only null nodes as children. The word "children" will be ambiguous: sometimes it will include null nodes, but most of the time it will mean non-null children.

PS. Why do you use "nullary nodes" instead of "null nodes"?

heatherleaf commented 4 months ago

Regarding Haskell: I would use the constructor Leaf only if it has a value, otherwise I would call it None or Nil or something like that. But apparently people disagree:

sattlerc commented 4 months ago

> Your suggestion is the logically cleanest. But then students will be confused when they read Wikipedia and everything else about BSTs in books and on the web.

Both conflicting terminologies already appear in the literature (for example, Knuth uses leaf in the original sense). Wikipedia is inconsistent (see Figure 1 for binary search tree).

I suspect the use of leaf for certain binary nodes arose from an abuse of notation: not drawing the nullary nodes (true leaf nodes) in pictures of binary trees. Now the binary nodes with only nullary children look like leaves. (There are sources that always draw the nullary nodes, using a different symbol.)

Whatever choice we make perpetuates the chosen convention, so we have some responsibility. And clearly, a warning is warranted.

> But at the same time, your suggestion maps very well with R/B trees

I don't understand what you mean here. Red-black trees are binary search trees, so the terminology should be the same.

> PS. Why do you use "nullary nodes" instead of "null nodes"?

Because some types of binary trees store data at the nullary nodes, for example Huffman trees. (Also, null evokes the connotation of a null pointer in Java, which is a language artifact.)

In general, the arity of a node is how many children it has (nullary, binary, ternary).

heatherleaf commented 4 months ago

> > But at the same time, your suggestion maps very well with R/B trees
>
> I don't understand what you mean here. Red-black trees are binary search trees, so the terminology should be the same.

I meant that we explicitly have to color the nullary nodes black. But for other BSTs we don't need to think about null nodes (we can, e.g., say that a node doesn't have a left child).
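
To illustrate why the nullary nodes need a color, here is a minimal sketch of a black-height check in which every nullary node counts as black. The class names and the counting convention (both the starting node and the leaf are counted) are assumptions; other presentations count slightly differently.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Nullary:
    """A leaf node; always treated as black."""


@dataclass(frozen=True)
class Binary:
    color: str  # "red" or "black"
    left: "Tree"
    value: int
    right: "Tree"


Tree = Union[Nullary, Binary]


def black_height(t: Tree) -> Optional[int]:
    """Number of black nodes on every path from t down to a leaf.

    Counts t itself and the leaf. Returns None if two paths disagree.
    """
    if isinstance(t, Nullary):
        return 1  # the nullary node itself is black
    lh, rh = black_height(t.left), black_height(t.right)
    if lh is None or lh != rh:
        return None
    return lh + (1 if t.color == "black" else 0)


N = Nullary()
balanced = Binary("black", Binary("red", N, 1, N), 2, Binary("red", N, 3, N))
unbalanced = Binary("black", Binary("black", N, 1, N), 2, N)
```

In `unbalanced`, the path through the left child passes more black nodes than the path straight to the right leaf, so the check reports a violation.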

heatherleaf commented 4 months ago

Actually, that's something I would like to be able to say - that a BST node doesn't have a left/right child. Because that's what we use when deleting.
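
A sketch of how "doesn't have a left/right child" shows up in deletion, using the second convention with `None` for a missing child. This is the standard textbook deletion, not necessarily the exact version used in the course, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def delete(n: Optional[Node], x: int) -> Optional[Node]:
    """Delete x from the subtree rooted at n and return the new subtree root."""
    if n is None:
        return None  # x is not in the tree
    if x < n.value:
        n.left = delete(n.left, x)
        return n
    if x > n.value:
        n.right = delete(n.right, x)
        return n
    # n holds x.
    if n.left is None:   # no left child: the right subtree takes n's place
        return n.right
    if n.right is None:  # no right child: the left subtree takes n's place
        return n.left
    # Two children: copy the smallest value of the right subtree into n,
    # then delete that value from the right subtree.
    m = n.right
    while m.left is not None:
        m = m.left
    n.value = m.value
    n.right = delete(n.right, m.value)
    return n


def inorder(n: Optional[Node]) -> List[int]:
    """Values of the subtree in sorted order."""
    if n is None:
        return []
    return inorder(n.left) + [n.value] + inorder(n.right)
```

In the first convention, the two single-child cases would instead read "the left child is a nullary node" and "the right child is a nullary node".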

ChalmersGU-data-structure-courses / OpenDSA

Tree terminology #4