Closed: sylvaticus closed this issue 10 months ago
Interesting points; let's first go to the last one, because we had a similar discussion when implementing the AbstractTrees
interface for DecisionTree.jl
🤓
The plot recipe doesn't imply or force any order on the branches. It plots the data structure just "as is" ... or to be more exact: in the order in which `children` delivers the child nodes.

The implementation of `children` within the BetaML package delivers first the `trueBranch`, then the `falseBranch`. So if we want that order reversed, we simply have to change the implementation of `children`.
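To illustrate (a minimal sketch with made-up types, not BetaML's actual implementation), reversing the plotted order is just a matter of swapping the tuple returned by `children`:

```julia
# Hypothetical minimal node type, reusing BetaML's branch field names;
# leaves are kept as plain strings for brevity.
struct DemoNode
    question::String
    trueBranch::String
    falseBranch::String
end

# BetaML-style order: the true branch is delivered (and hence plotted) first.
children(n::DemoNode) = (n.trueBranch, n.falseBranch)

# Reversing the plotted order only requires swapping the tuple:
children_reversed(n::DemoNode) = (n.falseBranch, n.trueBranch)

node = DemoNode("x < 3 ?", "yes-leaf", "no-leaf")
children(node)           # ("yes-leaf", "no-leaf")
children_reversed(node)  # ("no-leaf", "yes-leaf")
```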
And of course it is also desirable to have some sort of standard. In the DecisionTree.jl implementation we agreed on "true on the left side" and "false on the right side", and a quick search on the web showed me that most decision tree plots use the same order. So I think it would be a good idea to adapt BetaML in the same way.
Shall I create a PR or would you like to do it yourself?
To have labels on the connection lines for each branch would be a nice enhancement of the plot recipe. But we have to consider that a decision tree is not necessarily a binary tree.
E.g. the famous weather example shows nicely that in the general case each node may have an arbitrary number of children and (what is more important here) its own label for each connection line.
I.e. for a tree with $n$ nodes we have $n-1$ connection lines and thus need $n-1$ labels.
We could add that information in the same way we add feature names or class labels, i.e. the `info` parameter of `wrap` would get an additional argument (e.g. `connector_labels`).

In the case of a binary tree with $n$ nodes, the labels could easily be constructed using `repeat` (note the integer division, since `repeat` expects an integer count):

```julia
connector_labels = repeat(["yes", "no"], (n - 1) ÷ 2)
```
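As a quick sanity check (assuming a full binary tree, where $n$ is odd and there are $n-1$ connector lines):

```julia
n = 7                                                  # nodes in a full binary tree (so n is odd)
connector_labels = repeat(["yes", "no"], (n - 1) ÷ 2)  # ÷ because repeat needs an Int count
length(connector_labels)                               # 6, i.e. n - 1: one label per connector line
```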
Getting this information via `wrap`, I could extend the plot recipe to plot these labels on the connection lines. Would that be an appropriate solution for your issue?
@ablaom: Any comments from your side on these points?
Only that as the CART decision tree algorithm always delivers a binary tree, the use cases for supporting n-ary tree visualisation in machine learning are probably a bit fringe. So not a priority from my ML viewpoint.
Good point! Then it would be desirable to have a simpler interface for this rather common case. I.e. I would introduce a separate parameter (which takes exactly two labels) for this case, like

```julia
connector_labels_binary = ["yes", "no"]
```

and do the `repeat` stuff inside the recipe (because there I need $n-1$ labels anyway) ... and a parameter for the general case, like

```julia
connector_labels_nary = ["label1", "label2", ..., "label n-1"]
```

where you can pass a label for each of the $n-1$ connectors.
Would this be a better solution?
Actually, it looks simpler to me to stick with a single parameter and avoid the case distinction.
In the binary case this would leave the creation of a list with $n-1$ entries (using `repeat`) to the caller, which I wanted to avoid, since it is such a common case.
To improve that situation, I could do the following: accept a shorter list of labels and do the `repeat` within the recipe.

Even in the binary case we might want custom labels. For example, if we are branching based on an unordered categorical feature (unsupported by DecisionTree, but supported by BetaML), then we may want to label each branch with the subset of class values corresponding to it. For example: if `color in [:red, :green]` branch left; if `color in [:yellow, :black, :blue]` branch right. Such nodes are not just "yes/no" answers to the question "x < y?" which we have for the standard ordered features. So we may want different pairs of labels for each non-leaf node. Does that make sense?
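For such categorical splits, the per-branch labels could for instance be built from the class subsets themselves (a hypothetical sketch; `subset_label` is not part of any package):

```julia
# Hypothetical helper: render the subset of class values of a branch as its label.
subset_label(values) = "in [" * join(repr.(values), ", ") * "]"

left_label  = subset_label([:red, :green])            # "in [:red, :green]"
right_label = subset_label([:yellow, :black, :blue])  # "in [:yellow, :black, :blue]"
```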
Oh, that's a situation I hadn't had in mind at all. Thanks!
So I agree that we should use a single parameter. I will then just mention in the docs that the 'simple' binary case can be easily expressed using `repeat` (as described above).
It took some time 🙈, but I've finally managed to implement custom labels for the tree (see the PR #8 I've just uploaded).
Now that I work through the example, I can see it is not ideal that the user needs to find out the tree size to get the number of repeats.
From the example:

```julia
num_lines = AbstractTrees.treesize(wt) - 1  # the tree has #nodes - 1 connector lines
p2 = plot(wt, 0.8, 0.7; size = (1400, 600), connector_labels = repeat(["yes", "no"], num_lines ÷ 2))
```
What if the algorithm takes `connector_labels` of any size and just repeats (cycles through) as necessary, i.e. whenever the provided labels are exhausted before finishing the tree? That way, the above works (as does a complete explicit list of all labels), but so does:

```julia
p2 = plot(wt, 0.8, 0.7; size = (1400, 600), connector_labels = ["yes", "no"])
```
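The cycling behaviour could be sketched as follows (an illustrative sketch, not the recipe's actual code), using Julia's `mod1` for 1-based wrap-around indexing:

```julia
# Assign one label to each of the n-1 connector lines, restarting from the
# first label whenever the provided list is exhausted.
cycle_labels(labels, n_lines) = [labels[mod1(i, length(labels))] for i in 1:n_lines]

cycle_labels(["yes", "no"], 5)    # ["yes", "no", "yes", "no", "yes"]
cycle_labels(["a", "b", "c"], 3)  # ["a", "b", "c"] (a full explicit list still works)
```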
It does look very nice, BTW.
> What if the algorithm takes `connector_labels` of any size and just repeats (cycles through) as necessary, i.e. whenever the provided labels are exhausted before finishing the tree?
That would be a solution, but it doesn't completely free the user from knowing the number of nodes in the case where each connector line gets its own label. In that case they have to provide at least $n-1$ labels.
What I'm thinking about is: Don't we have in practice just two use-cases?
Do you perhaps have some examples of the second use case? How does a user obtain the labels in this case?
As there was no feedback on my last question, I've now just implemented the proposal of @ablaom. I.e. the algorithm now just cycles through the available labels for the connecting lines. So the interface to the plot recipe gets simpler and easier to use. There is an example in `examples/DecisionTree_iris.jl`.
Hello, I just tested your work on BetaML, and it works great. Only one minor point: would it be possible in TreeRecipe to add vertical text on the two sides, something like "False branch" and "True branch"? Or a "no"/"yes" on the side of the first split? Also, this is personal, but I find it more intuitive to have the false branch on the left and the true branch on the right, although I think scikit-learn also plots false on the right...