Closed: sylvaticus closed this issue 10 months ago
Interesting points; let's first go to the last one, because we had a similar discussion when implementing the AbstractTrees
interface for DecisionTree.jl
🤓
The plot recipe doesn't imply or force any order on the branches. It plots the data structure just "as is" ... or to be more exact: in the order in which `children` delivers the child nodes.

The implementation of `children` within the BetaML package delivers first the `trueBranch`, then the `falseBranch`. So if we want that order reversed, we simply have to change the implementation of `children`.
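To illustrate (a minimal sketch with made-up types, not BetaML's actual implementation), reversing the plotted order is just a matter of swapping the tuple returned by `children`:

```julia
# Hypothetical minimal node type, reusing BetaML's branch field names;
# leaves are kept as plain strings for brevity.
struct DemoNode
    question::String
    trueBranch::String
    falseBranch::String
end

# BetaML-style order: the true branch is delivered (and hence plotted) first.
children(n::DemoNode) = (n.trueBranch, n.falseBranch)

# Reversing the plotted order only requires swapping the tuple:
children_reversed(n::DemoNode) = (n.falseBranch, n.trueBranch)

node = DemoNode("x < 3 ?", "yes-leaf", "no-leaf")
children(node)           # ("yes-leaf", "no-leaf")
children_reversed(node)  # ("no-leaf", "yes-leaf")
```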
And of course it is also desirable to have some sort of standard. In the DecisionTree.jl implementation we agreed on "true on the left side" and "false on the right side", and a quick search on the web showed me that most decision tree plots use the same order. So I think it would be a good idea to adapt BetaML in the same way.
Shall I create a PR or would you like to do it yourself?
To have labels on the connection lines for each branch would be a nice enhancement of the plot recipe. But we have to consider that a decision tree is not necessarily a binary tree.
E.g. the famous weather example shows nicely that in the general case each node may have an arbitrary number of children and (what is more important here) its own label for each connection line.
I.e. for a tree with $n$ nodes we have $n-1$ connection lines and thus need $n-1$ labels.
We could add that information in the same way we add feature names or class labels, i.e. the `info` parameter of `wrap` would get an additional argument (e.g. `connector_labels`).

In the case of a binary tree with $n$ nodes, the labels could easily be constructed using `repeat` (note the integer division, since `repeat` expects an integer count):

```julia
connector_labels = repeat(["yes", "no"], (n - 1) ÷ 2)
```
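As a quick sanity check (assuming a full binary tree, where $n$ is odd and there are $n-1$ connector lines):

```julia
n = 7                                                  # nodes in a full binary tree (so n is odd)
connector_labels = repeat(["yes", "no"], (n - 1) ÷ 2)  # ÷ because repeat needs an Int count
length(connector_labels)                               # 6, i.e. n - 1: one label per connector line
```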
Getting this information via `wrap`, I could extend the plot recipe to plot these labels on the connection lines. Would that be an appropriate solution for your issue?
@ablaom: Any comments from your side on these points?
Only that as the CART decision tree algorithm always delivers a binary tree, the use cases for supporting n-ary tree visualisation in machine learning are probably a bit fringe. So not a priority from my ML viewpoint.
Good point! Then it would be desirable to have a simpler interface for this rather common case. I.e. I would introduce a separate parameter (which takes exactly two labels) for this case, like

```julia
connector_labels_binary = ["yes", "no"]
```

and do the `repeat` stuff inside the recipe (because there I need $n-1$ labels anyway) ... and a parameter for the general case, like

```julia
connector_labels_nary = ["label1", "label2", ..., "label n-1"]
```

where you can pass a label for each of the $n-1$ connectors.
Would this be a better solution?
Actually, it looks simpler to me to stick with a single parameter and avoid the case distinction.
In the binary case this would leave the creation of a list with $n-1$ entries (using `repeat`) to the caller, which I wanted to avoid, since it is such a common case.
To improve that situation, I could do the following: accept a shorter list of labels and do the `repeat` within the recipe.

Even in the binary case we might want custom labels. For example, if we are branching based on an unordered categorical feature (unsupported by DecisionTree, but supported by BetaML), then we may want to label each branch with the subset of class values corresponding to it. For example: if `color in [:red, :green]` branch left; if `color in [:yellow, :black, :blue]` branch right. Such nodes are not just "yes/no" answers to the question "x < y?" which we have for the standard ordered features. So we may want different pairs of labels for each non-leaf node. Does that make sense?
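For such categorical splits, the per-branch labels could for instance be built from the class subsets themselves (a hypothetical sketch; `subset_label` is not part of any package):

```julia
# Hypothetical helper: render the subset of class values of a branch as its label.
subset_label(values) = "in [" * join(repr.(values), ", ") * "]"

left_label  = subset_label([:red, :green])            # "in [:red, :green]"
right_label = subset_label([:yellow, :black, :blue])  # "in [:yellow, :black, :blue]"
```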
Oh, that's a situation I hadn't had in mind at all. Thanks!
So I agree that we should use a single parameter. I will then just mention in the docs that the 'simple' binary case can be easily expressed using `repeat` (as described above).
It took some time 🙈, but I've finally managed to implement custom labels for the tree (see the PR #8 I've just uploaded).
Now that I work through the example, I can see it is not ideal that the user needs to find out the tree size to get the number of repeats.
From the example:

```julia
num_lines = AbstractTrees.treesize(wt) - 1  # the tree has #nodes - 1 connector lines
p2 = plot(wt, 0.8, 0.7; size = (1400, 600), connector_labels = repeat(["yes", "no"], num_lines ÷ 2))
```
What if the algorithm takes `connector_labels` of any size and just repeats (cycles through) as necessary, i.e. whenever the provided labels are exhausted before finishing the tree? That way, the above works (as does a complete explicit list of all labels), but so does:

```julia
p2 = plot(wt, 0.8, 0.7; size = (1400, 600), connector_labels = ["yes", "no"])
```
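The cycling behaviour could be sketched as follows (an illustrative sketch, not the recipe's actual code), using Julia's `mod1` for 1-based wrap-around indexing:

```julia
# Assign one label to each of the n-1 connector lines, restarting from the
# first label whenever the provided list is exhausted.
cycle_labels(labels, n_lines) = [labels[mod1(i, length(labels))] for i in 1:n_lines]

cycle_labels(["yes", "no"], 5)    # ["yes", "no", "yes", "no", "yes"]
cycle_labels(["a", "b", "c"], 3)  # ["a", "b", "c"] (a full explicit list still works)
```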
It does look very nice, BTW.
> What if the algorithm takes `connector_labels` of any size and just repeats (cycles through) as necessary, i.e. whenever the provided labels are exhausted before finishing the tree?
That would be a solution, but it doesn't completely free the user from knowing the number of nodes in the case where each connector line gets its own label. In that case they have to provide at least $n-1$ labels.
What I'm thinking about is: Don't we have in practice just two use-cases?
Do you perhaps have some examples of the second use case? How does a user obtain the labels in this case?
As there was no feedback on my last question, I've now just implemented the proposal of @ablaom. I.e. the algorithm now just cycles through the available labels for the connecting lines. So the interface to the plot recipe gets simpler and easier to use. There is an example in `examples/DecisionTree_iris.jl`.
Hello, I just tested your work on BetaML, and it works great. Only one minor point: would it be possible in TreeRecipe to add vertical text on the two sides, something like "False branch" and "True branch"? Or a "no"/"yes" on the side of the first split? Also, this is personal, but I find it more intuitive to have the false branch on the left and the true branch on the right, although I think scikit-learn also plots false on the right...