Open ablaom opened 2 years ago
Hello, can I work on this issue?
I modified the TreePrinter struct and the fit function to include the feature_names parameter.
Running the example from the documentation https://docs.juliahub.com/MLJDecisionTreeInterface/QLzS8/0.2.5/autodocs/#MLJDecisionTreeInterface.DecisionTreeClassifier
Current output:
julia> report(mach).print_tree(3)
Feature 4 < 0.8 ?
├─ 1 : 50/50
└─ Feature 4 < 1.75 ?
    ├─ Feature 3 < 4.95 ?
        ├─
        └─
    └─ Feature 3 < 4.85 ?
        ├─
        └─ 3 : 43/43
julia> report(mach).print_tree(6)
Feature 4 < 0.8 ?
├─ 1 : 50/50
└─ Feature 4 < 1.75 ?
    ├─ Feature 3 < 4.95 ?
        ├─ Feature 4 < 1.65 ?
            ├─ 2 : 47/47
            └─ 3 : 1/1
        └─ Feature 4 < 1.55 ?
            ├─ 3 : 3/3
            └─ 2 : 2/3
    └─ Feature 3 < 4.85 ?
        ├─ Feature 2 < 3.1 ?
            ├─ 3 : 2/2
            └─ 2 : 1/1
        └─ 3 : 43/43
New output:
julia> report(mach).print_tree(3)
Feature 4: "petal_width" < 0.8 ?
├─ 1 : 50/50
└─ Feature 4: "petal_width" < 1.75 ?
    ├─ Feature 3: "petal_length" < 4.95 ?
        ├─
        └─
    └─ Feature 3: "petal_length" < 4.85 ?
        ├─
        └─ 3 : 43/43
julia> report(mach).print_tree(6)
Feature 4: "petal_width" < 0.8 ?
├─ 1 : 50/50
└─ Feature 4: "petal_width" < 1.75 ?
    ├─ Feature 3: "petal_length" < 4.95 ?
        ├─ Feature 4: "petal_width" < 1.65 ?
            ├─ 2 : 47/47
            └─ 3 : 1/1
        └─ Feature 4: "petal_width" < 1.55 ?
            ├─ 3 : 3/3
            └─ 2 : 2/3
    └─ Feature 3: "petal_length" < 4.85 ?
        ├─ Feature 1: "sepal_length" < 5.95 ?
            ├─ 2 : 1/1
            └─ 3 : 2/2
        └─ 3 : 43/43
Is the new output consistent with the required output of this issue? Please let me know if any further changes are required.
This looks good to me with respect to the feature names.
The only strange thing is that the last part of the decision tree differs in the new output (Feature 2 < 3.1 vs. Feature 1 < 5.95). This shouldn't happen if the same data and the same algorithm were used.
Current output:
└─ Feature 3 < 4.85 ?
    ├─ Feature 2 < 3.1 ?
        ├─ 3 : 2/2
        └─ 2 : 1/1
    └─ 3 : 43/43
New output:
└─ Feature 3: "petal_length" < 4.85 ?
    ├─ Feature 1: "sepal_length" < 5.95 ?
        ├─ 2 : 1/1
        └─ 3 : 2/2
    └─ 3 : 43/43
Yes, I think this is due to tie-breaking during feature selection. Both conditions (Feature 2 < 3.1 and Feature 1 < 5.95) split the node into the same two pure leaves, only with the left and right children swapped,
├─ 3 : 2/2
└─ 2 : 1/1
so they get the same score under the split criterion (entropy, Gini index, etc.). In such cases, the algorithm may pick a feature at random or take the one it encountered first during iteration. I think this is a plausible reason for the observed difference.
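To make the tie concrete, here is a minimal sketch in plain Julia. It is not the DecisionTree.jl implementation: the helper names gini and split_impurity are hypothetical, and the three sample values are made up purely so that both thresholds from the thread separate the classes. It only shows that two different feature/threshold pairs can reach exactly the same weighted impurity, which leaves the choice between them to tie-breaking.

```julia
# Gini impurity of a vector of class labels (hypothetical helper).
function gini(labels)
    n = length(labels)
    n == 0 && return 0.0
    counts = Dict{eltype(labels),Int}()
    for y in labels
        counts[y] = get(counts, y, 0) + 1
    end
    return 1.0 - sum((c / n)^2 for c in values(counts))
end

# Weighted impurity of the two children produced by `x .< threshold`
# (hypothetical helper).
function split_impurity(x, y, threshold)
    left = y[x .< threshold]
    right = y[x .>= threshold]
    n = length(y)
    return (length(left) * gini(left) + length(right) * gini(right)) / n
end

# Illustrative three-sample node: two samples of class 3, one of class 2.
# The feature values are invented for this sketch, not taken from iris.
y = [3, 3, 2]
sepal_length = [6.0, 6.2, 5.9]
sepal_width  = [3.0, 2.8, 3.2]

a = split_impurity(sepal_length, y, 5.95)  # class 2 goes to the left child
b = split_impurity(sepal_width, y, 3.1)    # class 3 goes to the left child

println((a, b))
```

Both splits yield two pure children, so neither is preferred by the criterion, and only the left/right order of the leaves differs.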
Also, I re-executed the code for the new output:
julia> report(mach).print_tree(6)
Feature 4: "petal_width" < 0.8 ?
├─ 1 : 50/50
└─ Feature 4: "petal_width" < 1.75 ?
    ├─ Feature 3: "petal_length" < 4.95 ?
        ├─ Feature 4: "petal_width" < 1.65 ?
            ├─ 2 : 47/47
            └─ 3 : 1/1
        └─ Feature 4: "petal_width" < 1.55 ?
            ├─ 3 : 3/3
            └─ 2 : 2/3
    └─ Feature 3: "petal_length" < 4.85 ?
        ├─ Feature 2: "sepal_width" < 3.1 ?
            ├─ 3 : 2/2
            └─ 2 : 1/1
        └─ 3 : 43/43
which is now the same as the current output.
Let me know if it is okay or if I should dig deeper.
Ah, I think that explains the situation well. So everything seems to work perfectly! 👍
closed by #54
This is actually possible, because DecisionTree.print_tree() has an option to pass the feature names: https://github.com/bensadeghi/DecisionTree.jl/blob/3fcb5b083e9abf45773ad1f22945473a7cc4ef89/src/DecisionTree.jl#L86
cc @roland-KA
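For reference, a minimal sketch of calling that option directly on DecisionTree.jl, outside MLJ. It assumes the package is installed and that print_tree accepts a feature_names keyword as in the linked source line; the exact signature may differ between package versions. The name order follows the feature numbering shown in the outputs above.

```julia
using DecisionTree  # assumes the DecisionTree.jl package is installed

# Load the iris data bundled with DecisionTree.jl and convert the element types.
features, labels = load_data("iris")
features = float.(features)
labels = string.(labels)

# Fit a classification tree with default settings.
model = build_tree(labels, features)

# Names in column order, so that "Feature 4" is labelled petal_width.
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Print the tree to depth 3 with feature names
# (keyword assumed from the linked source line).
print_tree(model, 3; feature_names=feature_names)
```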