Print the feature names in `report.print_tree()`

ablaom commented 2 years ago

This is actually possible, because DecisionTree.print_tree() has an option to pass the feature names: https://github.com/bensadeghi/DecisionTree.jl/blob/3fcb5b083e9abf45773ad1f22945473a7cc4ef89/src/DecisionTree.jl#L86
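For reference, a minimal sketch of calling that option directly in DecisionTree.jl, outside of MLJ. The `feature_names` vector is an assumption written out by hand in iris column order; the exact thresholds printed depend on the fitted tree.

```julia
using DecisionTree

# Iris data shipped with DecisionTree.jl; the four columns are the measurements.
features, labels = load_data("iris")
features = float.(features)
labels = string.(labels)

# Names supplied by the caller (assumed here to follow the iris column order).
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

tree = build_tree(labels, features)

# Without names, split lines have the form: Feature 4 < 0.8 ?
print_tree(tree, 3)

# With names, split lines have the form: Feature 4: "petal_width" < 0.8 ?
print_tree(tree, 3; feature_names = feature_names)
```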

cc @roland-KA

adarshpalaskar1 commented 7 months ago

Hello, can I work on this issue?

I modified the `TreePrinter` struct and the `fit` function to include a `feature_names` parameter.

I then ran the example from the documentation: https://docs.juliahub.com/MLJDecisionTreeInterface/QLzS8/0.2.5/autodocs/#MLJDecisionTreeInterface.DecisionTreeClassifier
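To illustrate the idea (this is not the actual code from the change), here is a self-contained sketch of a callable struct that stores feature names alongside a tree and uses them when printing. `ToyLeaf`, `ToyNode`, `toy_print`, and `NamedTreePrinter` are all hypothetical stand-ins, not types from MLJDecisionTreeInterface or DecisionTree.jl.

```julia
# Toy stand-ins for the real tree types (hypothetical, for illustration only).
struct ToyLeaf
    label::Int
end

struct ToyNode
    featid::Int
    featval::Float64
    left::Union{ToyNode, ToyLeaf}
    right::Union{ToyNode, ToyLeaf}
end

# Print a leaf as its label, indented by `indent` spaces.
toy_print(io::IO, leaf::ToyLeaf, feature_names, indent = 0) =
    println(io, " "^indent, leaf.label)

# Print a split, appending the quoted feature name when names are available.
function toy_print(io::IO, node::ToyNode, feature_names, indent = 0)
    name = feature_names === nothing ? "" : ": \"$(feature_names[node.featid])\""
    println(io, " "^indent, "Feature $(node.featid)$(name) < $(node.featval) ?")
    toy_print(io, node.left, feature_names, indent + 4)
    toy_print(io, node.right, feature_names, indent + 4)
end

# Callable wrapper: the names are captured once, when the wrapper is built,
# so later calls need no extra arguments.
struct NamedTreePrinter{T}
    tree::T
    feature_names::Vector{Symbol}
end

(p::NamedTreePrinter)(io::IO = stdout) = toy_print(io, p.tree, p.feature_names)

tree = ToyNode(4, 0.8, ToyLeaf(1), ToyNode(4, 1.75, ToyLeaf(2), ToyLeaf(3)))
printer = NamedTreePrinter(tree, [:sepal_length, :sepal_width, :petal_length, :petal_width])
printer()
```

Storing the names in the wrapper keeps the user-facing call unchanged: the report can still expose a single callable object.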

Current output:

julia> report(mach).print_tree(3)
Feature 4 < 0.8 ?
├─ 1 : 50/50
└─ Feature 4 < 1.75 ?
    ├─ Feature 3 < 4.95 ?
        ├─
        └─
    └─ Feature 3 < 4.85 ?
        ├─
        └─ 3 : 43/43

julia> report(mach).print_tree(6)
Feature 4 < 0.8 ?
├─ 1 : 50/50
└─ Feature 4 < 1.75 ?
    ├─ Feature 3 < 4.95 ?
        ├─ Feature 4 < 1.65 ?
            ├─ 2 : 47/47
            └─ 3 : 1/1
        └─ Feature 4 < 1.55 ?
            ├─ 3 : 3/3
            └─ 2 : 2/3
    └─ Feature 3 < 4.85 ?
        ├─ Feature 2 < 3.1 ?
            ├─ 3 : 2/2
            └─ 2 : 1/1
        └─ 3 : 43/43

New output:

julia> report(mach).print_tree(3)
Feature 4: "petal_width" < 0.8 ?
├─ 1 : 50/50
└─ Feature 4: "petal_width" < 1.75 ?
    ├─ Feature 3: "petal_length" < 4.95 ?
        ├─
        └─
    └─ Feature 3: "petal_length" < 4.85 ?
        ├─
        └─ 3 : 43/43

julia> report(mach).print_tree(6)
Feature 4: "petal_width" < 0.8 ?
├─ 1 : 50/50
└─ Feature 4: "petal_width" < 1.75 ?
    ├─ Feature 3: "petal_length" < 4.95 ?
        ├─ Feature 4: "petal_width" < 1.65 ?
            ├─ 2 : 47/47
            └─ 3 : 1/1
        └─ Feature 4: "petal_width" < 1.55 ?
            ├─ 3 : 3/3
            └─ 2 : 2/3
    └─ Feature 3: "petal_length" < 4.85 ?
        ├─ Feature 1: "sepal_length" < 5.95 ?
            ├─ 2 : 1/1
            └─ 3 : 2/2
        └─ 3 : 43/43

Is the new output consistent with the required output of this issue? Please let me know if any further changes are required.

roland-KA commented 6 months ago

This looks good to me with respect to the feature names.

The only strange thing is that the last part of the decision tree differs in the new output example (Feature 2 < 3.1 vs. Feature 1 < 5.95). This shouldn't be the case if the same data and the same algorithm have been used.

Current output:

└─ Feature 3 < 4.85 ?
        ├─ Feature 2 < 3.1 ?
            ├─ 3 : 2/2
            └─ 2 : 1/1
        └─ 3 : 43/43

New output:

└─ Feature 3: "petal_length" < 4.85 ?
        ├─ Feature 1: "sepal_length" < 5.95 ?
            ├─ 2 : 1/1
            └─ 3 : 2/2
        └─ 3 : 43/43

adarshpalaskar1 commented 6 months ago

Yes, I think this is due to tie-breaking during feature selection. Both conditions (Feature 2 < 3.1 and Feature 1 < 5.95) yield the same leaves,

├─ 3 : 2/2
└─ 2 : 1/1

so they have the same score under the splitting criterion (entropy, Gini index, etc.). In such cases, the algorithm may pick a feature at random or simply take the first one encountered during iteration. I think this is a possible reason for the observed difference.
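A small sketch of how such a tie can arise, using three made-up samples (not the actual iris rows) and a hand-written Gini computation. The helper names `gini`, `split_impurity`, and `pick` are hypothetical, and this does not claim to reproduce how DecisionTree.jl orders or selects candidate features internally.

```julia
# Gini impurity of a vector of class labels.
function gini(labels)
    n = length(labels)
    n == 0 && return 0.0
    return 1.0 - sum((count(==(c), labels) / n)^2 for c in unique(labels))
end

# Weighted impurity of the two children after splitting on `x .< threshold`.
function split_impurity(x, y, threshold)
    left = y[x .< threshold]
    right = y[x .>= threshold]
    return (length(left) * gini(left) + length(right) * gini(right)) / length(y)
end

# Three made-up samples: one of class 2, two of class 3.
sepal_length = [5.9, 6.0, 6.2]
sepal_width  = [3.2, 2.8, 3.0]
y            = [2, 3, 3]

candidates = [(:sepal_length, sepal_length, 5.95), (:sepal_width, sepal_width, 3.1)]

# Both splits isolate the single class-2 sample, so both scores are 0.0.
scores = [split_impurity(x, y, t) for (_, x, t) in candidates]

# Taking the first minimum makes the winner depend on the visiting order.
pick(order) = candidates[order[argmin(scores[order])]][1]

println(pick([1, 2]))  # visits sepal_length first
println(pick([2, 1]))  # visits sepal_width first
```

With equal scores, the only thing separating the two candidates is the order in which they are considered, which is consistent with seeing either split across runs.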

Also, I re-executed the code for the new output:

julia> report(mach).print_tree(6)
Feature 4: "petal_width" < 0.8 ?
├─ 1 : 50/50
└─ Feature 4: "petal_width" < 1.75 ?
    ├─ Feature 3: "petal_length" < 4.95 ?
        ├─ Feature 4: "petal_width" < 1.65 ?
            ├─ 2 : 47/47
            └─ 3 : 1/1
        └─ Feature 4: "petal_width" < 1.55 ?
            ├─ 3 : 3/3
            └─ 2 : 2/3
    └─ Feature 3: "petal_length" < 4.85 ?
        ├─ Feature 2: "sepal_width" < 3.1 ?
            ├─ 3 : 2/2
            └─ 2 : 1/1
        └─ 3 : 43/43

which is now the same as the current output.

Let me know if it is okay or if I should dig deeper.

roland-KA commented 6 months ago

Ah, I think that explains the situation well. So everything seems to work perfectly! 👍

ablaom commented 6 months ago

closed by #54

JuliaAI / MLJDecisionTreeInterface.jl

Print the feature names in `report.print_tree()` #23