Rambatino / CHAID

A python implementation of the common CHAID algorithm
Apache License 2.0

BestTree class that aims to calculate the best possible tree, as well as the needed predict method #91

Closed Rambatino closed 5 years ago

Rambatino commented 5 years ago

This PR contains a class called BestTree, which takes in predictors and a target and uses hyperopt to find the best set of params, using a train/test split of the data.

It is used as follows:

>>> best_config = BestTree(df[ind_vars].values, df[dep_var].values, 30, split_titles=ind_vars).calculate()
{'alpha_merge': 0.42586970334320423, 'max_depth': 6, 'min_child_node_size': 32, 'min_parent_node_size': 31}
>>> # use this to create a new tree
>>> tree = Tree.from_pandas_df(data, types, nspace.dependent_variable[0], **best_config)
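
For context, the search inside BestTree is essentially a hyperopt optimisation over the CHAID parameters, scored on held-out data. A minimal sketch of that kind of search is below; the search-space bounds, the train/test split and the scoring are illustrative assumptions, not the actual BestTree internals (it reuses df, ind_vars, dep_var and types from the snippets above):

from hyperopt import fmin, hp, tpe
from sklearn.model_selection import train_test_split
from CHAID import Tree

# hold out a portion of the data to score each candidate config on
train, test = train_test_split(df, test_size=0.25)

# illustrative search space over the CHAID params being tuned
space = {
    'alpha_merge': hp.uniform('alpha_merge', 0.01, 0.5),
    'max_depth': hp.quniform('max_depth', 2, 10, 1),
    'min_parent_node_size': hp.quniform('min_parent_node_size', 10, 60, 1),
    'min_child_node_size': hp.quniform('min_child_node_size', 10, 60, 1),
}

def loss(params):
    # quniform yields floats, so cast the integer-valued params back to int
    params = {k: (v if k == 'alpha_merge' else int(v)) for k, v in params.items()}
    tree = Tree.from_pandas_df(train, types, dep_var, **params)
    predictions = tree.predict(test[ind_vars].values)
    accuracy = (predictions == test[dep_var].values).mean()
    return -accuracy  # fmin minimises, so negate the held-out accuracy

best_config = fmin(loss, space, algo=tpe.suggest, max_evals=30)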

This also involved adding a .predict() method, which applies the fitted tree's model to a dataset, e.g.:

>>> tree.predict(data[list(types.keys())].values[: 10, :])
array([1., 0., 1., 0., 1., 0., 1., 0., 1., 0.])
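
The "Accuracy" figures printed by the command-line runs below are, presumably, just the fraction of rows where the prediction matches the observed target; something along these lines (reusing data, types and nspace from the snippets above) reproduces that calculation:

>>> import numpy as np
>>> predictions = tree.predict(data[list(types.keys())].values)
>>> observed = data[nspace.dependent_variable[0]].values
>>> np.mean(predictions == observed)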

The PR also adds a way to run this search from the command line.

Run with standard params:

➜ python -m CHAID tests/data/titanic.csv survived sex pclass embarked
([], {0: 809.0, 1: 500.0}, (sex, p=1.47145310169e-81, score=365.886947811, groups=[['female'], ['male']]), dof=1))
|-- (['female'], {0: 127.0, 1: 339.0}, (pclass, p=7.50718706569e-26, score=115.702703161, groups=[[1], [2], [3]]), dof=2))
|   |-- ([1], {0: 5.0, 1: 139.0}, <Invalid Chaid Split> - the max depth has been reached)
|   |-- ([2], {0: 12.0, 1: 94.0}, <Invalid Chaid Split> - the max depth has been reached)
|   +-- ([3], {0: 110.0, 1: 106.0}, <Invalid Chaid Split> - the max depth has been reached)
+-- (['male'], {0: 682.0, 1: 161.0}, (pclass, p=9.19686249011e-09, score=33.0040176609, groups=[[1], [2, 3]]), dof=1))
    |-- ([1], {0: 118.0, 1: 61.0}, <Invalid Chaid Split> - the max depth has been reached)
    +-- ([2, 3], {0: 564.0, 1: 100.0}, <Invalid Chaid Split> - the max depth has been reached)

('Accuracy: ', 0.7830404889228418)

Find better tree:

➜ python -m CHAID tests/data/titanic.csv survived sex pclass embarked --find-best --n 100
Finding best CHAID params: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [03:43<00:00,  2.26s/it]
('Best config: ', {'min_parent_node_size': 53, 'alpha_merge': 0.1784422075047269, 'max_depth': 6, 'min_child_node_size': 23})
([], {0: 809.0, 1: 500.0}, (sex, p=1.47145310169e-81, score=365.886947811, groups=[['female'], ['male']]), dof=1))
|-- (['female'], {0: 127.0, 1: 339.0}, (pclass, p=7.50718706569e-26, score=115.702703161, groups=[[1], [2], [3]]), dof=2))
|   |-- ([1], {0: 5.0, 1: 139.0}, <Invalid Chaid Split> - the node only contains single category respondents)
|   |-- ([2], {0: 12.0, 1: 94.0}, <Invalid Chaid Split> - the node only contains single category respondents)
|   +-- ([3], {0: 110.0, 1: 106.0}, (embarked, p=0.000638052386506, score=11.6615476457, groups=[['C', 'Q'], ['S']]), dof=1))
|       |-- (['C', 'Q'], {0: 32.0, 1: 55.0}, <Invalid Chaid Split> - the node only contains single category respondents)
|       +-- (['S'], {0: 78.0, 1: 51.0}, <Invalid Chaid Split> - the node only contains single category respondents)
+-- (['male'], {0: 682.0, 1: 161.0}, (pclass, p=9.19686249011e-09, score=33.0040176609, groups=[[1], [2, 3]]), dof=1))
    |-- ([1], {0: 118.0, 1: 61.0}, <Invalid Chaid Split> - the node only contains single category respondents)
    +-- ([2, 3], {0: 564.0, 1: 100.0}, (embarked, p=0.0265545221538, score=4.91954418577, groups=[['C'], ['Q', 'S']]), dof=1))
        |-- (['C'], {0: 67.0, 1: 20.0}, <Invalid Chaid Split> - the node only contains single category respondents)
        +-- (['Q', 'S'], {0: 497.0, 1: 80.0}, <Invalid Chaid Split> - the node only contains single category respondents)

('Accuracy: ', 0.80061115355233)
Rambatino commented 5 years ago

Interestingly, because the tree stops splitting as soon as any one of the params is hit (e.g. min parent node size), the search could yield a config like max depth 28 with min parent node size 38, where the latter takes precedence. If you then applied that config to a much larger unseen dataset, the former (max depth) could have more of an impact.
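
To make that interplay concrete, one way to see which constraint actually binds is to fit the same config on a small sample and on the full dataset and compare how far each tree grows. A rough sketch, reusing data, types and nspace from the examples above; the sample size and config values here are arbitrary:

config = {'max_depth': 28, 'min_parent_node_size': 38}
# fit identical configs on datasets of different sizes; the printed trees show
# which constraint actually stopped the splitting in each case
small = Tree.from_pandas_df(data.sample(200), types, nspace.dependent_variable[0], **config)
full = Tree.from_pandas_df(data, types, nspace.dependent_variable[0], **config)
small.print_tree()
full.print_tree()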

VivianMagri commented 5 years ago

Why was that canceled?

Rambatino commented 5 years ago

@VivianMagri it was cancelled because it became stale and there didn't seem to be much impetus for this functionality. What functionality are you after?