Rambatino / CHAID

A python implementation of the common CHAID algorithm
Apache License 2.0

BestTree class that aims to calculate the best possible tree, as well as the needed predict method #91

Closed Rambatino closed 5 years ago

Rambatino commented 5 years ago

This PR contains a class called BestTree, which takes in predictors and a target and uses hyperopt to find the best set of params, using a train/test split of the data.

It is used as follows:

>>> best_config = BestTree(df[ind_vars].values, df[dep_var].values, 30, split_titles=ind_vars).calculate()
{'alpha_merge': 0.42586970334320423, 'max_depth': 6, 'min_child_node_size': 32, 'min_parent_node_size': 31}
>>> # use this to create a new tree
>>> tree = Tree.from_pandas_df(data, types, nspace.dependent_variable[0], **best_config)
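
For context, the search inside BestTree is essentially a hyperopt optimisation over the CHAID parameters, scored on held-out data. A minimal sketch of that kind of search is below; the search-space bounds, the train/test split and the scoring are illustrative assumptions, not the actual BestTree internals (it reuses df, ind_vars, dep_var and types from the snippets above):

from hyperopt import fmin, hp, tpe
from sklearn.model_selection import train_test_split
from CHAID import Tree

# hold out a portion of the data to score each candidate config on
train, test = train_test_split(df, test_size=0.25)

# illustrative search space over the CHAID params being tuned
space = {
    'alpha_merge': hp.uniform('alpha_merge', 0.01, 0.5),
    'max_depth': hp.quniform('max_depth', 2, 10, 1),
    'min_parent_node_size': hp.quniform('min_parent_node_size', 10, 60, 1),
    'min_child_node_size': hp.quniform('min_child_node_size', 10, 60, 1),
}

def loss(params):
    # quniform yields floats, so cast the integer-valued params back to int
    params = {k: (v if k == 'alpha_merge' else int(v)) for k, v in params.items()}
    tree = Tree.from_pandas_df(train, types, dep_var, **params)
    predictions = tree.predict(test[ind_vars].values)
    accuracy = (predictions == test[dep_var].values).mean()
    return -accuracy  # fmin minimises, so negate the held-out accuracy

best_config = fmin(loss, space, algo=tpe.suggest, max_evals=30)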

This also involved adding a .predict() method, which applies the fitted tree's model to a dataset, e.g.:

>>> tree.predict(data[list(types.keys())].values[: 10, :])
array([1., 0., 1., 0., 1., 0., 1., 0., 1., 0.])
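
The "Accuracy" figures printed by the command-line runs below are, presumably, just the fraction of rows where the prediction matches the observed target; something along these lines (reusing data, types and nspace from the snippets above) reproduces that calculation:

>>> import numpy as np
>>> predictions = tree.predict(data[list(types.keys())].values)
>>> observed = data[nspace.dependent_variable[0]].values
>>> np.mean(predictions == observed)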

The PR also adds a way to run this search from the command line.

Run with standard params:

➜ python -m CHAID tests/data/titanic.csv survived sex pclass embarked
([], {0: 809.0, 1: 500.0}, (sex, p=1.47145310169e-81, score=365.886947811, groups=[['female'], ['male']]), dof=1))
|-- (['female'], {0: 127.0, 1: 339.0}, (pclass, p=7.50718706569e-26, score=115.702703161, groups=[[1], [2], [3]]), dof=2))
|   |-- ([1], {0: 5.0, 1: 139.0}, <Invalid Chaid Split> - the max depth has been reached)
|   |-- ([2], {0: 12.0, 1: 94.0}, <Invalid Chaid Split> - the max depth has been reached)
|   +-- ([3], {0: 110.0, 1: 106.0}, <Invalid Chaid Split> - the max depth has been reached)
+-- (['male'], {0: 682.0, 1: 161.0}, (pclass, p=9.19686249011e-09, score=33.0040176609, groups=[[1], [2, 3]]), dof=1))
    |-- ([1], {0: 118.0, 1: 61.0}, <Invalid Chaid Split> - the max depth has been reached)
    +-- ([2, 3], {0: 564.0, 1: 100.0}, <Invalid Chaid Split> - the max depth has been reached)

('Accuracy: ', 0.7830404889228418)

Find better tree:

➜ python -m CHAID tests/data/titanic.csv survived sex pclass embarked --find-best --n 100
Finding best CHAID params: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [03:43<00:00,  2.26s/it]
('Best config: ', {'min_parent_node_size': 53, 'alpha_merge': 0.1784422075047269, 'max_depth': 6, 'min_child_node_size': 23})
([], {0: 809.0, 1: 500.0}, (sex, p=1.47145310169e-81, score=365.886947811, groups=[['female'], ['male']]), dof=1))
|-- (['female'], {0: 127.0, 1: 339.0}, (pclass, p=7.50718706569e-26, score=115.702703161, groups=[[1], [2], [3]]), dof=2))
|   |-- ([1], {0: 5.0, 1: 139.0}, <Invalid Chaid Split> - the node only contains single category respondents)
|   |-- ([2], {0: 12.0, 1: 94.0}, <Invalid Chaid Split> - the node only contains single category respondents)
|   +-- ([3], {0: 110.0, 1: 106.0}, (embarked, p=0.000638052386506, score=11.6615476457, groups=[['C', 'Q'], ['S']]), dof=1))
|       |-- (['C', 'Q'], {0: 32.0, 1: 55.0}, <Invalid Chaid Split> - the node only contains single category respondents)
|       +-- (['S'], {0: 78.0, 1: 51.0}, <Invalid Chaid Split> - the node only contains single category respondents)
+-- (['male'], {0: 682.0, 1: 161.0}, (pclass, p=9.19686249011e-09, score=33.0040176609, groups=[[1], [2, 3]]), dof=1))
    |-- ([1], {0: 118.0, 1: 61.0}, <Invalid Chaid Split> - the node only contains single category respondents)
    +-- ([2, 3], {0: 564.0, 1: 100.0}, (embarked, p=0.0265545221538, score=4.91954418577, groups=[['C'], ['Q', 'S']]), dof=1))
        |-- (['C'], {0: 67.0, 1: 20.0}, <Invalid Chaid Split> - the node only contains single category respondents)
        +-- (['Q', 'S'], {0: 497.0, 1: 80.0}, <Invalid Chaid Split> - the node only contains single category respondents)

('Accuracy: ', 0.80061115355233)
Rambatino commented 5 years ago

Interestingly, because the tree stops splitting as soon as any one of the params is hit (e.g. min parent node size), the search could yield a config like max depth 28 with min parent node size 38, where the latter takes precedence. If you then applied that config to a much larger unseen dataset, the former (max depth) could have more of an impact.
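
To make that interplay concrete, one way to see which constraint actually binds is to fit the same config on a small sample and on the full dataset and compare how far each tree grows. A rough sketch, reusing data, types and nspace from the examples above; the sample size and config values here are arbitrary:

config = {'max_depth': 28, 'min_parent_node_size': 38}
# fit identical configs on datasets of different sizes; the printed trees show
# which constraint actually stopped the splitting in each case
small = Tree.from_pandas_df(data.sample(200), types, nspace.dependent_variable[0], **config)
full = Tree.from_pandas_df(data, types, nspace.dependent_variable[0], **config)
small.print_tree()
full.print_tree()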

VivianMagri commented 5 years ago

Why was that canceled?

Rambatino commented 5 years ago

@VivianMagri it was cancelled because it became stale and there didn't seem to be much impetus for this functionality. What functionality are you after?