Rambatino / CHAID

A Python implementation of the common CHAID algorithm
Apache License 2.0

Added Levene & Bartlett test for continuous dependent variables #48

Closed Rambatino closed 7 years ago

Rambatino commented 7 years ago

This adds the ability to supply a continuous variable as the dependent variable. For each candidate split, the p-value across the resulting groups of continuous values is calculated using either Bartlett's test (if the original distribution is normal) or Levene's test (if it is not). This is in contrast to the chi-squared test, which is used when the dependent variable is categorical. The output is essentially the same:

python -m CHAID tests/data/titanic.csv survived sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 --dependent-variable-type continuous

/Users/Mark/anaconda/envs/quattro8/lib/python2.7/site-packages/numpy/lib/arraysetops.py:200: FutureWarning: numpy not_equal will not check object identity in the future. The comparison did not return the same result as suggested by the identity (`is`)) and will change.
  flag = np.concatenate(([True], aux[1:] != aux[:-1]))
([], {'s.t.d': 0.48586947642957506, 'mean': 0.3819709702062643}, (sex, p=0.000638905011409, score=11.7157182334, groups=[['female'], ['male']]), dof=1))
├── (['female'], {'s.t.d': 0.44526216422083886, 'mean': 0.72746781115879833}, (embarked, p=7.03229898206e-07, score=25.2982409725, groups=[['C', '<missing>'], ['Q', 'S']]), dof=1))
│   ├── (['C', '<missing>'], {'s.t.d': 0.29411364391804806, 'mean': 0.90434782608695652}, <Invalid Chaid Split>)
│   └── (['Q', 'S'], {'s.t.d': 0.47038754000149097, 'mean': 0.66951566951566954}, <Invalid Chaid Split>)
└── (['male'], {'s.t.d': 0.39307692569404135, 'mean': 0.19098457888493475}, (embarked, p=4.72697297824e-05, score=16.7286102795, groups=[['C'], ['Q', 'S']]), dof=1))
    ├── (['C'], {'s.t.d': 0.46071697630637259, 'mean': 0.30573248407643311}, <Invalid Chaid Split>)
    └── (['Q', 'S'], {'s.t.d': 0.37093039074150586, 'mean': 0.16472303206997085}, <Invalid Chaid Split>)
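
The test selection described above can be sketched with SciPy. This is a minimal illustration under stated assumptions, not the code in this PR: the helper name `variance_test_p_value`, the use of `scipy.stats.normaltest` as the normality check, and the 0.05 threshold are all hypothetical choices. Note that both Bartlett's and Levene's tests compare the variances of the groups.

```python
import numpy as np
from scipy import stats


def variance_test_p_value(groups, pooled, alpha=0.05):
    """Pick Bartlett's or Levene's test based on a normality check.

    Hypothetical helper: the normality check (normaltest) and the
    alpha threshold are assumptions, not taken from the CHAID source.
    """
    # normaltest's null hypothesis is that the data are normally distributed,
    # so a large p-value means normality is not rejected.
    _, p_normal = stats.normaltest(pooled)
    if p_normal >= alpha:
        statistic, p_value = stats.bartlett(*groups)
        return "bartlett", statistic, p_value
    # Levene's test is less sensitive to departures from normality.
    statistic, p_value = stats.levene(*groups)
    return "levene", statistic, p_value


# Synthetic data: two groups with clearly different spreads.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=0.0, scale=3.0, size=200)

name, statistic, p_value = variance_test_p_value(
    [group_a, group_b], np.concatenate([group_a, group_b])
)
print(name, statistic, p_value)
```

With groups whose spreads differ this much, either test should report a very small p-value, which is the kind of signal that would justify a split.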

python -m CHAID tests/data/titanic.csv survived sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05

/Users/Mark/anaconda/envs/quattro8/lib/python2.7/site-packages/numpy/lib/arraysetops.py:200: FutureWarning: numpy not_equal will not check object identity in the future. The comparison did not return the same result as suggested by the identity (`is`)) and will change.
  flag = np.concatenate(([True], aux[1:] != aux[:-1]))
([], {0: 809, 1: 500}, (sex, p=1.47145310169e-81, score=365.886947811, groups=[['female'], ['male']]), dof=1))
├── (['female'], {0: 127, 1: 339}, (embarked, p=9.17624191599e-07, score=24.0936494474, groups=[['C', '<missing>'], ['Q', 'S']]), dof=1))
│   ├── (['C', '<missing>'], {0: 11, 1: 104}, <Invalid Chaid Split>)
│   └── (['Q', 'S'], {0: 116, 1: 235}, <Invalid Chaid Split>)
└── (['male'], {0: 682, 1: 161}, (embarked, p=5.017855245e-05, score=16.4413525404, groups=[['C'], ['Q', 'S']]), dof=1))
    ├── (['C'], {0: 109, 1: 48}, <Invalid Chaid Split>)
    └── (['Q', 'S'], {0: 573, 1: 113}, <Invalid Chaid Split>)
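
Because `survived` is coded 0/1, the `mean` and `s.t.d` shown in the continuous run can be cross-checked against the class counts in the categorical run. The sketch below assumes `s.t.d` is the population standard deviation (dividing by n rather than n - 1); `summary_from_counts` is a hypothetical helper written for this check, not part of CHAID.

```python
import math


def summary_from_counts(counts):
    """Derive the mean and population standard deviation from value counts."""
    n = sum(counts.values())
    mean = sum(value * count for value, count in counts.items()) / n
    variance = sum(
        count * (value - mean) ** 2 for value, count in counts.items()
    ) / n
    return {"mean": mean, "s.t.d": math.sqrt(variance)}


# Counts copied from the categorical output above.
root = summary_from_counts({0: 809, 1: 500})
female = summary_from_counts({0: 127, 1: 339})

print(root)
print(female)
```

Under that assumption, the values for the root and `female` nodes should agree with the `mean` and `s.t.d` printed for the same nodes in the continuous run.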

Architectural shifts

Still need to write some documentation.
Python 3 tests are failing (thank you, tox).