This adds support for supplying a continuous dependent variable. The p-value across the continuous sets is calculated with Bartlett's test if the original distribution is normal, or with Levene's test if not. This is in contrast to the chi-squared test used when the dependent variable is categorical. The output is essentially the same in both modes (continuous first, then categorical for comparison):
python -m CHAID tests/data/titanic.csv survived sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 --dependent-variable-type continuous
/Users/Mark/anaconda/envs/quattro8/lib/python2.7/site-packages/numpy/lib/arraysetops.py:200: FutureWarning: numpy not_equal will not check object identity in the future. The comparison did not return the same result as suggested by the identity (`is`)) and will change.
flag = np.concatenate(([True], aux[1:] != aux[:-1]))
([], {'s.t.d': 0.48586947642957506, 'mean': 0.3819709702062643}, (sex, p=0.000638905011409, score=11.7157182334, groups=[['female'], ['male']]), dof=1))
├── (['female'], {'s.t.d': 0.44526216422083886, 'mean': 0.72746781115879833}, (embarked, p=7.03229898206e-07, score=25.2982409725, groups=[['C', '<missing>'], ['Q', 'S']]), dof=1))
│   ├── (['C', '<missing>'], {'s.t.d': 0.29411364391804806, 'mean': 0.90434782608695652}, <Invalid Chaid Split>)
│   └── (['Q', 'S'], {'s.t.d': 0.47038754000149097, 'mean': 0.66951566951566954}, <Invalid Chaid Split>)
└── (['male'], {'s.t.d': 0.39307692569404135, 'mean': 0.19098457888493475}, (embarked, p=4.72697297824e-05, score=16.7286102795, groups=[['C'], ['Q', 'S']]), dof=1))
    ├── (['C'], {'s.t.d': 0.46071697630637259, 'mean': 0.30573248407643311}, <Invalid Chaid Split>)
    └── (['Q', 'S'], {'s.t.d': 0.37093039074150586, 'mean': 0.16472303206997085}, <Invalid Chaid Split>)
python -m CHAID tests/data/titanic.csv survived sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05
/Users/Mark/anaconda/envs/quattro8/lib/python2.7/site-packages/numpy/lib/arraysetops.py:200: FutureWarning: numpy not_equal will not check object identity in the future. The comparison did not return the same result as suggested by the identity (`is`)) and will change.
flag = np.concatenate(([True], aux[1:] != aux[:-1]))
([], {0: 809, 1: 500}, (sex, p=1.47145310169e-81, score=365.886947811, groups=[['female'], ['male']]), dof=1))
├── (['female'], {0: 127, 1: 339}, (embarked, p=9.17624191599e-07, score=24.0936494474, groups=[['C', '<missing>'], ['Q', 'S']]), dof=1))
│   ├── (['C', '<missing>'], {0: 11, 1: 104}, <Invalid Chaid Split>)
│   └── (['Q', 'S'], {0: 116, 1: 235}, <Invalid Chaid Split>)
└── (['male'], {0: 682, 1: 161}, (embarked, p=5.017855245e-05, score=16.4413525404, groups=[['C'], ['Q', 'S']]), dof=1))
    ├── (['C'], {0: 109, 1: 48}, <Invalid Chaid Split>)
    └── (['Q', 'S'], {0: 573, 1: 113}, <Invalid Chaid Split>)
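For reviewers who want to see the test selection for the continuous case spelled out, here is a rough sketch using SciPy. It is not the code in this PR: the function name `continuous_split_p_value`, the `normality_alpha` threshold and the use of `scipy.stats.normaltest` as the normality check are all assumptions made for illustration. The example data are rebuilt from the female/male survival counts in the categorical output above.

```python
import numpy as np
from scipy import stats


def continuous_split_p_value(groups, normality_alpha=0.05):
    """Return (test_name, score, p_value) for one candidate split.

    `groups` is a list of 1-D arrays of the dependent variable,
    one array per candidate child node.
    """
    pooled = np.concatenate(groups)
    # Assumption: normality is judged on the pooled dependent variable
    # with D'Agostino's K^2 test; the PR may use a different check.
    _, normal_p = stats.normaltest(pooled)
    if normal_p > normality_alpha:
        # Normality not rejected: use Bartlett's test.
        score, p_value = stats.bartlett(*groups)
        return "bartlett", float(score), float(p_value)
    # Normality rejected: fall back to Levene's test.
    score, p_value = stats.levene(*groups)
    return "levene", float(score), float(p_value)


# 0/1 survival values rebuilt from the node counts shown above.
female = np.array([1] * 339 + [0] * 127)
male = np.array([1] * 161 + [0] * 682)
print(continuous_split_p_value([female, male]))
```

A 0/1 column is far from normal, so this sketch takes the Levene branch for the titanic data. The exact score depends on options such as Levene's `center` argument, so it is not expected to match the figures above digit for digit.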
Architectural shifts
Node now has `score` rather than `chi`, because the statistic is no longer always a chi-squared value with the new significance tests
Column now holds the weight (as the weight applies to the dependent variable)
A new Stats class understands the different column types and which statistical test should be applied to each
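To make the column-type distinction concrete, the sketch below shows how a node summary could differ by dependent-variable type while honouring weights. The `summarise` helper is hypothetical and is not the API introduced here, and treating the weights as case weights is an assumption. With no weights, the population mean and standard deviation of 809 zeros and 500 ones reproduce the root node's `mean` and `s.t.d` in the continuous output above.

```python
from collections import Counter

import numpy as np


def summarise(values, weights=None, continuous=False):
    """Summarise the dependent variable of one node (hypothetical helper)."""
    values = np.asarray(values)
    if weights is None:
        weights = np.ones(len(values))
    weights = np.asarray(weights, dtype=float)
    if continuous:
        # Weighted mean and population standard deviation.
        mean = np.average(values, weights=weights)
        variance = np.average((values - mean) ** 2, weights=weights)
        return {"mean": float(mean), "s.t.d": float(np.sqrt(variance))}
    # Categorical: weighted count per category.
    counts = Counter()
    for value, weight in zip(values.tolist(), weights.tolist()):
        counts[value] += weight
    return dict(counts)


survived = [0] * 809 + [1] * 500
print(summarise(survived, continuous=True))
print(summarise(survived))
```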
Still need to write some documentation
Python 3 tests are failing (thank you, tox)