Rambatino / CHAID

A python implementation of the common CHAID algorithm
Apache License 2.0

Reverted back heuristic approach #84

Closed Rambatino closed 6 years ago

Rambatino commented 6 years ago

Shifted back to the heuristic way, optimising for low sample sizes.

codecov[bot] commented 6 years ago

Codecov Report

Merging #84 into master will increase coverage by 0.17%. The diff coverage is 93.65%.

@@            Coverage Diff             @@
##           master      #84      +/-   ##
==========================================
+ Coverage    92.8%   92.97%   +0.17%     
==========================================
  Files           7        7              
  Lines         514      555      +41     
==========================================
+ Hits          477      516      +39     
- Misses         37       39       +2

Impacted Files    Coverage Δ
CHAID/stats.py    97.59% <93.65%> (-0.81%) ↓

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 5d7b4dd...c48b079.

Rambatino commented 6 years ago

For the edge case:

ipdb>
([], {0: 13.0, 1: 6.0, 2: 1.0}, (age, p=0.025874595652, score=7.30898730899, groups=[['0-5'], ['6-10', '11-15']]), dof=2))
├── (['0-5'], {0: 10.0, 1: 1.0, 2: 0}, <Invalid Chaid Split> - the minimum parent node size threshold has been reached)
└── (['6-10', '11-15'], {0: 3.0, 1: 5.0, 2: 1.0}, <Invalid Chaid Split> - the minimum parent node size threshold has been reached)
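
As a sanity check on the root split above, the reported score and p-value can be reproduced from the class counts of the two child nodes. This is a minimal sketch, assuming the score is a plain Pearson chi-square statistic on the child-by-class table. The helper name `pearson_chi_square` is illustrative and not part of CHAID.

```python
import math

# Class counts of the dependent variable in each child node,
# copied from the tree output above (classes 0, 1, 2).
children = {
    "0-5": [10.0, 1.0, 0.0],
    "6-10, 11-15": [3.0, 5.0, 1.0],
}


def pearson_chi_square(rows):
    """Return (statistic, dof) for a contingency table given as a list of rows.

    Illustrative helper (an assumption, not CHAID's own code).
    """
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(rows):
        for c, observed in enumerate(row):
            # Expected count under independence of group and class.
            expected = row_totals[r] * col_totals[c] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, dof


score, dof = pearson_chi_square(list(children.values()))

# With 2 degrees of freedom the chi-square survival function
# has the closed form exp(-x / 2), so no stats library is needed here.
p_value = math.exp(-score / 2)

print(f"score={score:.11f} dof={dof} p={p_value:.12f}")
```

Under that assumption the result agrees with the `score`, `dof` and `p` shown in the root node. The closed-form p-value only holds for `dof == 2`; a general version would need a chi-square survival function from a statistics library.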

image

Rambatino commented 6 years ago

Actually, we seem to be missing a tree level...

Rambatino commented 6 years ago

Ahh, the min parent node size was wrong. Here it is with 20:

image