Rambatino / CHAID

A python implementation of the common CHAID algorithm
Apache License 2.0

Basic clarifications #66

Closed pennylong123 closed 7 years ago

pennylong123 commented 7 years ago

I am new to the CHAID algorithm in general and also to this package, so could you please give me a little clarification on what certain things mean? From what I understand so far, the dataset is a numpy array containing numbers, with several independent variables and one dependent variable. Then, constructing a tree gives us the CHAID output. Is this correct so far? For the actual tree, could someone help me understand what this output means in the context of the example in the README?

([], {1: 5, 2: 5}, ('a', p=0.001565402258, score=10.0, groups=[[1], [2]]), dof=1))
├── ([1], {1: 5, 2: 0}, <Invalid Chaid Split>)
└── ([2], {1: 0, 2: 5}, <Invalid Chaid Split>)

(I know these are all simple questions but I am new to all of this so would like to make sure I understand the basics)

Thank you

Rambatino commented 7 years ago

Do you know @asram6 - he asked pretty much this same question earlier today.

I'll update the README now.

Rambatino commented 7 years ago

Hi @pennylong123 please have a look at:

https://github.com/Rambatino/CHAID/pull/67/files?short_path=04c6e90#diff-04c6e90faac2675aa89e2176d2eec7d8

And give your opinions.

https://github.com/Rambatino/CHAID/pull/67
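
For readers landing here, the README output quoted at the top of this issue can be read roughly as follows. Each node prints three things: the categories of the parent's split variable that lead to the node (empty for the root), the counts of each dependent-variable value inside the node, and the split chosen at that node (or `<Invalid Chaid Split>` when the node is not split further). This is a reading of the printed output, not a statement about the library's internals. The sketch below uses only the standard library and does not call CHAID; it rebuilds the counts from the README data, where column `a` is `[1]*5 + [2]*5` and the dependent variable is `[1]*5 + [2]*5`. The `members` helper is hypothetical.

```python
from collections import Counter

# Independent column 'a' and the dependent variable from the README example.
a = [1] * 5 + [2] * 5
d = [1] * 5 + [2] * 5

def members(values):
    """Count dependent-variable values, listing every category (zeros included)."""
    counts = Counter(values)
    return {category: counts.get(category, 0) for category in sorted(set(d))}

# Root node: all rows.
root = members(d)

# One child per category of 'a', matching groups=[[1], [2]] in the printed split.
children = {
    category: members([dv for av, dv in zip(a, d) if av == category])
    for category in sorted(set(a))
}

print(root)      # {1: 5, 2: 5}
print(children)  # {1: {1: 5, 2: 0}, 2: {1: 0, 2: 5}}
```

The root counts match `{1: 5, 2: 5}` and the two children match `([1], {1: 5, 2: 0}, ...)` and `([2], {1: 0, 2: 5}, ...)` in the tree printout.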

pennylong123 commented 7 years ago

Thank you! Also, what does 'invalid CHAID split' mean? I thought it was when the subset at a certain node is "pure" and so cannot be split any further. However, I have tried some examples of my own and gotten that message. For example, using the following data gives me ([], {0: 5, 1: 5}, <Invalid Chaid Split>):

a b c
1 0 1
0 1 1
1 0 1
0 1 1
1 0 1
0 1 0
1 1 0
0 0 0
1 0 0
0 0 0

Why can't it split on variable a or b?

Rambatino commented 7 years ago

It just means that the node failed one of the stopping thresholds: it has either reached max depth or fallen below the minimum size thresholds for a node. You can change these to make it split more, e.g.:

https://github.com/Rambatino/CHAID/commit/53f6aae255774b25444955f5ce45d645ff168871

Maybe a more descriptive reason as to why it can't split would be appropriate.

pennylong123 commented 7 years ago

Yes, I changed the min_child_node_size, but there seem to be other thresholds that are stopping it from splitting. You mentioned max depth -- are there other thresholds too? And how can I set all of them when I am constructing the Tree?

Rambatino commented 7 years ago

See here for the parameters you can pass into the tree building algorithm:

https://github.com/Rambatino/CHAID#parameters
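
To make the idea of stopping thresholds concrete, here is a simplified sketch of the kind of checks involved. The `reasons_not_to_split` function is hypothetical and is not part of CHAID. The parameter names `max_depth` and `min_child_node_size` appear in this thread; `min_parent_node_size` and `alpha_merge`, and all the default values used here, are assumptions based on the README's parameter list and should be checked against the version you have installed. The library's real logic is more involved (for example, it merges categories before testing a split).

```python
def reasons_not_to_split(node_size, depth, child_sizes, p_value,
                         max_depth=2, min_parent_node_size=30,
                         min_child_node_size=30, alpha_merge=0.05):
    """Hypothetical helper (not CHAID's API): list the thresholds that block a split."""
    reasons = []
    if depth >= max_depth:
        reasons.append("max_depth reached")
    if node_size < min_parent_node_size:
        reasons.append("node smaller than min_parent_node_size")
    if any(size < min_child_node_size for size in child_sizes):
        reasons.append("a child would be smaller than min_child_node_size")
    if p_value > alpha_merge:
        reasons.append("split is not significant at alpha_merge")
    return reasons

# A 10-row root node whose best candidate split gives two children of 5 rows
# and a p-value of 0.53 (illustrative numbers).
with_defaults = reasons_not_to_split(
    node_size=10, depth=0, child_sizes=[5, 5], p_value=0.53)

# Lowering only min_child_node_size still leaves the other checks failing.
small_children_allowed = reasons_not_to_split(
    node_size=10, depth=0, child_sizes=[5, 5], p_value=0.53,
    min_child_node_size=1)

print(with_defaults)
print(small_children_allowed)
```

Under these assumptions, changing a single threshold is not enough when several checks fail at once, which would explain why adjusting `min_child_node_size` alone did not produce a split.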

Rambatino commented 7 years ago

But also, with such small data, it's unlikely that it will find anything to split on at any meaningful level.
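
As a rough check of that point, the sketch below computes a plain Pearson chi-square test (no continuity correction) between each independent column and the dependent column `c` for the 10 rows posted above. It uses only the standard library and does not call CHAID, so the library's own test may differ in detail. The `chi_square_2x2` helper is hypothetical and only handles two binary columns, where there is 1 degree of freedom.

```python
import math
from collections import Counter

# The 10 rows posted earlier in this thread; c is the dependent variable.
a = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
b = [0, 1, 0, 1, 0, 1, 1, 0, 0, 0]
c = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

def chi_square_2x2(x, y):
    """Pearson chi-square score and p-value for two binary columns (dof = 1)."""
    n = len(x)
    observed = Counter(zip(x, y))
    x_totals, y_totals = Counter(x), Counter(y)
    score = 0.0
    for xv in x_totals:
        for yv in y_totals:
            expected = x_totals[xv] * y_totals[yv] / n
            score += (observed[(xv, yv)] - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 >= score) equals erfc(sqrt(score / 2)).
    p_value = math.erfc(math.sqrt(score / 2))
    return score, p_value

score_a, p_a = chi_square_2x2(a, c)
score_b, p_b = chi_square_2x2(b, c)
print(f"a: score={score_a:.2f}, p={p_a:.3f}")  # a: score=0.40, p=0.527
print(f"b: score={score_b:.2f}, p={p_b:.3f}")  # b: score=0.00, p=1.000
```

Both p-values are far above a typical 0.05 significance level, so neither `a` nor `b` shows a detectable relationship with `c` in this data, regardless of the size thresholds.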

Rambatino commented 7 years ago

On a final note, you can run:

>>> import CHAID
>>> help(CHAID.Tree) # or 
>>> help(CHAID.Split)

and it will give you:

Help on class Split in module CHAID.split:

class Split(__builtin__.object)
 |  A potential split for a node in to produce children
 |
 |  Parameters
 |  ----------
 |  column : float
 |      The key of where the split is occuring relative to the input data
 |  splits : array-like
 |      The grouped variables
 |  split_map : array-like
 |      The name of the grouped variables
 |  score : float
 |      The score value of that split
 |  p : float
 |      The p value of that split
 |  dof : int
 |      The degrees of freedom as a result of this split
 |
 |  Methods defined here:
 |
 |  __init__(self, column, splits, score, p, dof)
 |
 |  __repr__(self)
 |
 |  name_columns(self, sub)
 |      Substitutes the split column index with a human readable string
 |
 |  sub_split_values(self, sub)
 |      Substitutes the splits with other values into the split_map
 |
 |  valid(self)
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
 |
 |  column
 |
 |  dof
 |
 |  groupings

pennylong123 commented 7 years ago

Thank you so much for all your help! For data that consists of strings instead of numbers, does the same procedure for constructing trees work? Also, could you give me a brief description of what each of the p-value, chi-score, and degrees of freedom mean in this context?

Rambatino commented 7 years ago

Hi @pennylong123, the algorithm is based on:

TREE-CHAID.pdf

That should be able to answer all your questions.

Yes, any categorical data (strings, floats) in the independent variable set will run through the same procedure.
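
As a brief worked example of the score, degrees of freedom, and p-value, take the root split from the README output quoted at the top of this issue. Splitting on `a` gives a 2x2 table of observed counts: group [1] holds 5 rows with dependent value 1 and 0 with value 2, and group [2] holds 0 and 5. The derivation below assumes a plain Pearson chi-square test without continuity correction; that assumption is consistent with the printed `score=10.0`, `dof=1`, and `p=0.001565402258`, but it is a reading of the output, not a description of the library's code.

```latex
\begin{aligned}
E_{ij} &= \frac{(\text{row total}_i)(\text{column total}_j)}{N}
        = \frac{5 \cdot 5}{10} = 2.5 \\
\chi^2 &= \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
        = 2 \cdot \frac{(5 - 2.5)^2}{2.5} + 2 \cdot \frac{(0 - 2.5)^2}{2.5} = 10 \\
\text{dof} &= (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1 \\
p &= P\left(\chi^2_{1} \ge 10\right) = \operatorname{erfc}\left(\sqrt{10 / 2}\right) \approx 0.001565
\end{aligned}
```

In words: the score measures how far the observed counts are from what independence between `a` and the dependent variable would predict, the degrees of freedom depend on the number of groups and dependent categories, and the p-value is the probability of a score at least this large if there were no relationship. A small p-value is what makes a split worth keeping.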

Rambatino commented 7 years ago

Closing due to inactivity.