Closed pennylong123 closed 7 years ago
Do you know @asram6? He asked pretty much the same question earlier today.
I'll update the README now.
Hi @pennylong123 please have a look at:
And give your opinions.
Thank you! Also, what does 'invalid CHAID split' mean? I thought it was when the subset at a certain node is "pure" and so cannot be split any further. However, I have tried some examples of my own and gotten that message.
For example, using the following data gives me ([], {0: 5, 1: 5}, <Invalid Chaid Split>):
a b c
1 0 1
0 1 1
1 0 1
0 1 1
1 0 1
0 1 0
1 1 0
0 0 0
1 0 0
0 0 0
Why can't it split on variable a or b?
It just means that one of the stopping thresholds has been hit: either the tree has reached max depth, or a node has gone below the minimum size thresholds. You can change these to make it split more, e.g.:
https://github.com/Rambatino/CHAID/commit/53f6aae255774b25444955f5ce45d645ff168871
:thinking: maybe a more descriptive reason as to why it can't split would be appropriate
Yes, I changed the min_child_node_size, but there seem to be other thresholds that are stopping it from splitting. You mentioned max depth -- are there other thresholds too? And how can I set all of them when I am constructing the Tree?
See here for the parameters you can pass into the tree building algorithm:
But also, with such small data, it's unlikely that it will find anything to split on at any meaningful level.
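A quick check of the sample data posted above supports this. The sketch below assumes that `c` is the dependent variable and that a split is judged by a plain Pearson chi-square test of independence (no continuity correction) against a 0.05 significance level. Both are assumptions for illustration, not a reproduction of the package's internals, and `chi_square_2x2` is a hypothetical helper that uses only the standard library.

```python
import math
from collections import Counter

# The ten rows from the question, as (a, b, c) tuples.
rows = [
    (1, 0, 1), (0, 1, 1), (1, 0, 1), (0, 1, 1), (1, 0, 1),
    (0, 1, 0), (1, 1, 0), (0, 0, 0), (1, 0, 0), (0, 0, 0),
]

def chi_square_2x2(xs, ys):
    """Pearson chi-square statistic and p-value for two binary columns."""
    n = len(xs)
    observed = Counter(zip(xs, ys))
    x_totals = Counter(xs)
    y_totals = Counter(ys)
    chi2 = 0.0
    for x in x_totals:
        for y in y_totals:
            expected = x_totals[x] * y_totals[y] / n
            chi2 += (observed[(x, y)] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

a, b, c = zip(*rows)
chi2_a, p_a = chi_square_2x2(a, c)  # predictor a against assumed dependent variable c
chi2_b, p_b = chi_square_2x2(b, c)  # predictor b against assumed dependent variable c

alpha = 0.05  # assumed significance level
for name, chi2, p in (("a", chi2_a, p_a), ("b", chi2_b, p_b)):
    print(f"{name}: chi2={chi2:.3f}, p={p:.3f}, significant={p < alpha}")
```

Under these assumptions neither `a` nor `b` comes close to the significance level, so a test of this kind would reject both splits no matter how the node size thresholds are set.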
On a final note, you can run:
>>> import CHAID
>>> help(CHAID.Tree) # or
>>> help(CHAID.Split)
and it will give you:
help on class Split in module CHAID.split:
class Split(__builtin__.object)
| A potential split for a node in to produce children
|
| Parameters
| ----------
| column : float
| The key of where the split is occuring relative to the input data
| splits : array-like
| The grouped variables
| split_map : array-like
| The name of the grouped variables
| score : float
| The score value of that split
| p : float
| The p value of that split
| dof : int
| The degrees of freedom as a result of this split
|
| Methods defined here:
|
| __init__(self, column, splits, score, p, dof)
|
| __repr__(self)
|
| name_columns(self, sub)
| Substitutes the split column index with a human readable string
|
| sub_split_values(self, sub)
| Substitutes the splits with other values into the split_map
|
| valid(self)
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| column
|
| dof
|
| groupings
Thank you so much for all your help! For data that consists of strings instead of numbers, does the same procedure for constructing trees work? Also, could you give me a brief description of what each of the p-value, chi-score, and degrees of freedom mean in this context?
Hi @pennylong123, the algorithm is based on TREE-CHAID.pdf, so that should be able to answer all your questions.
Yes, any categorical data (strings, floats) in the independent variable set will run through the same procedure.
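As a rough illustration of what the chi-score, degrees of freedom, and p-value are for string-valued categories, here is a minimal sketch using only the standard library. The data, the column meanings ("region" and a yes/no outcome), and the use of a plain Pearson chi-square test are all assumptions made up for this example, not taken from the package.

```python
import math
from collections import Counter

# Hypothetical string-valued observations: (predictor category, outcome).
pairs = (
    [("north", "yes")] * 8 + [("north", "no")] * 2
    + [("south", "yes")] * 5 + [("south", "no")] * 5
    + [("east", "yes")] * 2 + [("east", "no")] * 8
)

n = len(pairs)
observed = Counter(pairs)                    # cell counts of the contingency table
row_totals = Counter(x for x, _ in pairs)    # counts per predictor category
col_totals = Counter(y for _, y in pairs)    # counts per outcome

# Chi-score: how far the observed counts are from the counts expected
# if predictor and outcome were independent.
chi2 = 0.0
for x in row_totals:
    for y in col_totals:
        expected = row_totals[x] * col_totals[y] / n
        chi2 += (observed[(x, y)] - expected) ** 2 / expected

# Degrees of freedom: (number of predictor categories - 1) * (number of outcomes - 1).
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# p-value: probability of a chi-score at least this large under independence.
# For exactly 2 degrees of freedom the survival function is exp(-chi2 / 2).
p_value = math.exp(-chi2 / 2)

print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
```

A small p-value means the outcome distribution differs across the categories more than chance alone would explain, which is what makes a predictor a candidate for a split. The closed-form p-value used here only holds for 2 degrees of freedom; other table shapes need the general chi-square distribution.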
Closing due to inactivity.
I am new to the CHAID algorithm in general and also to this package, so could you please give me a little clarification on what certain things mean? From what I understand so far, the dataset is a numpy array containing numbers, with several independent variables and one dependent variable. Then, constructing a tree gives us the CHAID output. Is this correct so far? For the actual tree, could someone help me understand what this output means in the context of the example in the README?
(I know these are all simple questions but I am new to all of this so would like to make sure I understand the basics)
Thank you