Rambatino / CHAID

A python implementation of the common CHAID algorithm
Apache License 2.0

Continuous independent variables? #75

Closed asram6 closed 7 years ago

asram6 commented 7 years ago

Hi,

Is there any way to specify that certain independent variables are continuous? For example, if I have a variable for age, I would like that to be continuous so that for the splits, it gives ranges of age instead of each individual age in that group.

Thanks!

asram6 commented 7 years ago

It seems that independent variables can only be classified as nominal or ordinal. Is this correct? Please let me know if there is any way to have continuous independent variables.

Rambatino commented 7 years ago

There isn't.

How do you group a continuous variable? You can't really, which means you have to bucket it. So on your side, you'd have to bucket age into logical groupings, e.g. 0-10, 10-20, 20-30, etc., and then make that an ordinal categorical variable. You need a sufficient base size for each grouping, ideally.

Alternatively, if you had age ranges from 0-100 (the natural numbers up to 100), and tens of thousands of respondents, then you could just leave it as is. However, with a smaller sample size, you'd have to create your groupings beforehand, and then pass that in.
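The bucketing described above could be sketched roughly as follows. This is only an illustration using the standard library: `bucket_codes`, `bucket_labels`, and the bucket edges are hypothetical names and values, not part of CHAID. The integer codes keep the natural order of the buckets, which is what an ordinal variable needs.

```python
from bisect import bisect_right


def bucket_codes(values, edges):
    """Map each value to the index of the bucket it falls in.

    With edges [10, 20, 30], values below 10 get code 0, 10-19 get 1,
    20-29 get 2, and 30 or above get 3.
    """
    return [bisect_right(edges, v) for v in values]


def bucket_labels(edges):
    """Readable label for each code, assuming integer-valued data."""
    labels = [f"<{edges[0]}"]
    labels += [f"{lo}-{hi - 1}" for lo, hi in zip(edges, edges[1:])]
    labels.append(f"{edges[-1]}+")
    return labels


# Hypothetical ages and bucket edges, chosen only for illustration.
ages = [3, 17, 25, 30, 42, 16, 10]
edges = [10, 20, 30, 40]

codes = bucket_codes(ages, edges)   # ordered integer codes, one per respondent
labels = bucket_labels(edges)       # label for code i is labels[i]
```

The `codes` list is what would be passed in as the ordinal column. If the data already lives in pandas, `pandas.cut` does a similar job.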

asram6 commented 7 years ago

Ok, so there isn't really anything that can be done if I have an age variable that is not in buckets? By grouping the continuous variable, I actually just meant that if there is a node that looks like this:

(['female'], {1': 299, 0': 277}, (Age, p=x, score=y, groups=[['2, 25, 30, 42, 16'], ['17']]), dof=1))

I would like it to instead do this:

(['female'], {1': 299, 0': 277}, (Age, p=x, score=y, groups=[['2-16, 25-42'], ['17']]), dof=1))

So that if there are 100 different ages, the nodes don't explode into huge lists of single numbers.
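One way to get that kind of display on the caller's side, without any change to the library, is to collapse each group's values before printing. The sketch below is an assumption about what is wanted: `collapse_runs` is a hypothetical helper, not a CHAID function, and it only merges runs of consecutive whole numbers.

```python
def collapse_runs(values):
    """Render integers as comma-separated ranges of consecutive values."""
    ordered = sorted(set(values))
    if not ordered:
        return ""
    parts = []
    start = prev = ordered[0]
    for v in ordered[1:]:
        if v == prev + 1:
            # Still inside the current run of consecutive integers.
            prev = v
            continue
        # Run ended: emit it as "n" or "lo-hi", then start a new run.
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = v
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ", ".join(parts)


# Hypothetical group of ages, for illustration only.
summary = collapse_runs([2, 3, 4, 16, 25, 26, 27])
```

Ages that are missing from the data break a run, so sparse data would still print as several short ranges.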

Rambatino commented 7 years ago

Ahh, so you are passing them in individually. Make sure you're passing them in as ordinal type, otherwise it will invalidate the whole analysis. It does kinda look nominal there.

I guess it's not a totally unreasonable thing to want something that is ordinal to have some intrinsic grouping when printed.

@xulaus what do you think?

xulaus commented 7 years ago

The trouble comes in where there are non-integer ordinal values, e.g. "Strongly Agree", "Agree", "Disagree", "Strongly Disagree". How do you automatically decide how to print them? What about other languages?

That said, there does look to be something funky going on in the grouping there, as connecting integers with commas doesn't seem like it would be the right default.

Rambatino commented 7 years ago

Yeah, but if it were ordinal, numeric, and whole numbers, then it would work? Which is the case for ages.

But you're right, I don't understand the printing there. Those values should not be string concatenated.

@asram6 what if you do:

# node = the node you have
print(node.groupings)

asram6 commented 7 years ago

It says "'Node' object has no attribute 'groupings'". I think for now I will try to convert my data into buckets instead of passing in individual numbers. How do I specify the independent variable types (ordinal or nominal) when I use the Tree constructor (not the from_pandas_df method)? If I do something like: tree = Tree(ndarr, arr, split_titles=['a', 'b', 'c'], min_child_node_size=5, variable_types={'a': 'ordinal'}), it says "Unknown independent variable type a"

Rambatino commented 7 years ago

Ahh it's [node.split.groupings](https://github.com/Rambatino/CHAID/blob/master/CHAID/split.py#L62), rather.

And for that other issue, try:

tree = Tree(ndarr, arr, split_titles=['a', 'b', 'c'], min_child_node_size=5, variable_types=['ordinal', 'nominal', 'nominal'])
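Going by the reply above, the constructor seems to expect one type per column, in the same order as `split_titles`, rather than a dict keyed by name. If keeping a name-to-type mapping is more convenient, a small helper could build the list. This is a sketch only: `types_in_column_order` is a hypothetical name, and defaulting unlisted columns to 'nominal' is an assumption made here, not documented library behaviour.

```python
def types_in_column_order(split_titles, types_by_name, default="nominal"):
    """Turn a {column name: type} mapping into a list aligned with split_titles."""
    unknown = set(types_by_name) - set(split_titles)
    if unknown:
        # Catch typos in column names early rather than silently ignoring them.
        raise KeyError(f"no such column(s): {sorted(unknown)}")
    return [types_by_name.get(title, default) for title in split_titles]


split_titles = ["a", "b", "c"]
variable_types = types_in_column_order(split_titles, {"a": "ordinal"})

# The result can then be passed exactly as in the reply above, e.g.
# Tree(ndarr, arr, split_titles=split_titles, min_child_node_size=5,
#      variable_types=variable_types)
```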