Rambatino / CHAID

A Python implementation of the common CHAID algorithm
Apache License 2.0

test model performance on validation dataset #23

Closed lijinmaylee closed 7 years ago

lijinmaylee commented 8 years ago

Hi, thank you very much for the implementation!

I plan to split my dataset into development and validation set, then I would like to build the CHAID tree on development set, and test its performance on validation set.

It would be really nice if there could be a function that outputs the rules of segmentation and/or applies the rules (i.e. does model prediction) on the validation set.

Thanks a lot for your help!

Rambatino commented 8 years ago

Hi @lijinmaylee, could you please elaborate on what exactly you mean? What are these rules of segmentation? Also, if you want to build the feature via a Pull Request, we'd be happy to review it.

lijinmaylee commented 8 years ago

Hi, thanks for your reply!

What I mean by outputting the decision tree rules is similar to what is described in this thread: http://stackoverflow.com/questions/28428508/chaid-regression-tree-to-table-conversion-in-r.

And by "test its performance on the validation set", I mean something similar to the predict function in R: for a new data point, given the decision tree built on the development set, I could find out which node it falls into and what the model predicts for its class label.
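Something along the lines of this hypothetical sketch, say (predict_node and predict are invented names for illustration, not existing CHAID methods):

# Hypothetical usage: build on the development set, predict on the validation set.
tree = CHAID.from_pandas_df(dev_df, independent_columns, dependent_column)
nodes = tree.predict_node(validation_df)   # terminal node each row falls into
labels = tree.predict(validation_df)       # predicted class label for each row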

Thanks for your consideration!

Rambatino commented 8 years ago

Is it something like this:

/* Node 31 */.
IF (S16_Top2_27 != 0)  AND  (S16_Top2_21 = 0)  AND  (S16_Top2_24 != 0)  AND  (S16_Top2_2 = 0)  AND  (S16_Top2_6 != 0)
THEN
Node = 31
Prediction = 0
Probability = 0.697674

/* Node 32 */.
IF (S16_Top2_27 != 0)  AND  (S16_Top2_21 = 0)  AND  (S16_Top2_24 != 0)  AND  (S16_Top2_2 = 0)  AND  (S16_Top2_6 = 0)
THEN
Node = 32
Prediction = 1
Probability = 0.586538

/* Node 33 */.
IF (S16_Top2_27 != 0)  AND  (S16_Top2_21 = 0)  AND  (S16_Top2_24 != 0)  AND  (S16_Top2_2 != 0)  AND  (S16_Top2_17 != 0)
THEN
Node = 33
Prediction = 0
Probability = 0.920592

/* Node 34 */.
IF (S16_Top2_27 != 0)  AND  (S16_Top2_21 = 0)  AND  (S16_Top2_24 != 0)  AND  (S16_Top2_2 != 0)  AND  (S16_Top2_17 = 0)
THEN
Node = 34
Prediction = 0
Probability = 0.683333

/* Node 35 */.
IF (S16_Top2_27 != 0)  AND  (S16_Top2_21 = 0)  AND  (S16_Top2_24 = 0)  AND  (S16_Top2_22 != 0)  AND  (S16_Top2_17 != 0)
THEN
Node = 35
Prediction = 0
Probability = 0.671795

/* Node 36 */.
IF (S16_Top2_27 != 0)  AND  (S16_Top2_21 = 0)  AND  (S16_Top2_24 = 0)  AND  (S16_Top2_22 != 0)  AND  (S16_Top2_17 = 0)
THEN
Node = 36
Prediction = 1
Probability = 0.625000

I'm having a hard time pinning down exactly what you mean, even with that link.

If that isn't the kind of thing you want, could you please be very specific in your request and we'll try to build it.
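For reference, rules like those above could be applied to a validation frame by hand with something like this rough Python sketch (the rule encoding and the classify helper are illustrative, not library code):

# Illustrative hand-encoding of Nodes 31 and 32 from the output above, as
# (node id, predicate over a row, predicted class, probability) tuples.
rules = [
    (31, lambda r: (r['S16_Top2_27'] != 0 and r['S16_Top2_21'] == 0 and
                    r['S16_Top2_24'] != 0 and r['S16_Top2_2'] == 0 and
                    r['S16_Top2_6'] != 0), 0, 0.697674),
    (32, lambda r: (r['S16_Top2_27'] != 0 and r['S16_Top2_21'] == 0 and
                    r['S16_Top2_24'] != 0 and r['S16_Top2_2'] == 0 and
                    r['S16_Top2_6'] == 0), 1, 0.586538),
]

def classify(row):
    """Return (node, prediction) for the first rule the row satisfies."""
    for node, predicate, prediction, _probability in rules:
        if predicate(row):
            return node, prediction
    return None, None  # the row matched none of the encoded terminal nodes

# validation_df must carry the same S16_Top2_* columns as the training data:
# predictions = validation_df.apply(classify, axis=1)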

lijinmaylee commented 8 years ago

Hi,

That is exactly what I meant.

Thanks so much!

xulaus commented 8 years ago

Hi @lijinmaylee, did you want this as a text output, or some sort of object that will do the classification for you?

Rambatino commented 8 years ago

It shouldn't be too hard to build, and it would be good for debugging. I'd prefer a more structured output, though.

lijinmaylee commented 8 years ago

hi @xulaus , an object would be preferred.

Thanks a lot for your effort!

xulaus commented 8 years ago

There's still some uncertainty here about what you mean. Would a pandas DataFrame fit your use case? If not, could you provide some example code of how you would like the API to look?

lijinmaylee commented 8 years ago

Hi,

I would expect the test/validation dataset to have exactly the same structure as the development dataset: both are pandas DataFrames with the same columns.

I hope this clears up the confusion. Thanks for your consideration.

Tizpaz commented 8 years ago

Hello,

Could you please help me with how to measure the performance (accuracy) of a CHAID model? I use cross_validation and split the data into 80% for training and 20% for testing. After training the CHAID decision tree, I want to know how to measure its accuracy on the 20% of test data:

from sklearn import cross_validation  # sklearn's pre-0.18 cross-validation module
from CHAID import CHAID

# hold back 20% of the rows for testing
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)

tree = CHAID(X_train, y_train, max_depth=20, split_titles=header)
tree.to_tree().show()

# ?? how to measure the accuracy of the model on the 20% of test data?

Thanks, Saeid

Rambatino commented 8 years ago

Are these the same issues? We're most likely going to be building a way of returning the decision rules of the tree.

How are you defining 'accuracy', @Tizpaz? Can you point us to some resources and we can look into building this feature?

Tizpaz commented 8 years ago

Maybe this is another issue. In machine learning, after training a classifier (like CHAID) on, for example, 80% of the data set, we need to measure the accuracy of the classifier using the remaining 20%. This is usually done by a cross-validation function which splits the data into train and test sets. The accuracy is then the rate of correct prediction of the class label on the test set. For a binary class label:

Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)

Please refer to https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers for more information.
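As a toy worked example (the counts are invented), the computation is just:

# Toy confusion counts for a binary classifier -- illustrative numbers only.
tp, tn, fp, fn = 70, 20, 6, 4

accuracy = (tp + tn) / float(tp + tn + fp + fn)  # float() keeps Python 2 happy
print(accuracy)  # 0.9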

Rambatino commented 8 years ago

What you are asking for (and do correct me if I'm wrong) is to determine the accuracy of the model. You train the model on one subset, taking the logic that assigned respondents to terminal nodes in that set (e.g. people who are male, 20-25 and blonde fall into this terminal node, which is what this issue deals with). You then look for the same set of characteristics among individuals in the second data set and work out the frequency with which those stopping rules still predict the observed choice.

Let me break down how I think the logic will work. You'd have a function:

CHAID.from_pandas_df(...).accuracy(test_size=0.2)

which would split your dataset into 80% to train on and the remaining 20% to test on (which side of the split a respondent/row falls on would be random?).

Now we need to define exactly what we mean by TP, TN, FP and FN. I think we already have something similar, but it's a distinct concept (model risk). Let me break down how this will all work:

You build the stopping rules for a tree. Let's take 3 independent variables X1, X2 & X3 with a single Y.

N8 (which is a terminal node) has the stopping rule X1{1,2} & X2{1}, which means that members of this particular node are in group 1 or 2 of X1 (for instance, are either blonde or brown haired) and in group 1 of X2 (for instance, are male if X2 is gender).

Now these stopping rules are important, because they lead on to the next step. Members of this group made a choice about Y (which in this instance is a binary dependent variable, but it could have more categories). Let's say 30% chose no and 70% chose yes to "do you drink coca-cola" (if you want to take a marketing slant).

So this terminal node (N8), which conforms to the rule X1{1,2} & X2{1}, predicts that someone drinks coca-cola (because most people in this node do).

Now these stopping rules were derived from 80% of your data. An individual with these characteristics is predicted to select "yes".

Then we run through the other 20% of the data. You have the same set of characteristics X1, X2 & X3 and you have Y. All you need to do is run through this dataset and look at what each respondent chose for Y, given their specific X1, X2 and X3 values.

So going back to the previous example for N8 and determining TP, TN, FP & FN:

Y{1} = Yes .... positive selection
Y{2} = No .... negative selection

X1 | X2 | X3 | Y
1  |  1 | 2  | 1 <=== TP (selects yes and you predict yes, given the stopping rules)
2  |  1 | 1  | 2 <=== FP (selects no, but you predict that they would have said yes, given the rules)
1  |  2 | 1  | 1 <=== FN (selects yes, but the node defined by this splitting rule (some other node) predicts no)
1  |  2 | 1  | 2 <=== TN (selects no and no is predicted from these X values (as in the previous row))
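A rough sketch of that tallying step, assuming a hypothetical predict_y(row) helper that returns the Y predicted by the development-set stopping rules:

POSITIVE, NEGATIVE = 1, 2  # Y{1} = yes, Y{2} = no

counts = {'TP': 0, 'TN': 0, 'FP': 0, 'FN': 0}
for _, row in test_df.iterrows():  # the held-out 20%
    predicted = predict_y(row)     # hypothetical: derived from the dev-set rules
    observed = row['Y']
    if predicted == POSITIVE:
        counts['TP' if observed == POSITIVE else 'FP'] += 1
    else:
        counts['TN' if observed == NEGATIVE else 'FN'] += 1

accuracy = (counts['TP'] + counts['TN']) / float(sum(counts.values()))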

Does this clarify things?

Also, what if it isn't a binary dependent split? Do you somehow group the top 2 categories as positive and the bottom 2 as negative if there are four, or is just the top one positive? Maybe it should be a decision left to the caller.

i.e. for Y{1.0, 2.0, 3.0, 4.0}

CHAID.from_pandas_df(...).accuracy(test_size=0.2, positive=[1.0])

would treat 1.0 as positive and all the others as negative.
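A minimal sketch of that binarisation, with positive as the hypothetical parameter above:

def binarise(y_values, positive):
    """Map each Y category to True (positive) or False (negative)."""
    return [y in positive for y in y_values]

binarise([1.0, 2.0, 3.0, 1.0], positive=[1.0])  # -> [True, False, False, True]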

Rambatino commented 8 years ago

There is a new PR (which will need some work): https://github.com/Rambatino/CHAID/pull/36. It will produce the stopping rules, or at the very least its method should be a precursor to that.

Also, we do have a 'risk' calculation, which compares the observed dependent choice to the predicted one (defined by the modal choice of the node that row fell into) and calculates how well the model predicts the observed value.
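For reference, a minimal sketch of that risk idea (not the library's actual implementation; node_of and modal_class are hypothetical stand-ins for the tree internals):

def risk(rows, node_of, modal_class):
    """Fraction of rows whose observed Y differs from the modal Y of the
    terminal node they fall into -- i.e. 1 - accuracy on those rows."""
    wrong = sum(1 for row in rows if row['Y'] != modal_class[node_of(row)])
    return wrong / float(len(rows))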

xulaus commented 7 years ago

Fixed with PR #47; this will be in version 3.0.0.
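For anyone landing here later: the released API exposes prediction helpers along these lines (a sketch based on the project README; verify the names against your installed version):

from CHAID import Tree

# Assumed setup: three nominal independent variables plus a dependent column.
tree = Tree.from_pandas_df(df, dict(zip(independent_columns, ['nominal'] * 3)),
                           dependent_column)
tree.node_predictions()   # terminal node assigned to each row of df
tree.model_predictions()  # modal dependent category predicted for each row
tree.risk()               # the tree's misclassification risk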