Closed — @LUSAQX closed this issue 5 years ago
Hi @LUSAQX, thanks for getting in touch!
1) You want to run the CHAID model on a dataset (i.e. train it) and then test it on an unseen dataset?

I've written some code that calculates the model's accuracy on a held-out set. Drop an `import ipdb; ipdb.set_trace()` at line 82 of `__main__.py`, then run

```
python3 -m CHAID tests/data/titanic.csv survived sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05
```

to hit the breakpoint. You can then test the model like so:
```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# `data`, `nspace`, `types` and `config` are already in scope at the
# breakpoint in __main__.py
dep_var = nspace.dependent_variable[0]
X = data.drop(dep_var, axis=1)
y = data[dep_var]
X_train, X_test, y_train, y_test = train_test_split(X, y)

# train the tree on the training portion only
X_train[dep_var] = y_train
tree = Tree.from_pandas_df(X_train, types, dep_var, **config)

# the split variables actually used by the fitted tree
cols = {node.split_variable for node in tree.tree_store
        if node.split_variable is not None}

# one row per observed combination of split variables, with its prediction
X_train['predictions'] = tree.model_predictions()
unique_predictions = X_train[list(cols) + ['predictions']].drop_duplicates()

# map those predictions onto the held-out rows and score them
X_test[dep_var] = y_test
test = unique_predictions.merge(X_test, on=list(cols))
accuracy_score(test[dep_var].tolist(), test['predictions'].tolist())
```
There should be a built-in function to do this, but we haven't had a use case for it. If you'd like to contribute to the library, feel free to submit a PR.

2) There currently isn't a way of calculating propensity scores in this package. We could look into it, though I'm not sure how complex it would be and thus how long it would take.

*N.B. that code works, but it hasn't been tested properly.*
Hi @Rambatino, thanks for the kind response. After carefully reviewing your code above, I have some points on which I'd like your clarification.
Can you confirm whether my understanding of your code matches what you intended it to do?
Cheers.
So `tree.model_predictions()` uses the model built on the training data to work out what it would have predicted on that same dataset. Of course, this leads to overfitting, but it's a reasonable approximation for the accuracy of that model.
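To make the overfitting point concrete, here is a small illustration, with sklearn's `DecisionTreeClassifier` standing in for the CHAID tree (the data and model choice are illustrative, not part of the CHAID package): scoring a tree on the data it was trained on can report perfect accuracy even when the model has learned nothing generalisable.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Unique feature rows + purely random labels: an unrestricted tree can
# memorise the training set, so in-sample accuracy is perfect while
# held-out accuracy is near chance.
rng = np.random.RandomState(0)
X = np.arange(200).reshape(-1, 1)
y = rng.randint(0, 2, size=200)

model = DecisionTreeClassifier().fit(X[:150], y[:150])

train_acc = accuracy_score(y[:150], model.predict(X[:150]))
test_acc = accuracy_score(y[150:], model.predict(X[150:]))
print(train_acc, test_acc)
```

This is why scoring on a held-out split, as in the snippet above, gives a more honest estimate.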
In general, CHAID assumes the predictor variables are categorical; they have to be, otherwise you can't run the algorithm. If age is not bucketed, for instance, it yields an inordinate number of potential groups.
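One common way to bucket a continuous column before feeding it to CHAID is pandas' `pd.cut` (the column name, bin edges and labels below are purely illustrative):

```python
import pandas as pd

# Hypothetical continuous predictor; CHAID needs categorical inputs,
# so bucket it into a small, fixed set of bands first.
df = pd.DataFrame({'age': [4, 17, 25, 33, 48, 62, 71, 80]})

# pd.cut assigns each value to a labelled, right-inclusive interval
df['age_band'] = pd.cut(df['age'],
                        bins=[0, 18, 35, 60, 120],
                        labels=['child', 'young', 'middle', 'senior'])

print(df['age_band'].tolist())
```

The bucketed `age_band` column can then be passed to the tree in place of raw `age`.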
To your last point: yes, that is a limitation of my approach there. But if you do a 20/80 split, it should be very unlikely that a combination in the test set never occurred in training (unless you go really deep with your tree, which is a bad idea on a small dataset anyway).
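If you want to check for that limitation explicitly, a left merge with `indicator=True` flags test rows whose feature combination never appeared in training, instead of silently dropping them the way an inner merge does. A minimal sketch, with toy stand-ins for the `unique_predictions` and `X_test` frames from the snippet above:

```python
import pandas as pd

# One row per combination of split variables seen in training,
# plus the model's prediction for it (toy values).
unique_predictions = pd.DataFrame({
    'sex':         ['male', 'female'],
    'embarked':    ['S', 'C'],
    'predictions': [0, 1],
})

# A test set containing one combination ('female', 'S') never seen in training.
X_test = pd.DataFrame({
    'sex':      ['male', 'female', 'female'],
    'embarked': ['S', 'C', 'S'],
})

merged = X_test.merge(unique_predictions, on=['sex', 'embarked'],
                      how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(len(unmatched))  # test rows for which the model has no prediction
```

Counting `unmatched` before scoring tells you how many held-out rows the accuracy figure silently excludes.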
I think what the library needs is a new class (ML, or some other name) that takes a dataset, an iteration count and potentially some other parameters, uses the hyperopt library, and finds the best-fitting CHAID tree for the dataset across multiple train/test iterations.
But for now, that code should at least work. I may get around to building the new class this weekend, although it's unlikely that I'll find time.
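The shape of that idea can be sketched without hyperopt: average the held-out accuracy over several random splits per candidate configuration and keep the best. Below, sklearn's `DecisionTreeClassifier` stands in for the CHAID tree (a `Tree.from_pandas_df` call with the same hyper-parameters would slot into `fit_and_score`), and the dataset is synthetic:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic data: label depends noisily on the first feature
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)

def fit_and_score(max_depth, seed):
    # One train/test iteration for a given hyper-parameter setting
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    model = DecisionTreeClassifier(max_depth=max_depth,
                                   random_state=0).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

# Average held-out accuracy over several splits per candidate depth
scores = {d: np.mean([fit_and_score(d, s) for s in range(5)])
          for d in (2, 3, 4, 5)}
best_depth = max(scores, key=scores.get)
print(best_depth, scores[best_depth])
```

hyperopt would replace the grid with a smarter search over the same objective, but the train/test loop is the essential part.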
@LUSAQX you're right, it doesn't work for all trees. I'm going to create a more formal solution for this during this week.
This adds a predict function (as well as a class for finding the best tree): https://github.com/Rambatino/CHAID/pull/91
Closing as stale
Hi Authors,
Thank you for your great work and for open-sourcing this CHAID implementation.
I am using your package in my project, but I can find little information about how to make predictions on a test set from a trained CHAID model. Also, is it possible to calculate propensity scores with the current capabilities of this package? Looking forward to your reply, thanks :)