Rambatino / CHAID

A python implementation of the common CHAID algorithm
Apache License 2.0

Make predictions on testing set and calculate the propensity scores #89

Closed LUSAQX closed 5 years ago

LUSAQX commented 5 years ago

Hi Authors,

Thank you for your great work and open source packages for CHAID implementation.

I am using your package in my project, but I can find little information on how to make predictions on a testing set from a trained CHAID model. Also, is it possible to calculate propensity scores with the package's current capabilities? Looking forward to your reply, thanks :)

Rambatino commented 5 years ago

Hi @LUSAQX, thanks for getting in touch!

1) You want to run the CHAID model on a dataset (i.e. train it) and then test that on an unseen dataset?

I've written some code that calculates the accuracy of applying it to an unseen set. Drop an import ipdb; ipdb.set_trace() at line 82 of __main__.py and run python3 -m CHAID tests/data/titanic.csv survived sex embarked --max-depth 4 --min-parent-node-size 2 --alpha-merge 0.05 to hit it. You can then test the model like so:

    # run at the ipdb breakpoint, where data, nspace, types and config are in scope
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X = data.drop(nspace.dependent_variable[0], axis=1)
    y = data[nspace.dependent_variable[0]]
    X_train, X_test, Y_train, Y_test = train_test_split(X, y)

    # fit the tree on the training split only
    X_train[nspace.dependent_variable[0]] = Y_train
    tree = Tree.from_pandas_df(X_train, types, nspace.dependent_variable[0],
                               **config)

    # map each combination of split-variable values to the tree's prediction
    cols = set([ x.split_variable for x in tree.tree_store if x.split_variable is not None ])
    X_train['predictions'] = tree.model_predictions()
    unique_predictions = X_train[list(cols) + ['predictions']].drop_duplicates()

    # look up predictions for the unseen rows and score them
    X_test[nspace.dependent_variable[0]] = Y_test
    test = unique_predictions.merge(X_test, left_on=list(cols), right_on=list(cols))
    accuracy_score(test[nspace.dependent_variable[0]].tolist(), test['predictions'].tolist())

There should be an in-built function to do this, but we haven't had a use case for it. If you'd like to contribute to the library, feel free to submit a PR.
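
As a rough illustration of what such a helper might look like (predict_unseen is a hypothetical name, not an existing API in the package), it just wraps the lookup-by-split-variables trick from the snippet above:

    def predict_unseen(tree, train_df, test_df):
        """Look up predictions for unseen rows by their split-variable values."""
        cols = list({n.split_variable for n in tree.tree_store
                     if n.split_variable is not None})
        lookup = train_df.assign(predictions=tree.model_predictions())
        lookup = lookup[cols + ['predictions']].drop_duplicates()
        # inner merge: test rows whose combination of split-variable values
        # never appeared in the training data are silently dropped
        return lookup.merge(test_df, on=cols)

    # e.g. scored = predict_unseen(tree, X_train, X_test)
    #      accuracy_score(scored[nspace.dependent_variable[0]], scored['predictions'])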

2) There currently isn't a way of calculating propensity scores in this package. We could look into it, though I'm not sure how complex it would be and therefore how long it would take.
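
For what it's worth, one rough workaround outside the package (continuing from the snippet above, and assuming a binary 0/1 dependent variable) is to treat the share of positives among training rows with the same combination of split-variable values as a propensity score and merge that onto the test set:

    # not a package feature: estimate a propensity per combination of the
    # tree's split variables from the training data, then attach it to test rows
    dep = nspace.dependent_variable[0]
    cols = list({n.split_variable for n in tree.tree_store
                 if n.split_variable is not None})
    propensity = (X_train.groupby(cols)[dep]
                         .mean()
                         .rename('propensity')
                         .reset_index())
    X_test_scored = X_test.merge(propensity, on=cols, how='left')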

Rambatino commented 5 years ago

N.B. that code works, but it hasn't been tested properly.

LUSAQX commented 5 years ago

Hi @Rambatino, thanks for the kind response. After carefully reviewing your code above, I have a few points on which I'd like your clarification.

Can you confirm whether my understanding of your code matches what you intend it to do?

Cheers.

Rambatino commented 5 years ago

So tree.model_predictions() takes the model built on the training data and works out what it would have predicted for that same dataset. Of course, this leads to overfitting, but it's a reasonable approximation for gauging the accuracy of that model.
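
For example, continuing from the snippet in the earlier comment, the (optimistic, in-sample) accuracy on the training data itself would be:

    from sklearn.metrics import accuracy_score

    # accuracy of the tree on the same data it was fitted on
    accuracy_score(X_train[nspace.dependent_variable[0]],
                   tree.model_predictions())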

In general, CHAID does assume the predictor variables are categorical; they have to be, otherwise you can't run the algorithm. If age is not bucketed, for instance, it produces an inordinate number of potential groups.
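
As a hypothetical illustration (the titanic example above only uses sex and embarked, and df here stands for your raw DataFrame), a continuous column such as age would need to be bucketed before being used as a predictor, e.g. with pandas:

    import pandas as pd

    # bucket a continuous column into ordered bands before passing it to the tree
    df['age_band'] = pd.cut(df['age'],
                            bins=[0, 12, 18, 35, 60, 120],
                            labels=['child', 'teen', 'young_adult', 'middle_aged', 'senior'])
    # age_band can then be listed (e.g. as ordinal) in the types dict given to
    # Tree.from_pandas_df instead of the raw age column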

To your last point, yes, that is a limitation of my approach there. But if you do a 20/80 test/train split, it should be very unlikely that a combination in the test set doesn't occur in the training set (unless you go really deep with your tree, which is a bad idea on a small dataset anyway).
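
If you want to check how often that happens, a quick plain-pandas check (continuing from the earlier snippet, not a library feature) is to do a left merge with an indicator and count the unmatched test rows:

    # flag test rows whose split-variable combination never occurred in training
    checked = X_test.merge(unique_predictions, on=list(cols),
                           how='left', indicator=True)
    unmatched = checked[checked['_merge'] == 'left_only']
    print('%d of %d test rows have unseen combinations' % (len(unmatched), len(checked)))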

I think what the library needs is a new class (ML, or some other name) that takes a dataset, an iteration count and potentially some other variables, uses the hyperopt library, and finds the best-fitting CHAID tree for the dataset over multiple iterations of train/test.
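
A very rough sketch of that idea, not anything that exists in the package (the function name and search ranges here are placeholders): it uses hyperopt to search over max_depth and min_parent_node_size, scoring each candidate tree on a held-out split with the same lookup-by-split-variables trick as above.

    from hyperopt import fmin, hp, tpe
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    from CHAID import Tree


    def best_chaid_params(df, types, dep_column, max_evals=25):
        train, test = train_test_split(df, test_size=0.2)

        def objective(params):
            tree = Tree.from_pandas_df(
                train, types, dep_column,
                max_depth=int(params['max_depth']),
                min_parent_node_size=int(params['min_parent_node_size']),
            )
            cols = list({n.split_variable for n in tree.tree_store
                         if n.split_variable is not None})
            if not cols:  # no split at all; fall back to in-sample accuracy
                return 1 - accuracy_score(train[dep_column], tree.model_predictions())
            # score the held-out split via the split-variable lookup
            lookup = (train.assign(predictions=tree.model_predictions())
                           [cols + ['predictions']].drop_duplicates())
            matched = lookup.merge(test, on=cols)
            return 1 - accuracy_score(matched[dep_column], matched['predictions'])

        space = {
            'max_depth': hp.quniform('max_depth', 2, 6, 1),
            'min_parent_node_size': hp.quniform('min_parent_node_size', 2, 60, 2),
        }
        # hyperopt minimises the returned loss, hence 1 - accuracy above
        return fmin(objective, space, algo=tpe.suggest, max_evals=max_evals)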

But for now, the accuracy code above should at least work. I may get around to building the new class over this weekend, although it's unlikely that I'll find the time.

Rambatino commented 5 years ago

@LUSAQX you're right, it doesn't work for all trees. I'm going to create a more formal solution for this during this week.

Rambatino commented 5 years ago

This adds a predict function (as well as a class for finding the best tree): https://github.com/Rambatino/CHAID/pull/91

Rambatino commented 5 years ago

Closing as stale