apple / turicreate

Turi Create simplifies the development of custom machine learning models.
BSD 3-Clause "New" or "Revised" License
11.2k stars 1.14k forks source link

Model prediction should fail if it finds no known features #933

Open TobyRoseman opened 6 years ago

TobyRoseman commented 6 years ago

The following case should be an error: train a model on some data then call predict with completely unrelated data (i.e. no features in common with the training set).

All of turicreate's regression and classification models produce predictions without errors. The following code works without any errors or warnings, and each model gives a prediction:

import turicreate as tc

def check_all_missing_column_classifier_predict(create_method):
    data = tc.SFrame({'feature1': [9,9,1,1], 'feature2': [1,1,9,9], 'label': ['a','a','b','b']})
    model = tc.boosted_trees_classifier.create(data, target='label')

    other_data = tc.SFrame({'feature3': [5]})
    result = model.predict(other_data)
    print("Prediction: {}".format(result))

check_all_missing_column_classifier_predict(tc.boosted_trees_classifier)
check_all_missing_column_classifier_predict(tc.decision_tree_classifier)
check_all_missing_column_classifier_predict(tc.logistic_classifier)
check_all_missing_column_classifier_predict(tc.nearest_neighbor_classifier)
check_all_missing_column_classifier_predict(tc.random_forest_classifier)
check_all_missing_column_classifier_predict(tc.svm_classifier)

def check_all_missing_column_regressor_predict(create_method):
    data = tc.SFrame({'feature1': [9,9,1,1], 'feature2': [1,1,9,9], 'y': [0,0,10,10]})
    model = tc.boosted_trees_classifier.create(data, target='y')

    other_data = tc.SFrame({'feature3': [5]})
    result = model.predict(other_data)
    print("Prediction: {}".format(result))

check_all_missing_column_regressor_predict(tc.boosted_trees_regression)
check_all_missing_column_regressor_predict(tc.decision_tree_regression)
check_all_missing_column_regressor_predict(tc.linear_regression)
check_all_missing_column_regressor_predict(tc.random_forest_regression)
znation commented 6 years ago

@TobyRoseman Can you take this one if you have some time? I think you'll know better than most where it should be fixed.

srikris commented 6 years ago

This issue is much more subtle than it appears. I'm not exactly sure this is the right fix and I don't think we should take this fix for 5.0. Its an issue that requires quite a bit of discussion.

We chose to make predictions using the most information we have during prediction time. The default option for prediction is imputing when features are not present during predict time. For a single example, if a single feature is missing, imputing is probably the right thing to do. In batch mode, if one example in one row is missing, imputing a single value is the right thing to do. When you add these two things up and want a consistent API, the current API is the only choice.

The user always has the option of turning off imputation in all the predict calls but that's not the default option. Changing the default will revert the "right" thing to do in those other cases but make this case less of a surprise. We could warn the user that we are imputing a lot of data and still continue.

TobyRoseman commented 6 years ago

I think we need to differentiate between missing values and missing columns. If a row is missing a value, then imputing makes sense. However if the input SFrame is missing an entire column, I think that should be an error.

srikris commented 6 years ago

When you have only a single example (which we support through dictionaries and SFrames), the missing column and missing feature are the same thing.

znation commented 6 years ago

@srikris @TobyRoseman As per the title, we're only talking about the case here where there are no known features. I think it's fundamentally different to impute all feature columns vs. some of them. Imputing all is kind of unexpected and I'm not sure it's ever desirable. To me it seems much more likely that it's a bug in the calling code (that should be found and fixed) vs. an entirely missing row.

srikris commented 6 years ago

Its not a bug. We can impute multiple features. Here are some scenarios. All of these share the same API at the moment and they all impute.

The principle applied during imputing is a point estimate that minimizes the variance in the prediction. Its not a bug at all and is completely by design to accommodate a consistent API and behavior for all of the cases above.