Open TobyRoseman opened 6 years ago
@TobyRoseman Can you take this one if you have some time? I think you'll know better than most where it should be fixed.
This issue is much more subtle than it appears. I'm not exactly sure this is the right fix and I don't think we should take this fix for 5.0. Its an issue that requires quite a bit of discussion.
We chose to make predictions using the most information we have during prediction time. The default option for prediction is imputing when features are not present during predict time. For a single example, if a single feature is missing, imputing is probably the right thing to do. In batch mode, if one example in one row is missing, imputing a single value is the right thing to do. When you add these two things up and want a consistent API, the current API is the only choice.
The user always has the option of turning off imputation in all the predict calls but that's not the default option. Changing the default will revert the "right" thing to do in those other cases but make this case less of a surprise. We could warn the user that we are imputing a lot of data and still continue.
I think we need to differentiate between missing values and missing columns. If a row is missing a value, then imputing makes sense. However if the input SFrame is missing an entire column, I think that should be an error.
When you have only a single example (which we support through dictionaries and SFrames), the missing column and missing feature are the same thing.
@srikris @TobyRoseman As per the title, we're only talking about the case here where there are no known features
. I think it's fundamentally different to impute all feature columns vs. some of them. Imputing all is kind of unexpected and I'm not sure it's ever desirable. To me it seems much more likely that it's a bug in the calling code (that should be found and fixed) vs. an entirely missing row.
Its not a bug. We can impute multiple features. Here are some scenarios. All of these share the same API at the moment and they all impute.
The principle applied during imputing is a point estimate that minimizes the variance in the prediction. Its not a bug at all and is completely by design to accommodate a consistent API and behavior for all of the cases above.
The following case should be an error: train a model on some data then call
predict
with completely unrelated data (i.e. no features in common with the training set).All of turicreate's regression and classification models produce predictions without errors. The following code works without any errors or warnings, and each model gives a prediction: