biolab / orange3

🍊 Orange: Interactive data analysis
https://orangedatamining.com

Predictions differ when training/predicting in the same instance versus pickling the model and loading it into a new instance for predictions #3521

Closed. pomalley44 closed this issue 5 years ago.

pomalley44 commented 5 years ago
Orange version

3.17.0

Expected behavior

The two methods below should yield the same predictions:

Method A:

Start a new Python instance.
Load the training data into an Orange table (train_data).
Create a learner:

learner = RandomForestRegressionLearner(random_state=0, n_estimators=50, min_samples_split=2)

Create a model by training the learner on train_data:

model = learner(train_data)

Pickle model

with open('model.pickle', "wb") as f:
    pickle.dump(model, f, 0)

Load the target data into an Orange table (predict_data).
Unpickle the model:

with open('model.pickle', "rb") as f:
    model = pickle.load(f)

Run predictions

pred = model(predict_data)

Method B:

Start a new Python instance.
Load the target data into an Orange table (predict_data).
Unpickle the model:

with open('model.pickle', "rb") as f:
    model = pickle.load(f)

Run predictions

pred = model(predict_data)
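
For reference, here is a consolidated sketch of the two methods in one place. The file names train.tab, predict.tab, and model.pickle are placeholders rather than the actual files from this report.

import pickle

from Orange.data import Table
from Orange.regression import RandomForestRegressionLearner

# Method A: train, pickle, and predict in a single process.
train_data = Table("train.tab")
learner = RandomForestRegressionLearner(
    random_state=0, n_estimators=50, min_samples_split=2)
model = learner(train_data)
with open("model.pickle", "wb") as f:
    pickle.dump(model, f, 0)
predict_data = Table("predict.tab")
pred_a = model(predict_data)

# Method B: run only the lines below in a fresh process, loading nothing
# but the prediction data and the pickled model.
predict_data = Table("predict.tab")
with open("model.pickle", "rb") as f:
    model = pickle.load(f)
pred_b = model(predict_data)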

Actual behavior

Unpickling the model in a new instance and running predictions gives different results from running predictions in the same instance where the model was trained.

Steps to reproduce the behavior

I have tried to reproduce this error with the built-in datasets, but I have not been successful.

Additional info (worksheets, data, screenshots, ...)

I'm using random forest regression with random_state=0, n_estimators=50, min_samples_split=2. My training data set is ~15,000 rows with ~350 columns: a mixture of continuous features, binary features, and categorical features (which I'm relying on Orange to preprocess with one-hot encoding).

The results from Method A and Method B are significantly different. The predicted populations have different means and standard deviations. The predictions in Method B appear much "flatter", with both the mean and the standard deviation much lower. However, there is still some correlation: items with a high prediction in Method A tend to have a high prediction in Method B (relative to the overall range).

I've included pickling/unpickling in Method A above for symmetry's sake. Leaving the model in memory yields the same results. I've also tried:

Both of the above experiments still yield the same prediction results in Method A.

My suspicion is that some sort of preprocessing is being kept in memory but not saved in the pickled model, but I don't know what that could be or how to figure it out.

I'm happy to try some different things if someone can point me in the right direction.

ales-erjavec commented 5 years ago

Can you also try loading the train_data in the second process before loading the predict_data?
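
In other words, something along these lines in the fresh process, using the same placeholder file names as in the sketch above:

import pickle
from Orange.data import Table

train_data = Table("train.tab")      # loaded first, but otherwise unused
predict_data = Table("predict.tab")
with open("model.pickle", "rb") as f:
    model = pickle.load(f)
pred = model(predict_data)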

pomalley44 commented 5 years ago

That worked!

Can you tell me why that worked? And should I have known based on documentation somewhere?

The results are now exactly the same in Method A and Method B. I didn't even use the train_data; it's just sitting there. I spent two weeks reviewing source code and messing with the dataset, pickle, and hyperparameters... and you solved it in two minutes.

markotoplak commented 5 years ago

I am reopening this issue because it is a bug. You should not need the training data for the unpickled classifier to work properly.

Orange reuses feature descriptors to make your test and train data compatible, so that you can build a classifier on train and then apply it to separate test data.

Feature descriptors therefore depend on how the data is loaded. In your case, I guess, Orange's internal representation of the nominal features' values was different depending on how you opened the data. And then the classifier or its internal preprocessors did not handle the differences correctly.

Can you see what happens if you prepare your data in a .tab format where you explicitly list possible values for both of your files? For both files, these need to be the same and also in the same order. See the adult.tab file that comes with Orange for an example.
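
One way to check this guess is to compare the value lists of the discrete features in the unpickled model's domain with those in the freshly loaded prediction table. A rough sketch, again using the placeholder file names from above:

import pickle
from Orange.data import Table, DiscreteVariable

predict_data = Table("predict.tab")
with open("model.pickle", "rb") as f:
    model = pickle.load(f)

# Print discrete features whose value lists (or value order) differ between
# the model's training domain and the prediction table's domain.
for var in model.domain.attributes:
    if isinstance(var, DiscreteVariable) and var.name in predict_data.domain:
        other = predict_data.domain[var.name]
        if isinstance(other, DiscreteVariable) and list(var.values) != list(other.values):
            print(var.name, list(var.values), list(other.values))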

pomalley44 commented 5 years ago

I can work on that, but it won't be quick. There are a lot of discrete features in my dataset: about 150 are binary and 20 are categorical with ~10-100 possible values each. It also wouldn't be a feasible long-term solution, as the list of possible values changes over time.

I can confirm there are definitely differences between the feature sets of the two files. The test data contains both binary features and categorical values that are not in the train data, and it can also lack binary features that are present in the train data. These discrepant features are likely to be pretty far down the list in terms of importance, so I can live with the averaging that happens when a tree hits a node it can't make a decision on.

In setting up the .tab file, how do I handle possible values that contain spaces? Can I enclose them in quotes, or do I need to rework the dataset to use underscores?

ales-erjavec commented 5 years ago

You might also try just switching the order in which the model and predict_data are loaded, i.e. load the model first, then the predict_data (without the train_data). That might also work.

"And then the classifier or its internal preprocessors did not handle the differences correctly."

But no classifier/preprocessor in Orange actually does that.

"In setting up the .tab file, how do I handle possible values that have spaces?"

Spaces in values listed in the second header row must be escaped with a backslash.

E.g., for a column A with the two values 'A B' and 'C D':

A
A\ B C\ D

A B

Here the first row is the column name, the second header row lists the possible values with the spaces escaped, the third (empty) row is for the optional flags, and "A B" is a data row.

I am afraid that listing all the values in the header is really the only way to build and use reliable/reproducible models in Orange.
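
For illustration, a minimal .tab file with explicitly listed values could look like the following (hypothetical column names; in the real file the columns must be separated by tabs, the spacing below is only for alignment). The second header row lists every possible value of each discrete column, the third row flags the target, and both the training and the prediction file need an identical second header row:

color           size                price
red green blue  small medium large  continuous
                                    class
red             small               10.0
blue            large               12.5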

janezd commented 5 years ago

Supposedly fixed via #3925.