Closed pomalley44 closed 5 years ago
Can you try also loading the `train_data` in the second process before loading the `predict_data`?
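Rough illustration of why that might help (a simplified sketch in plain Python, not Orange's actual internals): since Orange reuses feature descriptors, whichever file is loaded first can fix the value ordering that later loads reuse.

```python
# Illustrative sketch only -- the cache and function below are made up to
# mimic descriptor reuse, they are not Orange's real implementation.
_descriptor_cache = {}

def make_variable(name, values):
    """Return a cached descriptor for `name`, appending any unseen values."""
    var = _descriptor_cache.setdefault(name, {"name": name, "values": []})
    for v in values:
        if v not in var["values"]:
            var["values"].append(v)
    return var

# Process that loads train_data first: predict_data reuses its ordering.
train_var = make_variable("color", ["red", "green", "blue"])
predict_var = make_variable("color", ["blue", "red"])
assert predict_var["values"] == ["red", "green", "blue"]

# Fresh process that loads only predict_data: a different ordering.
_descriptor_cache.clear()
fresh_var = make_variable("color", ["blue", "red"])
assert fresh_var["values"] == ["blue", "red"]
```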
That worked!
Can you tell me why that worked? And should I have known based on documentation somewhere?
It's dead-nuts the same in Method A and Method B now. I didn't even call on the train_data; it's just sitting there. I spent 2 weeks reviewing source code and messing with the dataset, pickle, hyperparameters... And you solved it in 2 minutes.
I am reopening this issue because it is a bug. You should not need the training data for the unpickled classifier to work properly.
Orange reuses feature descriptors to make your test and train data compatible, so that you can build a classifier on train and then apply it to separate test data.
Feature descriptors therefore depend on how the data is loaded. In your case, I guess, Orange's internal representation of nominal features' values was different depending on how you opened the data. And then the classifier or its internal preprocessors did not handle the differences correctly.
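To make that concrete, here is a purely illustrative sketch (plain Python, not Orange code) of how a different value ordering silently corrupts predictions: nominal values are stored as indices into the value list, so a model trained under one ordering misreads rows encoded under another.

```python
# Hypothetical orderings for one nominal feature in two separate processes.
train_order = ["red", "green", "blue"]   # ordering seen at training time
fresh_order = ["blue", "red", "green"]   # ordering after a fresh load

def encode(value, order):
    # Nominal values are stored internally as indices into the value list.
    return order.index(value)

row = "red"
trained_index = encode(row, train_order)   # 0
fresh_index = encode(row, fresh_order)     # 1
# The model receives index 1 and treats it as train_order[1] == "green".
assert train_order[fresh_index] != row
```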
Can you see what happens if you prepare your data in a .tab format where you explicitly list possible values for both of your files? For both files, these need to be the same and also in the same order. See the adult.tab file that comes with Orange for an example.
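For reference, a minimal tab-delimited example of what that looks like (column names and values here are made up): the first header row gives the names, the second lists the possible values for each discrete feature (or the type, e.g. `continuous`), and the third carries flags such as `class`.

```
color	size	price
red green blue	small medium large	continuous
		class
red	small	12.5
blue	large	30.1
```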
I can work on that, but it won't be quick. There are a lot of discrete features in my dataset--about 150 which are binary and 20 which are categorical with ~10-100 possible values. It also wouldn't be a feasible long-term solution as the list of possible values changes over time.
I can confirm there are definitely differences between the feature sets of each file. The test data contains both binary features and possible categorical values that are not in the train data. It can also lack binary features that are present in the train data. These discrepant features are likely to be pretty far down the list in terms of importance, so I can live with the averaging that happens when the tree hits a node it can't make a decision on.
In setting up the .tab file, how do I handle possible values that have spaces? Can I encapsulate in quotes? Or do I need to rework the dataset to use underscores?
You might also try just switching the order in which the `model` and `predict_data` are loaded, i.e. load the `model` first and then the `predict_data` (without the `train_data`). That might also work.
> And then the classifier or its internal preprocessors did not handle the differences correctly.

But no classifier/preprocessor in Orange actually does that.
> In setting up the .tab file, how do I handle possible values that have spaces?
Spaces in values listed in the second header row must be escaped with a backslash.
E.g. for a column A with the two values 'A B' and 'C D':

```
A
A\ B C\ D
A B
```
I am afraid that listing all the values in the header is really the only way to build and use reliable/reproducible models in Orange.
Supposedly fixed via #3925.
Orange version
3.17.0
Expected behavior
Two methods below should yield the same predictions:
Method A:
1. Start new Python instance
2. Load training data to Orange table (`train_data`)
3. Create learner
4. Create model from learner trained on `train_data`
5. Pickle model
6. Load target data to Orange table (`predict_data`)
7. Unpickle model
8. Run predictions
Method B:
1. Start new Python instance
2. Load target data to Orange table (`predict_data`)
3. Unpickle model
4. Run predictions
Actual behavior
Unpickling the model in a new instance and running predictions gives different results than running predictions from the same instance where the model was trained.
Steps to reproduce the behavior
I have tried reproducing this error with built-in datasets, but I have not been successful doing so.
Additional info (worksheets, data, screenshots, ...)
I'm using random forest regression with `random_state=0, n_estimators=50, min_samples_split=2`. My training data set is ~15,000 rows with ~350 columns--a mixture of continuous features, binary features, and categorical features (that I'm relying on Orange to preprocess with one-hot encoding).

The results from Method A and Method B are significantly different. The predicted populations have different means and standard deviations. The predictions in Method B appear much "flatter", with the mean and standard deviation both much lower. However, there is still some correlation--items with a high prediction in Method A tend to have a high prediction in Method B (relative to the overall range).
I've included pickling/unpickling in Method A above for symmetry's sake. Leaving the model in memory yields the same results. I've also tried:
Both of the above experiments still yield the same prediction results in Method A.
My suspicion is that some sort of preprocessing is being saved in memory and not saved in the pickled model, but I don't know what that could be or how to figure it out.
I'm happy to try some different things if someone can point me in the right direction.