biolab / orange3

🍊 📊 💡 Orange: Interactive data analysis
https://orangedatamining.com

Error on number of expected features when using Discrete Variables on Orange Library #4148

Closed: JackDC-12 closed this issue 4 years ago

JackDC-12 commented 4 years ago

Describe the bug: I use a single feature to classify a binary target variable with Orange.classification.LogisticRegressionLearner() in a Python program. If I instantiate the feature as Continuous, everything is fine; if I instantiate it as Discrete, a ValueError is raised:

ValueError: X has 1 features per sample; expecting 4

Note that 4 is the number of possible values of the Discrete variable. If I define just 2 values, the error says "expecting 2", and so on.

To Reproduce: I attached a zip with the Python program (just a small test) and the CSV file: test.zip

Orange version: 3.23

Expected behavior: The error should not appear, and the result should be consistent with the one given by the GUI.

Screenshots: [image: the error stack trace]

Operating system: Windows 10

Additional context: I guess that Orange calls the sklearn library directly for the logistic regression. In that case, how are the discrete variables handled?

ajdapretnar commented 4 years ago

You need to instantiate the model correctly.

Use this in your function:

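import numpy as np

def orange_learner_accuracy(x_train, x_test, y_train, y_test, learner):
    # get_table is the helper from the reporter's script (not shown here);
    # it is expected to return an Orange.data.Table
    table_train = get_table(x_train, y_train)
    table_test = get_table(x_test, y_test)
    # train the model with the learner
    classifier = learner(table_train)
    # use the model for prediction on test data (pass the Table, not table_test.X)
    prediction = classifier(table_test)
    # classification accuracy: fraction of correct predictions
    ca = np.sum(table_test.Y == prediction) / y_test.shape[0]
    return ca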

I have renamed some variables for clarity. You can also simplify your code a lot. There is a helper function for transforming pandas DataFrames to Orange.data.Table: from Orange.data.pandas_compat import table_from_frame.

You can also call the scores you need directly. Read more about this in the documentation: https://docs.biolab.si//3/data-mining-library/tutorial/classification.html#classification

JackDC-12 commented 4 years ago

Thank you for the quick reply! If I understood correctly, the catch is to call the classifier on table_test instead of table_test.X. After doing this, I received an error because the domains of table_train and table_test were not the same. Apparently, they have to be the exact same Domain instance; it is not sufficient to instantiate two equal domains. After this fix, everything worked as expected. Thank you for the helper from pandas to Table, I was not aware of that! Last question: how are discrete variables translated so that the logistic regression learner can handle them? Is there an automatic one-hot encoding?

Thanks a lot for your help! Giacomo
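On the one-hot question: the thread as quoted here does not answer it, but the error message is consistent with indicator (one-hot) encoding, since the model expected as many columns as the discrete variable has values. Orange does ship a Continuize preprocessor in Orange.preprocess for this kind of conversion; whether the learner applies it by default is an assumption here. The plain-Python sketch below only illustrates the idea and is not Orange's implementation; the one_hot helper and the sample values are hypothetical.

```python
def one_hot(value_index, n_values):
    """Return an indicator row with 1.0 at value_index and 0.0 elsewhere."""
    row = [0.0] * n_values
    row[value_index] = 1.0
    return row

# Hypothetical discrete variable with four possible values.
values = ["a", "b", "c", "d"]

# Raw column: one entry per sample, stored as an index into `values`.
raw_column = [0, 3, 1]

# Encoded data: one column per possible value.
encoded = [one_hot(index, len(values)) for index in raw_column]

n_raw_features = 1                    # what the raw array provides
n_encoded_features = len(encoded[0])  # what a model trained on encoded data expects
```

A model trained on the encoded data expects len(values) columns, so a raw single-column array does not match, which is the mismatch described in the original report.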