fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

How to implement this with GPBoost on CPG data with binary classification #29

Closed sacrathi1991 closed 3 years ago

sacrathi1991 commented 3 years ago

I have sales data where I need to predict whether a product will be sold or not (binary classification). I have used my shop ID and product ID columns in the dataframe as the grouping variables.

I am trying to apply the "classification_non_Gaussian_data" example to my problem. I am now running into the issues described below.

  1. Check failed: (static_cast(traindata->num_total_features())) == (config->monotone_constraints.size()) at /home/whsigris/Dropbox/HSLU/Projects/MixedBoost/GPBoost/python-package/compile/src/LightGBM/boosting/gbdt.cpp, line 55.
  2. After working around the first issue by assigning a monotone constraint to some feature and passing it as a parameter (I don't know whether this is the right way to fix it or not), the model now takes very long to run (maybe due to the large amount of data).
  3. How do I get yes/no predictions once the model is trained? I could not find this in the example code you shared on GitHub.
fabsig commented 3 years ago

Thank you for your interest in GPBoost!

Can you please provide a minimal working example including (synthetic) data to reproduce your issue(s)? Otherwise, it is almost impossible to diagnose what is going on.

sacrathi1991 commented 3 years ago

Hello fabsig,

Thank you for your prompt response. I have attached my code and sample data (pep_data.xlsx) with this comment. Please let me know if you need any other information. Our objective is to predict Y (V_22) on test data, taking into account the group effect of two columns, i.e. shop ID and product ID. Attaching the code here:

```python
import gpboost as gpb
from scipy import stats
import pandas as pd

# The columns below are categorical, but we have label encoded them:
# categorical variables = [Shop_id, Product_id, V_8, V_9, V_10, V_11, V_12, V_13, V_15, V_16]
train_data = pd.read_excel(path + '/pep_data.xlsx', sheet_name='Train_data')
test_data = pd.read_excel(path + '/pep_data.xlsx', sheet_name='Test_Data')

# V_22 is our dependent variable
X = train_data.drop('V_22', axis=1)
y = train_data['V_22']

# We want to measure the random effect for the two columns below (hierarchically nested ones)
group = X[['Shop_id', 'Product_id']].copy()

likelihood = "bernoulli_logit"
# Note: the tuning parameters are by no means optimal for all situations considered here
params = {'learning_rate': 0.1, 'min_data_in_leaf': 20, 'objective': likelihood,
          'verbose': 0, 'monotone_constraints': [1, 0]}
num_boost_round = 25
if likelihood == "bernoulli_logit":
    num_boost_round = 50
if likelihood in ("bernoulli_probit", "bernoulli_logit"):
    params['objective'] = 'binary'

# Drop the group data columns from the data set
X.drop(group.columns.tolist(), axis=1, inplace=True)

# Create dataset for gpb.train
data_train = gpb.Dataset(X, y)

# Train model
gp_model = gpb.GPModel(group_data=group, likelihood=likelihood)
# Use the option "trace": True to monitor convergence of hyperparameter estimation
# of the gp_model, e.g.:
# gp_model.set_optim_params(params={"trace": True})
bst = gpb.train(params=params, train_set=data_train, gp_model=gp_model,
                num_boost_round=num_boost_round)
gp_model.summary()  # Trained random effects model

# Showing training loss
gp_model = gpb.GPModel(group_data=group, likelihood=likelihood)
bst = gpb.train(params=params, train_set=data_train, gp_model=gp_model,
                num_boost_round=num_boost_round, valid_sets=data_train)
```

fabsig commented 3 years ago

In your code, you assign monotone constraints for two features, but you have more than two features. This is the reason for the correctly raised error. I suggest you drop 'monotone_constraints': [1,0] from your code.
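For illustration, a minimal sketch of the adjusted parameters, reusing the variable names from the code above (if monotone constraints are actually wanted, the list needs one entry per feature column remaining in X):

```python
# params as before, just without the 'monotone_constraints' entry
params = {'learning_rate': 0.1, 'min_data_in_leaf': 20,
          'objective': likelihood, 'verbose': 0}

# Alternatively, keep monotone constraints but provide one value (-1, 0, or 1)
# per feature column of X, e.g. no constraint on any feature:
# params['monotone_constraints'] = [0] * X.shape[1]
```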

It takes a few seconds to run your example (approx. 30 seconds on my laptop). Depending on the grouping / clustering structure, inference for non-Gaussian mixed effects models is computationally demanding as matrices might become dense. I noticed that your first random effect ('Shop_id') has an estimated variance which is quite small. If you use only one random effect ('Product_id'), the problem becomes easier and the code runs much faster. Note that it might well be that the corresponding C++ code for the case of more than one grouped random effect for non-Gaussian data can be optimized. Suggestions are welcome.
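A minimal sketch of what using only a single grouped random effect could look like here, again reusing the variable names from the code above:

```python
# Use only 'Product_id' as the grouping variable for the random effect
group_single = train_data[['Product_id']].copy()

gp_model = gpb.GPModel(group_data=group_single, likelihood=likelihood)
bst = gpb.train(params=params, train_set=data_train, gp_model=gp_model,
                num_boost_round=num_boost_round)
gp_model.summary()
```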

Predictions can be obtained as shown here. If you want 0/1's instead of probabilities, just check which entry is above 0.5 (or you might also want to use another threshold...).
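For example, a sketch of how this could look for the data above; note that the exact keyword arguments and keys of the returned dictionary can differ between GPBoost versions, and 'response_mean' is assumed here to hold the response-scale predictions (i.e. predicted probabilities), as in the prediction examples:

```python
# Prepare test features and grouping data in the same way as for training
X_test = test_data.drop('V_22', axis=1, errors='ignore')  # drop the target if present
group_test = X_test[['Shop_id', 'Product_id']].copy()
X_test.drop(group_test.columns.tolist(), axis=1, inplace=True)

# Predict on the response scale; 'response_mean' holds the predicted
# probabilities P(Y = 1) (key names may vary between versions)
pred = bst.predict(data=X_test, group_data_pred=group_test)
pred_prob = pred['response_mean']
pred_label = (pred_prob > 0.5).astype(int)  # 0/1 labels using a 0.5 threshold
```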