fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

How to implement this with GPBoost on CPG data with binary classification #29

Closed sacrathi1991 closed 3 years ago

sacrathi1991 commented 3 years ago

I have sales data where I need to predict whether a product will be sold or not (binary classification). I have used my shop ID and product ID columns in the dataframe as the grouping variables.

I am trying to apply the "classification_non_Gaussian_data" example to my problem. I am now running into the issues described below.

  1. Check failed: (static_cast(traindata->num_total_features())) == (config->monotone_constraints.size()) at /home/whsigris/Dropbox/HSLU/Projects/MixedBoost/GPBoost/python-package/compile/src/LightGBM/boosting/gbdt.cpp, line 55.
  2. After working around the first issue by assigning a monotone constraint to some feature and passing it as a parameter (I don't know whether this is the right way to fix it or not), the model now takes very long to run (maybe due to the large amount of data).
  3. How do I get yes/no predictions once the model is trained? I could not find this in the example code you shared on GitHub.
fabsig commented 3 years ago

Thank you for your interest in GPBoost!

Can you please provide a minimal working example including (synthetic) data to reproduce your issue(s)? Otherwise, it is almost impossible to diagnose what is going on.

sacrathi1991 commented 3 years ago

Hello fabsig,

Thank you for your prompt response. I have attached my code and sample data (pep_data.xlsx) with this comment. Please let me know if you need any other information. Our objective is to predict Y (V_22) on test data, taking into account the group effect of two columns, i.e. shop ID and product ID. Attaching the code here:

```python
import gpboost as gpb
from scipy import stats
import pandas as pd

# The columns below are categorical, but we have label encoded them:
# categorical variables = [Shop_id, Product_id, V_8, V_9, V_10, V_11, V_12, V_13, V_15, V_16]
train_data = pd.read_excel(path + '/pep_data.xlsx', sheet_name='Train_data')
test_data = pd.read_excel(path + '/pep_data.xlsx', sheet_name='Test_Data')

# V_22 is our dependent variable
X = train_data.drop('V_22', axis=1)
y = train_data['V_22']

# We want to measure the random effect for the two columns below (hierarchically nested ones)
group = X[['Shop_id', 'Product_id']].copy()

likelihood = "bernoulli_logit"
# Note: the tuning parameters are by no means optimal for all situations considered here
params = {'learning_rate': 0.1, 'min_data_in_leaf': 20, 'objective': likelihood,
          'verbose': 0, 'monotone_constraints': [1, 0]}
num_boost_round = 25
if likelihood == "bernoulli_logit":
    num_boost_round = 50
if likelihood in ("bernoulli_probit", "bernoulli_logit"):
    params['objective'] = 'binary'

# Drop the group data columns from the data set
X.drop(group.columns.tolist(), axis=1, inplace=True)

# Create dataset for gpb.train
data_train = gpb.Dataset(X, y)

# Train model
gp_model = gpb.GPModel(group_data=group, likelihood=likelihood)
# Use the option "trace": True to monitor convergence of hyperparameter estimation
# of the gp_model, e.g.:
# gp_model.set_optim_params(params={"trace": True})
bst = gpb.train(params=params, train_set=data_train, gp_model=gp_model,
                num_boost_round=num_boost_round)
gp_model.summary()  # Trained random effects model

# Showing training loss
gp_model = gpb.GPModel(group_data=group, likelihood=likelihood)
bst = gpb.train(params=params, train_set=data_train, gp_model=gp_model,
                num_boost_round=num_boost_round, valid_sets=data_train)
```

fabsig commented 3 years ago

In your code, you assign monotone constraints for two features, but you have more than two features. This is the reason for the correctly raised error. I suggest you drop 'monotone_constraints': [1,0] from your code.
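For illustration, a minimal sketch of the adjusted parameters, reusing the variable names from the code above (if monotone constraints are actually wanted, the list needs one entry per feature column remaining in X):

```python
# params as before, just without the 'monotone_constraints' entry
params = {'learning_rate': 0.1, 'min_data_in_leaf': 20,
          'objective': likelihood, 'verbose': 0}

# Alternatively, keep monotone constraints but provide one value (-1, 0, or 1)
# per feature column of X, e.g. no constraint on any feature:
# params['monotone_constraints'] = [0] * X.shape[1]
```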

It takes a few seconds to run your example (approx. 30 seconds on my laptop). Depending on the grouping / clustering structure, inference for non-Gaussian mixed effects models is computationally demanding as matrices might become dense. I noticed that your first random effect ('Shop_id') has an estimated variance which is quite small. If you use only one random effect ('Product_id'), the problem becomes easier and the code runs much faster. Note that it might well be that the corresponding C++ code for the case of more than one grouped random effect for non-Gaussian data can be optimized. Suggestions are welcome.
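A minimal sketch of what using only a single grouped random effect could look like here, again reusing the variable names from the code above:

```python
# Use only 'Product_id' as the grouping variable for the random effect
group_single = train_data[['Product_id']].copy()

gp_model = gpb.GPModel(group_data=group_single, likelihood=likelihood)
bst = gpb.train(params=params, train_set=data_train, gp_model=gp_model,
                num_boost_round=num_boost_round)
gp_model.summary()
```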

Predictions can be obtained as shown here. If you want 0/1's instead of probabilities, just check which entry is above 0.5 (or you might also want to use another threshold...).
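For example, a sketch of how this could look for the data above; note that the exact keyword arguments and keys of the returned dictionary can differ between GPBoost versions, and 'response_mean' is assumed here to hold the response-scale predictions (i.e. predicted probabilities), as in the prediction examples:

```python
# Prepare test features and grouping data in the same way as for training
X_test = test_data.drop('V_22', axis=1, errors='ignore')  # drop the target if present
group_test = X_test[['Shop_id', 'Product_id']].copy()
X_test.drop(group_test.columns.tolist(), axis=1, inplace=True)

# Predict on the response scale; 'response_mean' holds the predicted
# probabilities P(Y = 1) (key names may vary between versions)
pred = bst.predict(data=X_test, group_data_pred=group_test)
pred_prob = pred['response_mean']
pred_label = (pred_prob > 0.5).astype(int)  # 0/1 labels using a 0.5 threshold
```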