Closed sacrathi1991 closed 3 years ago
Thank you for your interest in GPBoost!
Can you please provide a minimal working example including (synthetic) data to reproduce your issue(s)? Otherwise, it is almost impossible to diagnose what is going on.
Hello fabsig,
Thank you for your prompt response. I have attached my code and sample data (pep_data.xlsx) with this comment. Please let me know if you need any other information. Our objective is to predict Y (V_22) on the test data while accounting for the group effect of two columns, shop id and product id.
Attaching the code here:
```python
import gpboost as gpb
from scipy import stats
import pandas as pd

# Categorical variables: Shop_id, Product_id, V_8, V_9, V_10, V_11, V_12, V_13, V_15, V_16

train_data = pd.read_excel(path + '/pep_data.xlsx', sheet_name='Train_data')
test_data = pd.read_excel(path + '/pep_data.xlsx', sheet_name='Test_Data')

X = train_data.drop('V_22', axis=1)
y = train_data['V_22']
group = X[['Shop_id', 'Product_id']].copy()

likelihood = "bernoulli_logit"
params = {'learning_rate': 0.1, 'min_data_in_leaf': 20, 'objective': likelihood,
          'verbose': 0, 'monotone_constraints': [1, 0]}
num_boost_round = 25
if likelihood == "bernoulli_logit":
    num_boost_round = 50
if likelihood in ("bernoulli_probit", "bernoulli_logit"):
    params['objective'] = 'binary'

X.drop(group.columns.tolist(), axis=1, inplace=True)
data_train = gpb.Dataset(X, y)
gp_model = gpb.GPModel(group_data=group, likelihood=likelihood)
bst = gpb.train(params=params, train_set=data_train, gp_model=gp_model,
                num_boost_round=num_boost_round)
gp_model.summary()  # Trained random effects model (true variance = 0.5)

gp_model = gpb.GPModel(group_data=group, likelihood=likelihood)
bst = gpb.train(params=params, train_set=data_train, gp_model=gp_model,
                num_boost_round=num_boost_round, valid_sets=data_train)
```
In your code, you specify monotone constraints for two features, but your model has more than two features. This is the reason for the (correctly raised) error. I suggest you drop 'monotone_constraints': [1, 0] from your code.
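As a sketch, the parameter dict from your script would then look like this (keeping the 'binary' objective that your script substitutes in for the Bernoulli likelihoods):

```python
# Sketch: the same parameter dict with the mismatched
# 'monotone_constraints' entry removed.
params = {
    'learning_rate': 0.1,
    'min_data_in_leaf': 20,
    'objective': 'binary',  # substituted for the bernoulli likelihoods
    'verbose': 0,
}
# If you do want monotone constraints, the list needs one entry per
# feature column in X, e.g. [1, 0, 0, ...] with len == X.shape[1].
assert 'monotone_constraints' not in params
```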
It takes a few seconds to run your example (approx. 30 seconds on my laptop). Depending on the grouping / clustering structure, inference for non-Gaussian mixed effects models is computationally demanding, as matrices can become dense. I noticed that your first random effect ('Shop_id') has a rather small estimated variance. If you use only one random effect ('Product_id'), the problem becomes easier and the code runs much faster. Note that it might well be that the corresponding C++ code for the case of more than one grouped random effect for non-Gaussian data can be optimized. Suggestions are welcome.
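Switching from two grouped random effects to one is just a matter of passing a single-column DataFrame as group_data. A minimal sketch with toy grouping data (the gpboost calls themselves stay exactly as in the code above):

```python
import pandas as pd

# Toy stand-in for the real Shop_id / Product_id grouping columns.
group = pd.DataFrame({'Shop_id': [1, 1, 2, 2],
                      'Product_id': [10, 11, 10, 11]})

# Two crossed random effects: pass both columns.
group_both = group[['Shop_id', 'Product_id']]

# One random effect only: pass a single-column DataFrame
# (note the double brackets, which keep it 2-D).
group_single = group[['Product_id']]

# This would then be used as, e.g.:
#   gp_model = gpb.GPModel(group_data=group_single, likelihood="bernoulli_logit")
assert group_single.shape == (4, 1)
```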
Predictions can be obtained as shown here. If you want 0/1's instead of probabilities, just check which entries are above 0.5 (or you might also want to use another threshold...).
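Converting the predicted probabilities to 0/1 labels is a one-liner. A sketch with a made-up probability vector (with gpboost, these values would come from the trained model's predict output):

```python
import numpy as np

# Stand-in for the predicted response probabilities from the model.
pred_prob = np.array([0.12, 0.55, 0.49, 0.93])

threshold = 0.5  # default cut-off; adjust e.g. for class imbalance
pred_label = (pred_prob > threshold).astype(int)
print(pred_label)  # [0 1 0 1]
```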
I have sales data where I need to predict whether a product will be sold or not (binary classification). I use the shop id and product id variables in my dataframe as the grouping variables.
I am trying to apply the "classification_non_Gaussian_data" example to my problem statement. I now run into the issue described below.