fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models
Other
565 stars 46 forks source link

claasification #87

Closed MehdiBejani closed 1 year ago

MehdiBejani commented 1 year ago

Hello,

Thanks for a great library!

In my project, we recorded eye movement data from 78 individuals, including both controls and patients. We performed 12 different experiments for each person, varying in amplitude, speed, and direction of the target. We extracted 9 features for each recorded signal.

I organized the data into a Pandas DataFrame with 936 rows and 11 columns. The columns are the 9 features, "SPT" (12 different experiments), and "Group" (0 for patients and 1 for controls).

To classify the data using GPBoost, I used the binary classification example provided in the GPBoost GitHub repository (https://github.com/fabsig/GPBoost/blob/master/examples/python-guide/GPBoost_algorithm.py). However, I encountered an error message that says "Incorrect dimension / number of coordinates (=features) in gp_coords_pred". In the simulated data provided in the example, X_test.shape and group_test.shape have different dimensions (i.e., (188, 9) and (500,), respectively).

Could you please help me understand this error and how to resolve it?

import gpboost as gpb
import numpy as np
import sklearn.datasets as datasets
import time
import pandas as pd
from sklearn.model_selection import train_test_split
# Load and preprocess the data
data = pd.read_csv('my_SPT_n')
data.drop("Unnamed: 0", axis=1, inplace=True)
data.drop("Subject", axis=1, inplace=True)
feature_cols = ['feature{}'.format(i) for i in range(10, 71)]
data.drop(feature_cols, axis=1, inplace=True)
data
feature1 feature2 feature3 feature4 feature5 feature6 feature7 feature8 feature9 SPT Group
0 -0.928589 1.614378 1.592074 3.577868 -1.568586 0.530546 0.159122 -1.350469 0.980319 1 1
1 -0.987841 1.069118 1.594379 1.218660 -1.570159 -0.811812 0.693844 -1.000191 0.154145 2 1
2 -1.214750 2.555126 2.074804 2.451309 -1.980481 -0.748817 0.832139 -1.295691 0.591061 3 1
3 -1.201121 1.789987 2.064943 4.233934 -2.007766 0.707415 1.212625 -0.567496 1.044106 4 1
4 -0.084317 1.840200 1.531970 -0.419116 -1.513437 0.613676 0.325077 -0.062547 0.495090 5 1
... ... ... ... ... ... ... ... ... ... ... ...
931 -1.442049 -0.614871 -0.265545 0.027806 0.268085 0.868126 0.036741 -0.014148 0.145427 8 1
932 -1.107084 1.231327 0.613200 -0.890687 -0.644320 0.971961 0.289511 -0.250498 -1.173546 9 1
933 -1.205876 0.828192 0.720568 -0.905005 -0.760680 -0.731711 0.545878 -3.186216 -1.438385 10 1
934 -1.133650 1.321292 0.441928 0.434512 -0.509987 0.864393 0.399253 -2.074334 -0.144804 11 1
935 -1.144700 0.581571 0.492954 -1.176657 -0.519337 -0.606305 0.549326 1.116318 -2.087925 12 1

936 rows × 11 columns

X = data.iloc[:,:9];
y= data.iloc[:,-1];
group=data["SPT"].to_numpy()
likelihood = "bernoulli_probit" # "bernoulli" (=classification)

"""
Combine tree-boosting and grouped random effects model
"""

X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(X, y, group, test_size=0.2, random_state=42)

# Specify boosting parameters as dict
params = {'objective': likelihood, 'learning_rate': 0.01, 'max_depth': 3, 'verbose': 0}
num_boost_round = 250
if likelihood == "gaussian":
    num_boost_round = 50
    params['objective'] = 'regression_l2'
if likelihood in ("bernoulli_probit", "bernoulli_logit"):
    num_boost_round = 500
    params['objective'] = 'binary'

#--------------------Training----------------
# Define random effects model
gp_model = gpb.GPModel(group_data=group_train, likelihood=likelihood)

# Define training data
train_data = gpb.Dataset(X_train, y_train)

# Train model
bst = gpb.train(params=params,
                train_set=train_data,
                gp_model=gp_model,
                num_boost_round=num_boost_round)
gp_model.summary() # Estimated random effects model

#--------------------Prediction----------------
# Predict on test data
y_pred = bst.predict(data=X_test, gp_coords_pred=group_test, 
                        predict_var=True, pred_latent=False)

# Evaluate model
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred > 0.5))
=====================================================
Covariance parameters (random effects):
         Param.
Group_1  0.0001
=====================================================

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

Cell In [8], line 35
     31 gp_model.summary() # Estimated random effects model
     33 #--------------------Prediction----------------
     34 # Predict on test data
---> 35 y_pred = bst.predict(data=X_test, gp_coords_pred=group_test, 
     36                         predict_var=True, pred_latent=False)
     38 # Evaluate model
     39 from sklearn.metrics import classification_report

File c:\users\mehdi\appdata\local\programs\python\python39\lib\site-packages\gpboost\basic.py:3596, in Booster.predict(self, data, start_iteration, num_iteration, pred_latent, pred_leaf, pred_contrib, data_has_header, is_reshape, group_data_pred, group_rand_coef_data_pred, gp_coords_pred, gp_rand_coef_data_pred, cluster_ids_pred, vecchia_pred_type, num_neighbors_pred, predict_cov_mat, predict_var, cov_pars, ignore_gp_model, raw_score, **kwargs)
   3594     random_effect_mean = random_effect_pred['mu']
   3595 else:  # predict response variable (not pred_latent)
-> 3596     pred_resp = self.gp_model.predict(group_data_pred=group_data_pred,
   3597                                       group_rand_coef_data_pred=group_rand_coef_data_pred,
   3598                                       gp_coords_pred=gp_coords_pred,
   3599                                       gp_rand_coef_data_pred=gp_rand_coef_data_pred,
   3600                                       cluster_ids_pred=cluster_ids_pred,
   3601                                       vecchia_pred_type=vecchia_pred_type,
   3602                                       num_neighbors_pred=num_neighbors_pred,
   3603                                       predict_cov_mat=predict_cov_mat,
   3604                                       predict_var=predict_var,
   3605                                       cov_pars=cov_pars,
   3606                                       predict_response=True,
   3607                                       fixed_effects=fixed_effect_train,
   3608                                       fixed_effects_pred=fixed_effect,
   3609                                       y=y)
   3610     response_mean = pred_resp['mu']
   3611     response_var = pred_resp['var']

File c:\users\mehdi\appdata\local\programs\python\python39\lib\site-packages\gpboost\basic.py:5221, in GPModel.predict(self, y, group_data_pred, group_rand_coef_data_pred, gp_coords_pred, gp_rand_coef_data_pred, vecchia_pred_type, num_neighbors_pred, cg_delta_conv_pred, cluster_ids_pred, predict_cov_mat, predict_var, cov_pars, X_pred, use_saved_data, predict_response, fixed_effects, fixed_effects_pred)
   5219         raise ValueError("Incorrect number of data points in gp_coords_pred")
   5220 if gp_coords_pred.shape[1] != self.dim_coords:
-> 5221     raise ValueError("Incorrect dimension / number of coordinates (=features) in gp_coords_pred")
   5222 gp_coords_pred_c, _, _ = c_float_array(gp_coords_pred.flatten(order='F'))
   5223 # Set data for GP random coefficients

ValueError: Incorrect dimension / number of coordinates (=features) in gp_coords_pred
fabsig commented 1 year ago

Thanks a lot for using GPBoost!

You need to use the argument group_data_pred and not gp_coords_pred in the predict() function.

Note that X_test and group_test need to have the same number of rows (= number of prediction points). The same holds true for X_train and group_train.

MehdiBejani commented 1 year ago

Thanks a lot for your prompt response.

I used GPBoost for the classification of clustered data and obtained the following summary result for random effects:

=====================================================
Covariance parameters (random effects):
         Param.
Group_1  0.0001
=====================================================

Based on this result, can we conclude that our data has no random effect, and that mixed effect classification is not suitable for our work? I would appreciate any insights or suggestions for further analysis.

fabsig commented 1 year ago

Yes, maybe. But hard to conclusively tell given the information provided. Note that GitHub issues are for software issues. I hope you do understand that I do not have the time to do counseling regarding statistical analysis.