fabsig / GPBoost

Combining tree-boosting with Gaussian process and mixed effects models

How to specify "ind_effect_group_rand_coef" for multiple random slopes #33

Closed YuanyuanLi96 closed 3 years ago

YuanyuanLi96 commented 3 years ago

I want to fit random slopes using the same covariates as those used for the fixed effects (i.e., Z = X), but I don't know how to specify "ind_effect_group_rand_coef". I keep getting the error below:

ind_re_train = [i for i in range(1, full_X_train.shape[1] + 1)]
gp_model = gpb.GPModel(group_data=group_train, likelihood="bernoulli_probit",
                       group_rand_coef_data=full_X_train,
                       ind_effect_group_rand_coef=ind_re_train)

IndexError: list index out of range

fabsig commented 3 years ago

Thank you for your interest in GPBoost.

ind_effect_group_rand_coef is important for situations where there is more than one ("intercept") grouped random effect, i.e., more than one column in group_data. ind_effect_group_rand_coef contains, for every covariate in group_rand_coef_data, the index of the corresponding random effect (= column) in group_data. For instance, [1,1,2] means that the first two covariates (= first two columns) in group_rand_coef_data have random coefficients corresponding to the first random effect (= first column) in group_data, and the third covariate (= third column) in group_rand_coef_data has a random coefficient corresponding to the second random effect (= second column) in group_data.

In case you have only one (intercept) grouped random effect, ind_effect_group_rand_coef simply contains 1's, i.e., ind_effect_group_rand_coef = np.ones(full_X_train.shape[1]).
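To illustrate only the indexing logic described above (a minimal sketch in plain Python, not a GPBoost call; the setup and variable names are hypothetical):

```python
# Hypothetical setup: group_data has two columns (two intercept random effects)
# and group_rand_coef_data has three covariate columns.
ind_effect_group_rand_coef = [1, 1, 2]  # 1-based indices into the columns of group_data

# Which covariate columns (0-based) of group_rand_coef_data get a random
# coefficient for each random effect (column of group_data):
covariates_per_effect = {
    effect: [j for j, e in enumerate(ind_effect_group_rand_coef) if e == effect]
    for effect in sorted(set(ind_effect_group_rand_coef))
}
print(covariates_per_effect)  # {1: [0, 1], 2: [2]}
```

So covariates 1 and 2 are slopes for the first grouping variable, and covariate 3 is a slope for the second.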

Hope this helps. I will also update the corresponding documentation.

YuanyuanLi96 commented 3 years ago

I only have one column in group_data and 28 columns in group_rand_coef_data (= full_X_train). I tried using ind_effect_group_rand_coef = np.ones(full_X_train.shape[1]) in my GPModel call, but I still got the "list index out of range" error. Could you check whether anything might be wrong in my case? Sharing a similar example would also be very helpful!

Thanks for your work on this interesting topic!

fabsig commented 3 years ago

Can you double-check that full_X_train.shape[1] is indeed 28 and that you have no other dimension mismatches (e.g., number of data points)? Otherwise, please provide a minimal working example that reproduces the error.

Below is a working example adapted from here that runs without any problems.

import gpboost as gpb
import numpy as np

n = 1000  # number of samples
# Simulate single level grouped random effects data
m = 100  # number of categories / levels for grouping variable
np.random.seed(1)
group = np.arange(n)  # grouping variable
for i in range(m):
    group[int(i * n / m):int((i + 1) * n / m)] = i
b = 1 * np.random.normal(size=m)  # simulate random effects
eps = b[group]
X = np.column_stack(
    (np.ones(n), np.random.uniform(size=n)))  # design matrix / covariate data for fixed effect
beta = np.array([0, 3])  # regression coefficients
xi = np.sqrt(0.01) * np.random.normal(size=n)  # simulate error term
y = eps + xi + X.dot(beta) # observed data

gp_model = gpb.GPModel(group_data=group, group_rand_coef_data=X, ind_effect_group_rand_coef=np.ones(X.shape[1]))
gp_model.fit(y=y)
gp_model.summary()

YuanyuanLi96 commented 3 years ago

I got the error below after copying and running the above example. I installed the package using "!pip install gpboost". Could that be because I am using a different version from the one you are using?

GPBoostError: Check failed: 0 < ind_effect_group_rand_coef[j] && ind_effect_group_rand_coef[j] <= num_re_group_ at /home/whsigris/Dropbox/HSLU/Projects/MixedBoost/GPBoost/python-package/compile/include/GPBoost/re_model_template.h, line 231 .

fabsig commented 3 years ago

I don't think the version is the issue here. The reason might be that np.ones() returns doubles by default but, in the end, integers are needed. Can you try again with the code below (I added dtype=int to np.ones())? Maybe you can also double-check that np.ones(X.shape[1], dtype=int) is indeed array([1, 1]).
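As a quick illustration of the dtype difference (a minimal NumPy sketch, independent of GPBoost):

```python
import numpy as np

# np.ones defaults to float64; passing dtype=int yields an integer array.
a = np.ones(2)
b = np.ones(2, dtype=int)

print(a)  # [1. 1.]  (doubles)
print(b)  # [1 1]    (integers)
print(a.dtype.kind, b.dtype.kind)  # f i
```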

If this also does not work, another potential issue might be the way integers are handled in C++. Which operating system are you using? I could try to find a solution for this.

Many thanks for your help!

import gpboost as gpb
import numpy as np

n = 1000  # number of samples
# Simulate single level grouped random effects data
m = 100  # number of categories / levels for grouping variable
np.random.seed(1)
group = np.arange(n)  # grouping variable
for i in range(m):
    group[int(i * n / m):int((i + 1) * n / m)] = i
b = 1 * np.random.normal(size=m)  # simulate random effects
eps = b[group]
X = np.column_stack(
    (np.ones(n), np.random.uniform(size=n)))  # design matrix / covariate data for fixed effect
beta = np.array([0, 3])  # regression coefficients
xi = np.sqrt(0.01) * np.random.normal(size=n)  # simulate error term
y = eps + xi + X.dot(beta) # observed data

gp_model = gpb.GPModel(group_data=group, group_rand_coef_data=X, ind_effect_group_rand_coef=np.ones(X.shape[1], dtype=int))
gp_model.fit(y=y)
gp_model.summary()

YuanyuanLi96 commented 3 years ago

I copied and ran the code above, but got the same error again:

GPBoostError: Check failed: 0 < ind_effect_group_rand_coef[j] && ind_effect_group_rand_coef[j] <= num_re_group_ at /home/whsigris/Dropbox/HSLU/Projects/MixedBoost/GPBoost/python-package/compile/include/GPBoost/re_model_template.h, line 231 .

My computer runs macOS Big Sur version 11.4. Thanks for your support, and I hope this issue can be fixed!

fabsig commented 3 years ago

I can now reproduce this on macOS. There was a technical issue in the Python / C++ interaction due to the fact that integers can be represented in different formats. This should be fixed with the newly released version 0.6.4 (on PyPI).

Thank you for reporting this!

YuanyuanLi96 commented 3 years ago

Thanks for updating the package; I can run the above example successfully. However, when I modify the example to use likelihood = "bernoulli_probit", I get the "IndexError: list index out of range" again, which is the same error I got when running on my own data. Could you check whether anything is wrong in the example below?

import gpboost as gpb
import numpy as np
from scipy import stats

n = 1000  # number of samples
# Simulate single level grouped random effects data
m = 100  # number of categories / levels for grouping variable
np.random.seed(1)
group = np.arange(n)  # grouping variable
for i in range(m):
    group[int(i * n / m):int((i + 1) * n / m)] = i
b = 1 * np.random.normal(size=m)  # simulate random effects
eps = b[group]
X = np.column_stack(
    (np.ones(n), np.random.uniform(size=n)))  # design matrix / covariate data for fixed effect
beta = np.array([0, 3])  # regression coefficients
probs = stats.norm.cdf(X.dot(beta) + eps)
y = np.random.uniform(size=n) < probs
y = y.astype(np.float64)
gp_model = gpb.GPModel(group_data=group, likelihood = "bernoulli_probit", group_rand_coef_data=X, ind_effect_group_rand_coef=np.ones(X.shape[1], dtype=int))

fabsig commented 3 years ago

This should be fixed now with version 0.6.5.

YuanyuanLi96 commented 3 years ago

Thanks for your update, it works!