Open AndreaPi opened 2 years ago
You could try a standard multi-task GP model here, where the individual would be the “task”.
@Balandat thanks for the answer. I don't know about multitask GPs, but if I look at the paper, I'm not sure how I would go about modeling my problem. In the following, (x_ijk, y_ijk) denotes the k-th point of the j-th curve of the i-th group of curves. So for example (x_123, y_123) is the third point of the second curve of the first group of 4 curves. Basically, I have a dataset like this:
First group of 4 curves (individual #1):
(x_111, y_111), (x_112, y_112),...(x_11m, y_11m), (x_11m, y_11m)
(x_121, y_121), (x_122, y_122),...(x_12m, y_12m), (x_12m, y_12m)
(x_131, y_131), (x_132, y_132),...(x_13m, y_13m), (x_13m, y_13m)
(x_141, y_141), (x_142, y_142),...(x_14m, y_14m), (x_14m, y_14m)
Second group of 4 curves (individual #2):
(x_211, y_211), (x_212, y_212),...(x_21m, y_21m), (x_21m, y_21m)
(x_221, y_221), (x_222, y_222),...(x_22m, y_22m), (x_22m, y_22m)
(x_231, y_231), (x_232, y_232),...(x_23m, y_23m), (x_23m, y_23m)
(x_241, y_241), (x_242, y_242),...(x_24m, y_24m), (x_24m, y_24m)
. . .
N-th group of 4 curves (individual #N):
(x_N11, y_N11), (x_N12, y_N12),...(x_N1m, y_N1m), (x_N1m, y_N1m)
(x_N21, y_N21), (x_N22, y_N22),...(x_N2m, y_N2m), (x_N2m, y_N2m)
(x_N31, y_N31), (x_N32, y_N32),...(x_N3m, y_N3m), (x_N3m, y_N3m)
(x_N41, y_N41), (x_N42, y_N42),...(x_N4m, y_N4m), (x_N4m, y_N4m)
In other words, the x values are not the same, even for samples belonging to the same individual. How would I use multitask GPs to fit this dataset, and to make predictions on a new individual? Note that for simplicity, I didn't include the features which identify an individual. But if we want to be more precise, let Z be the vector of features that identify an individual as such. Then the dataset would be
(Z_1, x_111, y_111), (Z_1, x_112, y_112),...(Z_1, x_11m, y_11m), (Z_1, x_11m, y_11m)
(Z_1, x_121, y_121), (Z_1, x_122, y_122),...(Z_1, x_12m, y_12m), (Z_1, x_12m, y_12m)
(Z_1, x_131, y_131), (Z_1, x_132, y_132),...(Z_1, x_13m, y_13m), (Z_1, x_13m, y_13m)
(Z_1, x_141, y_141), (Z_1, x_142, y_142),...(Z_1, x_14m, y_14m), (Z_1, x_14m, y_14m)
.
.
.
(Z_N, x_N11, y_N11), (Z_N, x_N12, y_N12),...(Z_N, x_N1m, y_N1m), (Z_N, x_N1m, y_N1m)
(Z_N, x_N21, y_N21), (Z_N, x_N22, y_N22),...(Z_N, x_N2m, y_N2m), (Z_N, x_N2m, y_N2m)
(Z_N, x_N31, y_N31), (Z_N, x_N32, y_N32),...(Z_N, x_N3m, y_N3m), (Z_N, x_N3m, y_N3m)
(Z_N, x_N41, y_N41), (Z_N, x_N42, y_N42),...(Z_N, x_N4m, y_N4m), (Z_N, x_N4m, y_N4m)
Then, prediction with this model would mean feeding the model a new Z and obtaining as output the 4 curves for the new individual. Or, if this is too complex (since the model would also have to learn the "locations" of the xs for the 4 curves corresponding to the new individual), I could feed the model Z, the curve index j and the abscissas x_j1,...,x_jm, and get the y_j1,...,y_jm values back. How could I model this with multitask GPs?
So for a Hadamard-type multi-task model (rather than a Kronecker-type one) you don't need to have the observations of the different tasks at the same locations. In a sense there are two tasks here: the curve (c) and the individual (i). You could also use a kernel of the form K((x1, c1, i1), (x2, c2, i2)) = K_x(x1, x2) * K_c(c1, c2) * K_i(i1, i2), in which case K_c models the cross-curve correlation and K_i models the cross-individual correlation. I think the challenge you'll run into though with this is that if m is not small, this will result in large covariance matrices that don't have a particular structure, and so as a result this model will be very expensive to fit.
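To make the product kernel above concrete, here is a minimal NumPy sketch. It is not the GPyTorch API. It assumes an RBF kernel for K_x and a low-rank-plus-diagonal task covariance for K_c and K_i; both are illustrative choices, and the helper names (`rbf`, `index_kernel`, `low_rank_psd`, `product_kernel`) are hypothetical. The individual labels are zero-based here.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel K_x between two 1-D arrays of inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def index_kernel(idx_a, idx_b, B):
    """Look up task covariances B[idx_a, idx_b] for integer task labels."""
    return B[np.ix_(idx_a, idx_b)]

def low_rank_psd(n_tasks, rank, rng):
    """Assumed parameterization: B = W W^T + diag(v), which is always PSD."""
    W = rng.normal(size=(n_tasks, rank))
    v = rng.uniform(0.1, 1.0, size=n_tasks)
    return W @ W.T + np.diag(v)

def product_kernel(x1, c1, i1, x2, c2, i2, B_c, B_i, lengthscale=1.0):
    """K((x1,c1,i1),(x2,c2,i2)) = K_x(x1,x2) * K_c(c1,c2) * K_i(i1,i2)."""
    return (rbf(x1, x2, lengthscale)
            * index_kernel(c1, c2, B_c)
            * index_kernel(i1, i2, B_i))

rng = np.random.default_rng(0)
B_c = low_rank_psd(n_tasks=4, rank=2, rng=rng)  # cross-curve covariance
B_i = low_rank_psd(n_tasks=3, rank=2, rng=rng)  # cross-individual covariance

# Five observations at unrelated x locations: nothing requires the
# curves or individuals to share a grid.
x = np.array([0.10, 0.70, 0.35, 0.90, 0.50])
c = np.array([0, 0, 1, 3, 2])  # curve label of each observation
i = np.array([0, 0, 0, 1, 2])  # individual label of each observation

K = product_kernel(x, c, i, x, c, i, B_c, B_i)
print(K.shape)  # one row and column per observation, not per grid point
```

The covariance matrix has one row per observed point, so its size grows with the total number of points across all curves and individuals. That is the unstructured matrix the cost concern above refers to.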
@Balandat thanks for the answer!
c and i are categorical variables with support [0,1,2,3] and [1..50] respectively, right? m is not too large. It ranges between 7 and 13: unfortunately it's not always the same for different curves and individuals. I hoped it would be a constant, but after a closer examination of the full dataset, I found out it really isn't.
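Since m differs from curve to curve, one way to lay the data out for a Hadamard-type model is long format: one row per observed point, carrying its individual label, curve label, x and y. The sketch below uses synthetic data and a hypothetical `curves` dictionary, purely for illustration. With about 50 individuals, 4 curves each and 7 to 13 points per curve, the stacked arrays would have roughly 2000 rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ragged data (an assumption, not the real dataset):
# curves[(i, c)] = (x values, y values), with m between 7 and 13.
curves = {}
for i in range(3):          # 3 individuals here; 50 in the thread
    for c in range(4):      # 4 curves per individual
        m = int(rng.integers(7, 14))          # upper bound is exclusive
        xs = np.sort(rng.uniform(0.0, 1.0, size=m))
        ys = np.sin(2 * np.pi * xs + 0.3 * c) + 0.1 * i
        curves[(i, c)] = (xs, ys)

# Stack everything into long format: one entry per observed point.
i_idx, c_idx, x, y = [], [], [], []
for (i, c), (xs, ys) in curves.items():
    i_idx.append(np.full(len(xs), i))
    c_idx.append(np.full(len(xs), c))
    x.append(xs)
    y.append(ys)

i_idx = np.concatenate(i_idx)   # individual label per point
c_idx = np.concatenate(c_idx)   # curve label per point
x = np.concatenate(x)           # abscissa per point
y = np.concatenate(y)           # target per point

print(x.shape, y.shape, i_idx.shape, c_idx.shape)
```

In this layout a varying m needs no padding or special handling: shorter curves simply contribute fewer rows.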
Hi,
I have a small dataset (N ~ 2000) of samples (x, y) where x ∈ ℝ²⁰, with very little noise. The 2000 samples are not completely independent, in the sense that they can be divided into 50 groups of 4 curves, each group corresponding to a specific individual. For each individual, the 4 curves are pretty correlated. They look something like this, even though this is an artificial example and the correlation is greatly exaggerated for illustrative purposes:
One of the components of x is the x-coordinate in this plot, while another component of x denotes which of the 4 curves we're describing. Which would be the best way to model this kind of data, in your opinion? I could just consider the N samples as independent, and regress y over x, but I would be losing the specific sequence structure of these data. Do you have other suggestions? Thanks!