cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch
MIT License

[Question] What's the best way to model a set of curves with GPyTorch #1913

Open · AndreaPi opened this issue 2 years ago

AndreaPi commented 2 years ago

Hi,

I have a small dataset (N ~ 2000) of samples (x, y) where x ∈ ℝ²⁰, with very little noise. The 2000 samples are not completely independent: they can be divided into 50 groups of 4 curves, each group corresponding to a specific individual. For each individual, the 4 curves are pretty correlated. They look something like this, although this is an artificial example and the correlation is greatly exaggerated for illustrative purposes:

[image: example plot of 4 correlated curves for a single individual]

One of the components of x is the x-coordinate in this plot, while another component of x denotes which of the 4 curves we're describing. What would be the best way to model this kind of data, in your opinion? I could just treat the N samples as independent and regress y on x, but then I would lose the sequence structure of these data. Do you have other suggestions? Thanks!

Balandat commented 2 years ago

You could try a standard multi-task GP model here, where the individual would be the “task”.

AndreaPi commented 2 years ago

@Balandat thanks for the answer. I'm not familiar with multitask GPs, and looking at the paper, I'm not sure how I would go about modeling my problem. In the following, (x_ijk, y_ijk) denotes the k-th point of the j-th curve of the i-th group of curves. So, for example, (x_123, y_123) is the third point of the second curve of the first group of 4 curves. Basically, I have a dataset like this:

First group of 4 curves (individual #1):

(x_111, y_111), (x_112, y_112), ..., (x_11m, y_11m)
(x_121, y_121), (x_122, y_122), ..., (x_12m, y_12m)
(x_131, y_131), (x_132, y_132), ..., (x_13m, y_13m)
(x_141, y_141), (x_142, y_142), ..., (x_14m, y_14m)

Second group of 4 curves (individual #2):

(x_211, y_211), (x_212, y_212), ..., (x_21m, y_21m)
(x_221, y_221), (x_222, y_222), ..., (x_22m, y_22m)
(x_231, y_231), (x_232, y_232), ..., (x_23m, y_23m)
(x_241, y_241), (x_242, y_242), ..., (x_24m, y_24m)

... N-th group of 4 curves (individual #N):

(x_N11, y_N11), (x_N12, y_N12), ..., (x_N1m, y_N1m)
(x_N21, y_N21), (x_N22, y_N22), ..., (x_N2m, y_N2m)
(x_N31, y_N31), (x_N32, y_N32), ..., (x_N3m, y_N3m)
(x_N41, y_N41), (x_N42, y_N42), ..., (x_N4m, y_N4m)

In other words, the x values are not the same, even for samples belonging to the same individual. How would I use multitask GPs to fit this dataset, and to make predictions for a new individual? Note that for simplicity, I didn't include the features which identify an individual. To be more precise, let Z be the vector of features that identifies an individual as such. Then the dataset would be

(Z_1, x_111, y_111), (Z_1, x_112, y_112), ..., (Z_1, x_11m, y_11m)
(Z_1, x_121, y_121), (Z_1, x_122, y_122), ..., (Z_1, x_12m, y_12m)
(Z_1, x_131, y_131), (Z_1, x_132, y_132), ..., (Z_1, x_13m, y_13m)
(Z_1, x_141, y_141), (Z_1, x_142, y_142), ..., (Z_1, x_14m, y_14m)
.
.
.
(Z_N, x_N11, y_N11), (Z_N, x_N12, y_N12), ..., (Z_N, x_N1m, y_N1m)
(Z_N, x_N21, y_N21), (Z_N, x_N22, y_N22), ..., (Z_N, x_N2m, y_N2m)
(Z_N, x_N31, y_N31), (Z_N, x_N32, y_N32), ..., (Z_N, x_N3m, y_N3m)
(Z_N, x_N41, y_N41), (Z_N, x_N42, y_N42), ..., (Z_N, x_N4m, y_N4m)

Then, prediction with this model would mean feeding the model a new Z and obtaining as output the 4 curves for the new individual. Or, if this is too complex (since the model would also have to learn the "locations" of the x values for the 4 curves corresponding to the new individual), I could feed the model Z, the curve index j, and the abscissas x_j1, ..., x_jm, and get the y_j1, ..., y_jm values back. How could I model this with multitask GPs?
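
For concreteness, this is roughly how I picture flattening the data into arrays (the names and sizes below are made up, just to illustrate the layout):

```python
import torch

# Rough sketch of the flattened dataset; one row per observed point.
# All names and sizes here are made up for illustration.
n_points = 2000        # total number of points across all individuals and curves
d_z, d_x = 5, 20       # number of individual features Z and of per-point features x
Z = torch.rand(n_points, d_z)                    # features identifying the individual
x = torch.rand(n_points, d_x)                    # per-point inputs
curve_idx = torch.randint(0, 4, (n_points, 1))   # which of the 4 curves (j)
indiv_idx = torch.randint(0, 50, (n_points, 1))  # which individual (i)
y = torch.rand(n_points)                         # observed responses
```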

Balandat commented 2 years ago

So for a Hadamard-type multi-task model (rather than a Kronecker-type one), you don't need to have the observations of the different tasks at the same locations. In a sense, there are two tasks here: the curve (c) and the individual (i). You could also use a kernel of the form K((x1, c1, i1), (x2, c2, i2)) = K_x(x1, x2) * K_c(c1, c2) * K_i(i1, i2), in which case K_c models the cross-curve correlation and K_i models the cross-individual correlation. The challenge you'll run into with this, though, is that if m is not small, this results in large covariance matrices that don't have any particular structure, so the model will be very expensive to fit.
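
For concreteness, a minimal sketch of how such a product kernel could be composed in GPyTorch (the dimensions, ranks, and variable names below are just illustrative choices, not prescribed values):

```python
import torch
import gpytorch

# Illustrative composition of K((x,c,i), (x',c',i')) = K_x(x,x') * K_c(c,c') * K_i(i,i').
K_x = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel(ard_num_dims=20))  # continuous inputs
K_c = gpytorch.kernels.IndexKernel(num_tasks=4, rank=2)    # learned cross-curve covariance
K_i = gpytorch.kernels.IndexKernel(num_tasks=50, rank=5)   # learned cross-individual covariance

x = torch.rand(8, 20)               # continuous features
c = torch.randint(0, 4, (8, 1))     # curve index, as a long tensor
i = torch.randint(0, 50, (8, 1))    # individual index, as a long tensor
covar = K_x(x).mul(K_c(c)).mul(K_i(i))   # Hadamard (element-wise) product of the three factors
print(covar.shape)                       # torch.Size([8, 8])
```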

AndreaPi commented 2 years ago

@Balandat thanks for the answer!

  1. Is there an example of this Hadamard-type multi-task model?
  2. c and i are categorical variables with support {0, 1, 2, 3} and {1, ..., 50}, respectively, right?
  3. m is not too large: it ranges between 7 and 13. Unfortunately, it's not always the same for different curves and individuals. I had hoped it would be constant, but after a closer examination of the full dataset, I found out it really isn't.

Balandat commented 2 years ago
  1. https://github.com/cornellius-gp/gpytorch/blob/master/examples/03_Multitask_Exact_GPs/Hadamard_Multitask_GP_Regression.ipynb
  2. Yes
  3. You may be OK then. It's going to take a while to fit the model, but if you don't need it to be very fast, it could work. With the Hadamard setup you don't need to have the same data points for each individual/curve; see the example notebook for the interface.
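
To connect the notebook's interface to this particular dataset, here is a rough, hedged sketch with two index factors (curve and individual) instead of one; the class name, kernel ranks, and the synthetic stand-in data are illustrative assumptions, not part of the linked example:

```python
import torch
import gpytorch

# Illustrative stand-in data: ~2000 points, 20-d inputs, 4 curves, 50 individuals
n = 2000
train_x = torch.rand(n, 20)
train_c = torch.randint(0, 4, (n, 1))    # curve index
train_i = torch.randint(0, 50, (n, 1))   # individual index
train_y = torch.sin(6.28 * train_x[:, 0]) + 0.01 * torch.randn(n)

class HadamardCurveGP(gpytorch.models.ExactGP):
    """Exact GP with covariance K_x(x,x') * K_c(c,c') * K_i(i,i')."""
    def __init__(self, train_x, train_c, train_i, train_y, likelihood):
        super().__init__((train_x, train_c, train_i), train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_x = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(ard_num_dims=train_x.shape[-1])
        )
        self.covar_c = gpytorch.kernels.IndexKernel(num_tasks=4, rank=2)
        self.covar_i = gpytorch.kernels.IndexKernel(num_tasks=50, rank=5)

    def forward(self, x, c, i):
        mean = self.mean_module(x)
        # Hadamard (element-wise) product of the three covariance factors
        covar = self.covar_x(x).mul(self.covar_c(c)).mul(self.covar_i(i))
        return gpytorch.distributions.MultivariateNormal(mean, covar)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = HadamardCurveGP(train_x, train_c, train_i, train_y, likelihood)

# Standard exact-GP training loop, as in the example notebook
model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
for _ in range(50):
    optimizer.zero_grad()
    loss = -mll(model(train_x, train_c, train_i), train_y)
    loss.backward()
    optimizer.step()

# Predict a curve for a given (individual, curve) pair at new input locations
model.eval(); likelihood.eval()
test_x = torch.rand(10, 20)
test_c = torch.full((10, 1), 2, dtype=torch.long)   # e.g. curve #2
test_i = torch.full((10, 1), 7, dtype=torch.long)   # e.g. individual #7
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    pred = likelihood(model(test_x, test_c, test_i))
    mean = pred.mean
    lower, upper = pred.confidence_region()
```

Since all ~2000 points go into a single exact GP, fitting cost scales cubically in the total number of points, which is the unstructured-covariance expense mentioned above.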