cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch

probabilistic clustering with GPs and pyro #1306

Open mihai-spire opened 3 years ago

mihai-spire commented 3 years ago

Hi there,

I have a dataset of weather measurements taken at roughly 5000 stations across the globe. It has the following (simplified) form:

Obs_Index   TEMPERATURE_FORECAST   PRESSURE_FORECAST   STATION_ID   TRUE_TEMP
1           20.3                   1000.45             AAAA         19.2 
2           15.2                   1023.23             BBBB         16.0
3           -5.2                   1020.13             CCCC         -6.0
4           15.2                   1023.23             BBBB         15.0
5           -2.2                   1013.3              CCCC         -3.0
6           11.2                   1017.5              AAAA         10.0
(...)

I'd like to build a GP model for the true temperature TRUE_TEMP from the temperature and pressure forecast data. Now, I suspect there is a natural cluster structure in the station data: some stations will be similar to others (e.g. stations that are near one another, at the same latitude, etc.). How can I use the gpytorch + pyro functionality to build these clusters probabilistically from the data?
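For reference, here's roughly how I'm turning the table above into tensors (the file name, DataFrame name and variable names are just placeholders for illustration; the station IDs get mapped to integer indices so they can serve as task labels later):

import pandas as pd
import torch

df = pd.read_csv("station_obs.csv")  # placeholder file with the columns shown above

# map each STATION_ID to an integer index (AAAA -> 0, BBBB -> 1, ...)
station_ids = sorted(df["STATION_ID"].unique())
station_to_idx = {s: i for i, s in enumerate(station_ids)}

train_x = torch.tensor(df[["TEMPERATURE_FORECAST", "PRESSURE_FORECAST"]].values,
                       dtype=torch.float)                                  # (N, 2) inputs
train_i = torch.tensor(df["STATION_ID"].map(station_to_idx).values,
                       dtype=torch.long).unsqueeze(-1)                     # (N, 1) station index
train_y = torch.tensor(df["TRUE_TEMP"].values, dtype=torch.float)          # (N,) targets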

I've read through the multitask GP clustering example carefully, but I don't think that really applies here: if I define each station as a "task" and learn, say, 10 clusters, then the tasks don't share any inputs so there's no learning done across tasks.

Alternatively, I could train 5000 independent GP models (one GP per station), but how would I do the clustering afterwards? Also, what happens if I bring in new data at a "new" station (i.e. one that wasn't in the original dataset)? I'd like my model to figure out automatically which cluster(s) those samples belong to.

Note that the sample counts are different for each station (there's some data imbalance too, as some stations have considerably more data samples than others).

Thanks in advance!

mihai-spire commented 3 years ago

Could the above fit the coregionalized multi-output GP paradigm described here?

https://docs.gpytorch.ai/en/latest/examples/04_Variational_and_Approximate_GPs/SVGP_Multitask_GP_Regression.html?highlight=coregionalization#Types-of-Variational-Multitask-Models

(and in https://gpflow.readthedocs.io/en/master/notebooks/advanced/coregionalisation.html)
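i.e. something like the LMC-style multitask SVGP from that gpytorch notebook, if I'm reading it right (rough sketch only; the class name, number of latent functions, inducing points and kernel choices are placeholders, not my actual setup):

import torch
import gpytorch

class MultitaskSVGPModel(gpytorch.models.ApproximateGP):
    def __init__(self, num_tasks, num_latents, inducing_points):
        # inducing_points: (num_latents, num_inducing, input_dim)
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(-2), batch_shape=torch.Size([num_latents])
        )
        # LMC mixes num_latents latent GPs into num_tasks correlated outputs
        variational_strategy = gpytorch.variational.LMCVariationalStrategy(
            gpytorch.variational.VariationalStrategy(
                self, inducing_points, variational_distribution, learn_inducing_locations=True
            ),
            num_tasks=num_tasks,
            num_latents=num_latents,
            latent_dim=-1,
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean(batch_shape=torch.Size([num_latents]))
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(batch_shape=torch.Size([num_latents])),
            batch_shape=torch.Size([num_latents]),
        )

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(self.mean_module(x), self.covar_module(x))

likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=5000)  # one output per station

But as far as I can tell this still assumes every task is observed at every input (train_y of shape (N, num_tasks)), which is exactly where my unequal per-station sample counts become a problem.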

jacobrgardner commented 3 years ago

This is funny: I think we have pretty much exactly this kind of model as one of our examples of how to do pyro integration.

Is this basically what you are looking for? https://github.com/cornellius-gp/gpytorch/blob/master/examples/07_Pyro_Integration/Clustered_Multitask_GP_Regression.ipynb

mihai-spire commented 3 years ago

@jacobrgardner thank you for your reply!

Yeah, I went through this example before, but I'm not sure it's exactly what I need for my problem...

I have N_s samples for each station s (s = AAAA, BBBB, ...), and the sample counts generally differ across stations (N_1 != N_2 != ...). I tried to define a clustered multitask model as in

https://github.com/cornellius-gp/gpytorch/blob/master/examples/07_Pyro_Integration/Clustered_Multitask_GP_Regression.ipynb

with S tasks (= number of stations). But the inputs are not the same for all tasks, which, as far as I understand, is a prerequisite for multitask GP methods, right? Also, my target vectors have a different length (N_s) for each station. How do I fit this into the clustered multitask GP? Doesn't it expect a 2D target tensor during training, with each column being the output of one task? At least that's what I understand from the model code:

class ClusterMultitaskGPModel(gpytorch.models.pyro.PyroGP):
    def __init__(self, train_x, train_y, num_functions=2, reparam=False):
        num_data = train_y.size(-2)  # <-- number of samples
        # (...)
        likelihood = ClusterGaussianLikelihood(train_y.size(-1), num_functions)  # <-- (num_tasks, num_clusters)
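For what it's worth, the other formulation I've been looking at is the "Hadamard" style multitask GP from the gpytorch docs, where the station index is just another input and an IndexKernel learns the inter-station covariance, so the per-station sample counts never need to match. Rough sketch below (the class and variable names are mine; an exact GP like this obviously won't scale to ~5000 stations, and I don't know how to combine it with the cluster likelihood, but it shows the data layout I have in mind):

import torch
import gpytorch

class HadamardMultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_i, train_y, likelihood, num_stations):
        # train_x: (N, 2) forecast features, train_i: (N, 1) integer station index, train_y: (N,)
        super().__init__((train_x, train_i), train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
        # low-rank inter-station covariance B B^T + diag(v)
        self.task_covar_module = gpytorch.kernels.IndexKernel(num_tasks=num_stations, rank=1)

    def forward(self, x, i):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)        # covariance from the forecast features
        covar_i = self.task_covar_module(i)   # covariance between stations
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x.mul(covar_i))

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = HadamardMultitaskGPModel(train_x, train_i, train_y, likelihood, num_stations=5000)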