mihai-spire opened this issue 4 years ago
Hi there,

I have a dataset of weather measurements taken at various stations (ca. 5000 or so) across the globe. It has the following (simplified) form: one row per sample, with a station identifier (AAAA, BBBB, ...), the forecast temperature, the forecast pressure, and the measured true temperature TRUE_TEMP.

I'd like to build a GP model for the true temperature TRUE_TEMP from the temperature and pressure forecast data. Now, I suspect there is a natural cluster structure in the station data - some stations will be similar to others (e.g. stations that are near one another, at the same latitude, etc.). How can I use the gpytorch + pyro functionality to build these clusters probabilistically from the data?

I've read through the multitask GP clustering example carefully, but I don't think it really applies here: if I define each station as a "task" and learn, say, 10 clusters, then the tasks don't share any inputs, so there's no learning done across tasks.

Conversely, I could train 5000 independent GP models (one GP per station), but how would I do the clustering afterwards? Also, what happens if I bring in new data at a "new" station (i.e. one that wasn't in the original dataset)? I'd like my model to figure out automatically which cluster(s) those samples belong to.

Note that the sample counts differ from station to station (there's some data imbalance too, as some stations have considerably more samples than others).

Thanks in advance!

Could the above fit the coregionalized multi-output GP paradigm as described here (and in https://gpflow.readthedocs.io/en/master/notebooks/advanced/coregionalisation.html)?
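For concreteness, the "simplified form" mentioned above is presumably a long-format table along these lines; the column names and values below are invented purely for illustration:

import pandas as pd

# Hypothetical layout: one row per measurement, tagged by station.
df = pd.DataFrame({
    "STATION":   ["AAAA", "AAAA", "BBBB", "CCCC"],   # station identifier
    "FCST_TEMP": [271.3, 272.1, 290.4, 265.0],       # forecast temperature
    "FCST_PRES": [1013.2, 1009.8, 1002.5, 998.1],    # forecast pressure
    "TRUE_TEMP": [270.9, 272.6, 291.0, 264.2],       # measured (true) temperature
})

Each station contributes a different number of rows, which is the imbalance noted above.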
This is funny - I think we have almost exactly this kind of model as one of our examples of how to do Pyro integration.
Is this basically what you are looking for? https://github.com/cornellius-gp/gpytorch/blob/master/examples/07_Pyro_Integration/Clustered_Multitask_GP_Regression.ipynb
@jacobrgardner thank you for your reply!
Yeah, I went through this example before - I'm not sure it's exactly what I need for my problem...
I have N_s samples for each station s (s = AAAA, BBBB, ...), with N_1 != N_2 != ... != N_S. I tried to define a clustered multitask model like the one in that example, with S tasks (= the number of stations). But the input is not the same for all tasks - which, as far as I understand, is a prerequisite for multitask GP methods, right? Also, my target vectors are of different lengths (N_s per station). How do I fit this into the clustered multitask GP? Doesn't it expect a 2D target tensor during training, with each column being the output of one task? At least that's what I understand from the model code:
class ClusterMultitaskGPModel(gpytorch.models.pyro.PyroGP):
    def __init__(self, train_x, train_y, num_functions=2, reparam=False):
        num_data = train_y.size(-2)  # number of samples
        # (...)
        likelihood = ClusterGaussianLikelihood(train_y.size(-1), num_functions)  # (num_tasks, num_clusters)
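One way around the shared-input requirement is the Hadamard-style multitask setup from the GPyTorch examples (a different model from the clustered Pyro one above): stack all stations into one long dataset and pass a per-row station index, so the targets stay 1D and unequal N_s is not a problem. Below is a minimal sketch along the lines of that example, with an IndexKernel learning the inter-station covariance; the variable names, sizes, and rank are invented for illustration:

import torch
import gpytorch

class HadamardMultitaskGP(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_i, train_y, likelihood, num_tasks):
        super().__init__((train_x, train_i), train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.RBFKernel()
        # IndexKernel learns a low-rank task covariance B B^T + diag(v);
        # similar stations end up with strongly correlated entries.
        self.task_covar_module = gpytorch.kernels.IndexKernel(num_tasks=num_tasks, rank=2)

    def forward(self, x, i):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)       # covariance over inputs
        covar_i = self.task_covar_module(i)  # covariance over station indices
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x.mul(covar_i))

# Long-format training data: stations may contribute different numbers of rows.
n_a, n_b = 120, 35                                   # unequal sample counts (invented)
train_x = torch.randn(n_a + n_b, 2)                  # e.g. (forecast temp, forecast pressure)
train_i = torch.cat([torch.zeros(n_a, 1), torch.ones(n_b, 1)]).long()  # station index per row
train_y = torch.randn(n_a + n_b)                     # true temperature per row

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = HadamardMultitaskGP(train_x, train_i, train_y, likelihood, num_tasks=2)

The learned task covariance couples the stations, so information is shared across tasks even though their inputs differ. For ~5000 stations an exact GP over the stacked data would be expensive, so a sparse/variational variant would likely be needed, but the data layout is the same.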