JuliaGaussianProcesses / TemporalGPs.jl

Fast inference for Gaussian processes in problems involving time. Partly built on results from https://proceedings.mlr.press/v161/tebbutt21a.html

Simple example w/ real time-series data? #74

Open · azev77 opened this issue 3 years ago

azev77 commented 3 years ago

Hi and thank you for this package! Is it possible to provide a simple example w/ real time-series data in the README?

For example, TSAnalysis.jl gives several real-data examples in its README.

willtebbutt commented 3 years ago

This sounds like a good idea. If you're interested in contributing, I'm happy to review a PR. I'll try and add an example or two myself over the next couple of days though.

azev77 commented 3 years ago

I'll be happy to add an example once I see an example of how to use this package.

I created the following simple dataset (US annual GDP growth) which I'd like to use w/ your package.

using HTTP, DataFrames, CSV
link = "https://bit.ly/3vtTaW5"
r = HTTP.get(link)
df = CSV.read(r.body, DataFrame, missingstring="NA")

Suppose I have GDP data from 1948-1968. How can I use TemporalGPs.jl to obtain a prediction density for GDP growth in 1969?

willtebbutt commented 3 years ago

Great. TemporalGPs implements the AbstractGPs.jl API, so using it is essentially the same as using AbstractGPs directly. We haven't documented that as well as we'd like, so here's a worked example. It's not a great model (you'd want to optimise the kernel if you were doing this seriously), but it demonstrates how to use the package:

# Load dataset.
using CSV
using DataFrames
using HTTP

link = "https://bit.ly/3vtTaW5";
r = HTTP.get(link);
df = CSV.read(r.body, DataFrame; missingstring="NA");

using AbstractGPs
using Plots
using TemporalGPs

# Specify training data. AbstractGPs doesn't explicitly support dates, although this is
# certainly something we could (probably should) add. It would be straightforward to do so.
x = RegularSpacing(0.0, 1.0, length(df.date));
y = df.value;

# Build a GP with a fixed kernel. In any serious setting you would want to learn the
# kernel parameters -- see the README for an example using Optim + Zygote, and the
# sketch at the end of this example.
# This probably won't be a very good model as-is -- it's the most basic thing you could do.
f = to_sde(GP(Matern52Kernel()), SArrayStorage(Float64));
f_post = posterior(f(x, 1e-3), y);

# Specify locations at which to make predictions. I've chosen to make predictions at
# all of the training data, and for 5 years into the future. If you just wanted a single year
# into the future, that would be fine too.
x_predict = RegularSpacing(0.0, 1.0, length(df.date) + 5);

# If you just want the posterior marginals, you would write something like
marginals(f_post(x_predict))

# If you want a sample from the posterior, you would write something like
rand(f_post(x_predict))

# We have a plotting recipe that will plot the marginals.
plot(f_post(x_predict))
scatter!(x, y)

# For the length of time series you've provided, it's probably not necessary to use
# TemporalGPs -- the standard approach to inference in a GP will do just fine.
# You can do this in almost exactly the same way as using TemporalGPs.
# The only thing that has changed is there's no need to call `to_sde`.
f = GP(Matern52Kernel());

# Everything else from above is the same.
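
A rough sketch of the kernel-learning step mentioned above, in the spirit of the README's Optim + Zygote example (the log-parameterisation, the starting values, and the helper name build_gp are illustrative here, not part of the package's API):

using AbstractGPs, TemporalGPs, Optim, Zygote

# Keep parameters positive via exp of unconstrained values:
# θ = (log variance, log inverse lengthscale, log noise variance).
build_gp(θ) = to_sde(
    GP(exp(θ[1]) * (Matern52Kernel() ∘ ScaleTransform(exp(θ[2])))),
    SArrayStorage(Float64),
)

# Negative log marginal likelihood of the training data from above.
nlml(θ) = -logpdf(build_gp(θ)(x, exp(θ[3])), y)

θ0 = log.([1.0, 1.0, 0.1])
result = optimize(nlml, θ -> only(Zygote.gradient(nlml, θ)), θ0, BFGS(); inplace = false)
f_learned = build_gp(Optim.minimizer(result))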

Let me know whether this is clear!

edit: RegularSpacing is explained in the README, but it's really just a Zygote-friendly simplified version of range. Hopefully it'll disappear at some point in the future in favour of just using a range.
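
For example (a quick illustrative check):

using TemporalGPs
x_rs = RegularSpacing(0.0, 1.0, 73);  # start, step, length
x_rs == range(0.0; step = 1.0, length = 73)  # true -- elementwise-equal to the range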

azev77 commented 3 years ago

Thanks! I got the following to work:

# Load dataset.
using HTTP, CSV, DataFrames, Dates
link = "https://bit.ly/3vtTaW5";
r = HTTP.get(link);
df = CSV.read(r.body, DataFrame; missingstring="NA");

# plot ts data 
using Plots
plot(year.(df.date), df.value, lab = "");
scatter!(year.(df.date), df.value, lab = "US Real GDP growth, pch1");
hline!([0], color = "red", lab = "")

[figure: time-series plot of US real GDP growth]

using AbstractGPs, TemporalGPs
# Specify training data.
x = RegularSpacing(0.0, 1.0, length(df.date));
y = df.value;

# Specify locations at which to make predictions.
x_predict = RegularSpacing(0.0, 1.0, length(df.date) + 5); 

# Build a GP with fixed kernel.
f_naive = GP(Matern52Kernel());
f = to_sde(f_naive, SArrayStorage(Float64));
f_post = posterior(f(x, 1e-3), y); # get posterior dist having observed y @ x.

marginals(f_post(x_predict)) # Posterior marginals 
rand(f_post(x_predict))      # Sample from posterior 
logpdf(f_post(x), y)         # 183.95 posterior log predictive prob of `y`

# Posterior marginals for out-of sample GDP
# Observe: bigger forecast horizon => bigger σ 
marginals(f_post(x_predict))[end-4:end]
""" 
5-element Vector{Distributions.Normal{Float64}}:
 Distributions.Normal{Float64}(μ=-2.5304528896537226, σ=0.8354470986323949)
 Distributions.Normal{Float64}(μ=-0.7158798705732248, σ=0.9888083046711964)
 Distributions.Normal{Float64}(μ=-0.14706868329661496, σ=0.9995439581936363)
 Distributions.Normal{Float64}(μ=-0.02571225001856308, σ=0.9999862705923213)
 Distributions.Normal{Float64}(μ=-0.004078352668396576, σ=0.9999996576590127)
"""

px = x_predict[end-4:end]
pm = mean.(marginals(f_post(x_predict))[end-4:end])
ps = std.(marginals(f_post(x_predict))[end-4:end])
plb = pm - 1.96*ps
pub = pm + 1.96*ps

# Plotting recipe
plot(f_post(x_predict), lab = "");
xticks!((0:8:72, string.(year.(df.date))[1:8:73]));
scatter!(x, y,  lab = "US GDP growth: observed");
scatter!(px, pm, lab = "US GDP growth: mean prediction");
#scatter!(px,plb,lab="") 
#scatter!(px,pub,lab="") 
hline!([0], color = "red", lab = "")
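
The commented-out bound lines could instead be drawn as a shaded band -- a small sketch using Plots' ribbon keyword:

plot!(px, pm; ribbon = 1.96 .* ps, lab = "95% interval")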

[figure: posterior marginals plotted over the observed series with 5-year-ahead mean predictions]

This naive model predicts it will take ~5 years for US GDP growth to reach 0% post the COVID recession (lol).

My background: I know a bit about forecasting & ML. I'm new to forecasting methods that return an entire distribution, and I know nothing about GPs (yet).

Some questions (me thinking out loud):

  1. is there a posterior joint distribution?
  2. is there a convenient way to automate out-of-sample cross-validation in the example above? E.g., with data beginning 1948, train on 20-year windows & forecast 5 years ahead:
     xtrain_1 = 1951-1970, predict dist 1971-1975, compute NLL of observed y in 1971-1975
     xtrain_2 = 1952-1971, predict dist 1972-1976, compute NLL of observed y in 1972-1976
     ...
     xtrain_30 = 1980-1999, predict dist 2000-2004, compute NLL of observed y in 2000-2004
  3. are only normals allowed for posterior marginals? (is that what you refer to in the On-going work part of readme?)
  4. A recent paper makes 1-year-ahead predictions for the distribution of GDP growth, with current GDP growth & NFCI as conditioning variables. Their data & Matlab code is here. Target variable: y_{t+1} = GDP_Growth_{t+1}; predictors: X_t = (y_t, NFCI_t) = (GDP_Growth_t, NFCI_t). Goal: predict the conditional distribution P(y_{t+1} | y_t, NFCI_t). [figure: the paper's estimated conditional distribution of GDP growth] The authors use quantile regression. Is it possible to produce this kind of figure w/ your package? Are GPs good at this type of thing relative to NGBoost.py?
willtebbutt commented 3 years ago

is there a posterior joint distribution?

Yes: it's a multivariate Gaussian for any (finite) collection of inputs, given by f_post(x_predict). f_post(x_predict) is an example of a FiniteGP (not sure it's the best name), and it's just a multivariate Normal distribution, so you can call rand on it. marginals is a convenience we added to the AbstractGPs API because the marginals are such a common thing to want. You can even ask for logpdf(f_post(x_predict), y_predict) to get the posterior log density of an entire collection of predictions.
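
Concretely (a small sketch reusing f_post and x_predict from the example above; y_sample is just a stand-in name):

fx = f_post(x_predict)   # the joint posterior: a single multivariate Normal
y_sample = rand(fx)      # one joint draw over all prediction inputs
ms = marginals(fx)       # per-input Normal marginals
logpdf(fx, y_sample)     # joint posterior log density of a full set of predictions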

is there a convenient way to automate out-of-sample-cross validation in the example above?

Hmm that's not something we've built in -- it's the kind of thing I would hope a generic time series analysis package would offer, and that we could hook into. Do you happen to know if this kind of functionality exists elsewhere in Julia's time series analysis ecosystem? It's not something I've looked into as much as I should have.
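
That said, rolling it by hand is only a few lines. A hedged sketch of the windowed evaluation you describe (the fixed kernel and noise level are carried over from the example above, and y is the full series):

window, horizon = 20, 5
nlls = Float64[]
for start in 1:(length(y) - window - horizon + 1)
    tr = start:(start + window - 1)                       # 20-year training window
    te = (start + window):(start + window + horizon - 1)  # next 5 years
    x_tr = RegularSpacing(Float64(first(tr)), 1.0, window)
    x_te = RegularSpacing(Float64(first(te)), 1.0, horizon)
    f = to_sde(GP(Matern52Kernel()), SArrayStorage(Float64))
    f_post = posterior(f(x_tr, 1e-3), y[tr])
    push!(nlls, -logpdf(f_post(x_te, 1e-3), y[te]))       # NLL of the held-out years
end
mean_nll = sum(nlls) / length(nlls)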

are only normals allowed for posterior marginals? (is that what you refer to in the On-going work part of readme?)

Yes and no. If you use the basic functionality (as shown above) then yes. But the things in this package (and the AbstractGPs ecosystem more generally) are best thought of as a toolkit that you can plug into other things -- for example, you can plug AbstractGPs into Turing and Soss to produce more complicated, non-Gaussian, models.
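
As a flavour of what that looks like, a hedged sketch of an AbstractGPs prior with a Poisson likelihood in a Turing model (illustrative only; x_train and y_counts are placeholder data):

using AbstractGPs, Turing

@model function gp_poisson(x, y)
    f = GP(Matern52Kernel())
    latent ~ f(x, 1e-6)                 # GP prior over the latent function values
    for i in eachindex(y)
        y[i] ~ Poisson(exp(latent[i]))  # non-Gaussian likelihood via a log link
    end
end

# chain = sample(gp_poisson(x_train, y_counts), NUTS(), 500)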

In addition to generic probabilistic programming tools, there are lots of GP-specific tools for handling non-Gaussian likelihoods. They'll hopefully appear in other packages within the JuliaGaussianProcesses org at some point, but I don't have a particular timeline for when they're likely to be available / mature unfortunately.

Is it possible to produce this kind of figure w/ your package?

I suspect that the best way to do this is to get the marginals and then plug them into a generic plotting tool that produces this kind of figure. I've forgotten the generic name for this kind of figure; I feel like I've seen them floating around the Julia ecosystem before, but I can't find an example now. Perhaps someone on Slack or Discourse would know where to find code to produce it? (I can't see an example in StatsPlots.jl.)
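
For example, since each marginal is a Normal, the quantiles fall straight out (a sketch; the quantile levels are arbitrary):

using Distributions, Plots

ms = marginals(f_post(x_predict))
qlevels = [0.05, 0.25, 0.5, 0.75, 0.95]
Q = [quantile(m, q) for m in ms, q in qlevels]  # one column per quantile level
plot(x_predict, Q; lab = ["5%" "25%" "50%" "75%" "95%"])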

Are GPs good at this type of thing rel to NGBoost.py?

It will really depend on the problem -- I've actually not seen NGBoost before, so thanks for pointing it out. Unfortunately, the best advice I can give is to try both out on your problem and see which wins.

azev77 commented 3 years ago

I've actually been pretty excited about the emerging lit on probabilistic forecasting. FYI, here are some of the options I previously stored:

Python: NGBoost, CatBoostLSS.
R: Gamlss, XGBoostLSS, GamboostLSS, bamlss, disttree.
XGBoostLSS plans a Julia implementation.

The NGBoost paper compares the performance of several methods in Table 1. [figure: Table 1 from the NGBoost paper]

I feel like probabilistic forecasting could work nicely in Julia. I wanna benchmark some of the Julia options as well. Let me know if you know of any Julia options for probabilistic forecasting.