SheffieldML / GPy

Gaussian processes framework in python
BSD 3-Clause "New" or "Revised" License

How do I include "distances" from matrix #629

Open Erhanjinn opened 6 years ago

Erhanjinn commented 6 years ago

Hello,

I would like to use GPy to perform a classical regression task using a RBF kernel. However the "distance" between two datapoints takes very long to compute in my case (I am using a web API to pull it from a third party service).

Therefore I got an idea to precompute these distances and save them to a matrix.

I am using a set of 10,000 data points, so I now have a 10k × 10k symmetric matrix of pairwise distances between the points.

What I would like to know is how can I include such information in your module?

Could you please give me any tips?

Thank you very much in advance!

smcveigh-phunware commented 6 years ago

Finally a question I can answer.

However the "distance" between two datapoints takes very long to compute in my case (I am using a web API to pull it from a third party service)

😨 Never do this; computation time for GPs is already bad enough: O(N^3) for the full process and O(M^2 * N), with M < N, for a sparse process.

Therefore I got an idea to precompute these distances and save them to a matrix

Just save the data points offline and define your distance metric.

The standard definition of "distance" for RBF (and the other kernels) is euclidean.
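To make that concrete, here is a minimal numpy sketch of the squared-exponential form that the RBF kernel uses, built from plain euclidean distances (the function name and defaults here are illustrative, not GPy's API):

```python
import numpy as np

def rbf_kernel(X, X2=None, variance=1.0, lengthscale=1.0):
    """k(x, x') = variance * exp(-r^2 / 2), with r = ||x - x'|| / lengthscale."""
    if X2 is None:
        X2 = X
    # pairwise squared euclidean distances via broadcasting
    sq = ((X[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    r2 = sq / lengthscale ** 2
    return variance * np.exp(-0.5 * r2)
```

Any notion of "distance" you substitute for the euclidean one ends up in `r` in exactly the same way, which is why precomputing it is at least plausible.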

Erhanjinn commented 6 years ago

Thank you for your quick answer.

Unfortunately, I can’t do the things you proposed. I have the data points saved locally; what I am pulling from the web is the car-route distance between two GPS points. Without constructing the map myself, I will not be able to compute this. I have already tried using just the euclidean distance, but I would like to use the “real” distance to improve my regression results (or at least I hope it will).

Maybe I could use an offline routing app, but I think that would still be very slow.

Of course, I know GPs are awfully slow; that’s why I am asking how to speed things up :-) For all my 10k points I have already pulled the pairwise distances from the third-party server (now stored in a matrix), and now I would like to pass them through the GPy machinery.

If this is not possible in any way, I think I will construct the large covariance matrix K myself and optimize the hyperparameters the hard way - for example, trying a vector of values for each hyperparameter and taking the best model according to cross-validation.
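One variant of that "hard way" scores each hyperparameter combination by the GP log marginal likelihood computed directly from the hand-built K, which avoids refitting per cross-validation fold. A rough numpy sketch, with a toy euclidean distance matrix standing in for the precomputed route distances (all names and the grid values are illustrative):

```python
import numpy as np

def gp_loglik(K, y, noise):
    """Log marginal likelihood of a zero-mean GP with covariance K and noise variance."""
    n = len(y)
    L = np.linalg.cholesky(K + noise * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()
            - 0.5 * n * np.log(2 * np.pi))

# D stands in for the precomputed (n x n) route-distance matrix
rng = np.random.default_rng(0)
pts = rng.random((30, 2))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
y = rng.standard_normal(30)

# grid-search variance, lengthscale, and noise by marginal likelihood
best = max(
    ((v, l, s) for v in (0.5, 1.0, 2.0) for l in (0.1, 0.5, 1.0) for s in (0.01, 0.1)),
    key=lambda p: gp_loglik(p[0] * np.exp(-0.5 * (D / p[1]) ** 2), y, p[2]),
)
best_variance, best_lengthscale, best_noise = best
```

Cross-validation on held-out predictions works too; the marginal-likelihood route is just cheaper per candidate.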

smcveigh-phunware commented 6 years ago

I see your dilemma; I suggest looking at gpy.kern.Stationary.K, from which RBF inherits. Inside you'll see self._scaled_dist(X, X2). As a quick solution, you could initialize your subclass of a stationary (i.e. RBF) kernel from a pandas DataFrame containing your offline distance calculations, keyed by a lookup of (X1, X2); if the lookup fails, query the URL and store the new result internally so you don't have to look it up again. If you want to get fancy, you could parallelize the URL queries across rows of X1, X2 with asyncio. As a note, make sure the distances you query are normalized by the maximum route distance (e.g. a 'D_norm' column if the other columns are 'X1', 'X2', 'D'). You'll always need the ability to query the distance between 'X1' and 'X2' outside of the training data for the prediction of new Y.
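As a standalone sketch of the caching piece of that suggestion (pure Python with a dict instead of a DataFrame; `RouteDistanceCache` and `fetch_route_distance` are hypothetical names, not GPy code):

```python
import numpy as np

class RouteDistanceCache:
    """Memoized lookup of pairwise route distances, normalized by a maximum
    distance. `fetch_route_distance(a, b)` stands in for the web-API call."""

    def __init__(self, fetch_route_distance, max_distance):
        self._fetch = fetch_route_distance
        self._max = max_distance
        self._cache = {}

    def distance(self, a, b):
        key = (min(a, b), max(a, b))  # route distances assumed symmetric
        if key not in self._cache:
            # cache miss: query once, store the normalized result
            self._cache[key] = self._fetch(*key) / self._max
        return self._cache[key]

    def matrix(self, idx1, idx2=None):
        """Dense normalized distance matrix for the given point indices."""
        idx2 = idx1 if idx2 is None else idx2
        return np.array([[self.distance(i, j) for j in idx2] for i in idx1])
```

In a subclass of the Stationary kernel, an overridden `_scaled_dist(X, X2)` would then return something like `cache.matrix(...)` divided by the lengthscale; the exact wiring depends on how you index your points.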

You could also consider using an offline library in the same place using open street map data.

Erhanjinn commented 6 years ago

Ok, thank you for the hints.

I will take a look at the proposed solution. That means I would have to delve much deeper than my current knowledge of GPy.

Thanks again :-)

lionfish0 commented 6 years ago

It's also worth thinking about whether the covariance matrix based on such distances will remain positive definite.
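A quick way to check this on a hand-built covariance matrix, before handing it to the GP, is to attempt a Cholesky factorization (a small sketch; the helper name and `jitter` argument are assumptions, not GPy API):

```python
import numpy as np

def is_positive_definite(K, jitter=0.0):
    """True if K (plus optional diagonal jitter) admits a Cholesky factorization."""
    try:
        np.linalg.cholesky(K + jitter * np.eye(len(K)))
        return True
    except np.linalg.LinAlgError:
        return False

# An RBF kernel built from euclidean distances is guaranteed positive definite;
# one built from arbitrary route distances carries no such guarantee.
X = np.random.default_rng(1).random((5, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
assert is_positive_definite(np.exp(-0.5 * D ** 2))
```

If the check fails for the route-distance kernel, a common workaround is adding a small jitter to the diagonal, though that only masks mild violations.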


tommylees112 commented 5 years ago

Dear @smcveigh-phunware I was reading this thread and saw your reply:

If you're working with geocoordinates, you'll need to look into map projections. This is not as bad as it seems; I can provide more info if this is your use case.

I am working with geocoordinates from a netcdf (.nc) file. I want to calculate the distance matrix for my lat/lon grid.

Here is the data: one_timestep_lst.nc.zip

$ ncdump -h one_timestep_lst.nc

netcdf one_timestep_lst {
dimensions:
    lon = 600 ;
    lat = 600 ;
variables:
    float lon(lon) ;
        lon:_FillValue = NaNf ;
        lon:standard_name = "longitude" ;
        lon:long_name = "longitude coordinate" ;
        lon:units = "degrees_east" ;
        lon:axis = "X" ;
    float lat(lat) ;
        lat:_FillValue = NaNf ;
        lat:standard_name = "latitude" ;
        lat:long_name = "latitude coordinate" ;
        lat:units = "degrees_north" ;
        lat:axis = "Y" ;
    double time ;
        time:_FillValue = NaN ;
        time:standard_name = "time" ;
        time:axis = "T" ;
        time:units = "days since 2000-01-01" ;
        time:calendar = "standard" ;
    short lst_day(lat, lon) ;
        lst_day:_FillValue = 30000s ;
        lst_day:long_name = "land_surface_temperature day" ;
        lst_day:units = "degC" ;
        lst_day:_fillvalue = 20000s ;
        lst_day:coordinates = "time" ;
        lst_day:add_offset = 0.f ;
        lst_day:scale_factor = 0.01f ;
}

I am trying to calculate the distance between the points, and there is definitely a smarter way of doing this than looping through each lat/lon pair.

import numpy as np
import xarray as xr
from scipy.spatial import distance_matrix

da = xr.open_dataarray("one_timestep_lst.nc")

# create a meshgrid of lons/lats and flatten to an (n_points, 2) array
xx, yy = np.meshgrid(da.lon.values, da.lat.values)
positions = np.vstack([xx.ravel(), yy.ravel()]).T
print(positions[:5, ])

# full pairwise euclidean distance matrix (in degrees!)
dist_mat = distance_matrix(positions, positions)

This is super inefficient because the grid is evenly spaced and symmetric, so it kills my computer. Also, I'm pretty certain the distance measure isn't accurate, because I haven't taken any projection into account when calculating the distances.
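For great-circle rather than euclidean-in-degrees distances, a fully vectorized haversine computation sidesteps both the loop and the projection question, at least for spherical distances (a sketch; `haversine_matrix` is not part of any of the libraries above). Note, though, that for a 600 × 600 grid the full pairwise matrix has 360,000² entries (roughly 1 TB in float64), so in practice you would compute it blockwise or exploit the regular grid structure rather than materialize the whole thing:

```python
import numpy as np

def haversine_matrix(lon, lat, radius_km=6371.0):
    """Pairwise great-circle distances (km) between points given in degrees.
    lon and lat are 1-D arrays of equal length (the flattened grid)."""
    lam = np.radians(lon)[:, None]
    phi = np.radians(lat)[:, None]
    dlam = lam - lam.T
    dphi = phi - phi.T
    # haversine formula, clipped for numerical safety before arcsin
    a = np.sin(dphi / 2) ** 2 + np.cos(phi) * np.cos(phi.T) * np.sin(dlam / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
```

For example, `haversine_matrix(positions[:, 0], positions[:, 1])` on a subset of the flattened grid gives distances in kilometres instead of degrees.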

Thanks for your help!

Tommy