GeoStat-Framework / PyKrige

Kriging Toolkit for Python
https://pykrige.readthedocs.io
BSD 3-Clause "New" or "Revised" License

option to downsample the data to calculate the variogram #29

Closed kvanlombeek closed 4 years ago

kvanlombeek commented 7 years ago

When you krige with large datasets (100k data points, for example), the number of pairwise distances blows up. It would be nice if there were a parameter in the function, e.g. max_n_for_variogram, that limits the number of data points used to calculate the variogram.

I have personally already implemented this in the ordinary kriging function as our dataset was too large.

rth commented 7 years ago

@kvanlombeek Would this be different functionality from the n_closest_points parameter in ordinary kriging? That was implemented in PR https://github.com/bsmurphy/PyKrige/pull/5, also to reduce the number of pairwise distances calculated (using cKDTree for the nearest-neighbor search)...

Edit: ok, so the goal is to use only a subset of the data for the variogram; I missed the description in the issue title.
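
For context, a minimal sketch of how n_closest_points is used at prediction time (synthetic data; as far as I remember the moving-window option needs the loop backend, so double-check the docs):

import numpy as np
from pykrige.ok import OrdinaryKriging

# synthetic scattered data standing in for a large dataset
rng = np.random.default_rng(0)
x, y = rng.uniform(0, 100, (2, 2000))
z = np.sin(x / 10.0) + np.cos(y / 10.0) + rng.normal(0, 0.1, x.size)

ok = OrdinaryKriging(x, y, z, variogram_model="spherical")

# prediction grid; only the 20 nearest observations are used per grid point
gx = np.linspace(0, 100, 50)
gy = np.linspace(0, 100, 50)
zhat, ss = ok.execute("grid", gx, gy, backend="loop", n_closest_points=20)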

basaks commented 7 years ago

I am keen to understand the opinion of the kriging community on this topic. As @rth pointed out, we can already use n_closest_points with OrdinaryKriging. Among other things, it makes prediction orders of magnitude faster for a large number of target points, and I think it is also more accurate, since we essentially want to capture local effects when kriging over a very large area (over many targets).

Downsampling, on the other hand, I would expect to diminish the kriging prediction confidence locally (the variance will increase). I cannot immediately see why you would want to do this compared to the n_closest_points approach.

kvanlombeek commented 7 years ago

I believe they are two different things.

I am talking about downsampling the data just to calculate the variogram, not to actually krige.

Calculating the variogram is a costly operation, as you have to calculate all the pairwise distances. You don't lose much information if you fit the variogram on only 1000 points, for example (fitting the variogram only means estimating three parameters: sill, nugget and range).
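
To give a rough sense of scale (a back-of-the-envelope sketch, assuming float64 arrays and the dense pairwise calculation):

# each pairwise array is n x n in float64 (8 bytes per entry)
n_full, n_sub = 100_000, 1_000
print(n_full ** 2 * 8 / 1e9, "GB per array for 100k points")  # ~80 GB
print(n_sub ** 2 * 8 / 1e6, "MB per array for 1k points")     # ~8 MB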

The n_closest_points option is useful when you do the actual interpolation. What it does is limit the number of surrounding points that are taken into account when you actually krige; by that time the variogram has already been calculated.

So by no means do I intend to drop values; on the contrary, the updates I have made locally are to make sure the kriging algorithm works with a large dataset so that I can capture all the local effects. (You can see our kriged map at www.rockestate.be.)

basaks commented 7 years ago

OK @kvanlombeek, it makes a bit more sense now. Just looking for some confirmation here: you are saying that this function, which is used to calculate the variogram model, will be cheaper to compute if we sample the observations. Specifically, you want to reduce the size of these arrays built in core.py, lines 124-126:

x1, x2 = np.meshgrid(x, x)
y1, y2 = np.meshgrid(y, y)
z1, z2 = np.meshgrid(z, z)

Sure, that will be fine. In that case, sampling x/y/z is equivalent to a stochastic approach in a large learning problem.

Did I get that right?
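
A minimal sketch of the kind of cap being discussed (the helper name and the max_n default are hypothetical, not part of PyKrige):

import numpy as np

rng = np.random.default_rng(42)
x, y = rng.uniform(0, 100, (2, 100_000))
z = np.sin(x / 10.0) + np.cos(y / 10.0) + rng.normal(0, 0.1, x.size)

def subsample_for_variogram(x, y, z, max_n=1000, seed=0):
    """Hypothetical helper: cap the number of points used for the experimental variogram."""
    if x.size <= max_n:
        return x, y, z
    idx = np.random.default_rng(seed).choice(x.size, size=max_n, replace=False)
    return x[idx], y[idx], z[idx]

xs, ys, zs = subsample_for_variogram(x, y, z, max_n=1000)

# the meshgrid-based pairwise arrays then stay at max_n x max_n
x1, x2 = np.meshgrid(xs, xs)
y1, y2 = np.meshgrid(ys, ys)
z1, z2 = np.meshgrid(zs, zs)
d = np.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)  # pairwise distances
g = 0.5 * (z1 - z2) ** 2                      # semivariance cloud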

bsmurphy commented 7 years ago

Yes, I see how this can be a problem. Down-sampling would certainly be useful, but I would worry that a (random) sample might not be representative of the entire dataset (assuming stationarity of the entire dataset). Maybe we could randomly sub-sample the data multiple times, estimate the variogram each time, and then compare the resulting variograms to figure out the most representative model (a bit like the RANSAC algorithm). Maybe there's an approach from the machine-learning literature that could be implemented here to accomplish this and keep the computations efficient. @basaks, you seem familiar with this kind of machine-learning analysis; what do you think?
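
A rough sketch of that idea (assuming the fitted parameters are exposed as variogram_model_parameters, as in current PyKrige; the median/spread summary is just one possible way to pick a representative model):

import numpy as np
from pykrige.ok import OrdinaryKriging

def variogram_from_subsamples(x, y, z, n_sub=1000, n_repeats=5, seed=0):
    """Fit the variogram on several random subsets and summarise the fits."""
    rng = np.random.default_rng(seed)
    fits = []
    for _ in range(n_repeats):
        idx = rng.choice(x.size, size=min(n_sub, x.size), replace=False)
        ok = OrdinaryKriging(x[idx], y[idx], z[idx], variogram_model="spherical")
        fits.append(ok.variogram_model_parameters)  # [sill, range, nugget]
    fits = np.asarray(fits)
    # median fit as the "representative" model; the spread hints at whether
    # n_sub is large enough for a stable estimate
    return np.median(fits, axis=0), fits.std(axis=0)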

basaks commented 7 years ago

@bsmurphy @kvanlombeek When there are a lot of observations, downsampling may provide a very useful performance boost without losing much accuracy. However, I don't think it needs to be built into this package.

One can manually downsample outside PyKrige and then use PyKrige's functionality. I use shapely, pandas, etc. for sampling shapefiles quite often :)

As for your concerns about downsampling: one can bin the observations (both spatially and by magnitude) and then sample from each bin to get a representative set for kriging; see the sketch below. I do very similar operations myself. I think this is very problem-specific and should be left to the user.
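
A sketch of the spatial part of that binning idea (the grid size and per-cell count are arbitrary; binning by magnitude could be added the same way):

import numpy as np

def stratified_spatial_sample(x, y, z, n_bins=20, per_cell=5, seed=0):
    """Split the bounding box into an n_bins x n_bins grid and draw up to
    per_cell observations from every occupied cell."""
    rng = np.random.default_rng(seed)
    ix = np.digitize(x, np.linspace(x.min(), x.max(), n_bins + 1)[1:-1])
    iy = np.digitize(y, np.linspace(y.min(), y.max(), n_bins + 1)[1:-1])
    cell = ix * n_bins + iy
    keep = []
    for c in np.unique(cell):
        members = np.flatnonzero(cell == c)
        keep.extend(rng.choice(members, size=min(per_cell, members.size), replace=False))
    keep = np.asarray(keep)
    return x[keep], y[keep], z[keep]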

One can also go the other way, as is sometimes done in machine learning, i.e. create artificial observations when there are not enough of them.

@kvanlombeek, what do you think? I would, however, be very interested to know how you downsample your observations.

kvanlombeek commented 7 years ago

I think we are talking past each other a bit, but your post from yesterday, @basaks, describes exactly what I wanted to do.

Can I ask all of you how much data you have fitted this algorithm on? 100, 1,000, 100,000 or 1,000,000 points? Not to show off or anything, just to see whether it worked. I fit the algorithm on around 50k to 200k data points, and that is where I needed these little tweaks.

I propose that I give it a shot, open a PR, and we continue the discussion based on that?

basaks commented 7 years ago

Hi @kvanlombeek, I have one dataset with ~2.2 million observations across all of Australia. For many machine learning models I use ~50k, sometimes up to 200k, observations.

rth commented 7 years ago

@kvanlombeek It's great that you are using PyKrige with large datasets and can help with the scaling issues. However, I tend to agree with @basaks and @bsmurphy on this one: there are potentially a lot of ways this could be done, and it can also be problem-specific. Can you not downsample the input X, Y, Z arrays used to train the variogram before providing them to OrdinaryKriging, in whatever way you need?

I'm +1 on adding some additional function for downsampling that would make sense in the kriging context (if such a thing exists), and -1 on adding it inside the OrdinaryKriging classes, which already have quite a few parameters IMO... unless I'm misunderstanding the issue.

basaks commented 7 years ago

Instead of downsampling, I think what would be more useful, and seems to be a technique already used in the kriging community, is a local variogram model. Can someone work on this instead? It would speed up computation many-fold for large models.

VESPER is a PC-Windows program developed by the Australian Centre for Precision Agriculture (ACPA) for spatial prediction that is capable of performing kriging with local variograms (Haas, 1990). Kriging with local variograms involves searching for the closest neighbourhood for each prediction site, estimating the variogram from the neighbourhood, fitting a variogram model to the data and predicting the value and its uncertainty. The local variogram is modelled in the program by fitting a variogram model automatically through the nonlinear least-squares method. Several variogram models are available, namely spherical, exponential, Gaussian and linear with sill. Punctual and block kriging are available as interpolation options. This program adapts itself spatially in the presence of distinct differences in local structure over the whole field.
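
A conceptual sketch of that workflow using the existing PyKrige pieces (the neighbourhood size and variogram model are arbitrary choices, and this naive loop refits a model per prediction site, so it is slow):

import numpy as np
from scipy.spatial import cKDTree
from pykrige.ok import OrdinaryKriging

def local_variogram_kriging(x, y, z, xi, yi, n_neighbors=100):
    """For each prediction site: find its neighbourhood, fit a variogram to
    that neighbourhood only, and krige from those points."""
    tree = cKDTree(np.column_stack([x, y]))
    zhat = np.empty(xi.size)
    ss = np.empty(xi.size)
    for i, (px, py) in enumerate(zip(xi, yi)):
        _, idx = tree.query([px, py], k=n_neighbors)
        ok = OrdinaryKriging(x[idx], y[idx], z[idx], variogram_model="spherical")
        zi, si = ok.execute("points", np.array([px]), np.array([py]))
        zhat[i], ss[i] = zi[0], si[0]
    return zhat, ss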

bsmurphy commented 7 years ago

@rth added the moving-window functionality, which is similar to what you're suggesting, @basaks, except that it assumes a stationary variogram. Adding the extra layer of re-estimating the variogram for local neighborhoods could certainly be done at some level, but it would require a calculation for each moving-window location, so I'm not sure how much it would really boost speed... Anyway, both of these ideas would be nice to implement at some point: local variogram estimation for maximum flexibility, and downsampling for global variogram estimation as a useful tool (it could go in the kriging tools module)...

rth commented 7 years ago

@basaks Feel free to open a new issue with the local variogram model..

rth commented 6 years ago

> When there are a lot of observations, downsampling may provide a very useful performance boost without losing much accuracy. However, I don't think it needs to be built into this package.

Actually, looking into this anew: as I understand it, the issue is that the current API does not allow calculating the variogram on a random subset of the data and then running all the remaining calculations on the full dataset. I don't think this would be too difficult to implement...

bsmurphy commented 6 years ago

True, it shouldn't be difficult at all to do something really simple like taking a random subset (maybe a couple of times) and computing the variogram estimate. This could be a useful addition when refactoring everything into distinct building blocks as per #56...

MuellerSeb commented 4 years ago

In GSTools we provide a sample size for variogram estimation. Maybe we could outsource the variogram estimation.
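
For reference, a sketch of what that looks like on the GSTools side (the function and argument names, vario_estimate with sampling_size/sampling_seed, are from memory and may differ between GSTools versions, so check the GSTools docs):

import numpy as np
import gstools as gs

rng = np.random.default_rng(0)
x, y = rng.uniform(0, 100, (2, 100_000))
field = np.sin(x / 10.0) + np.cos(y / 10.0) + rng.normal(0, 0.1, x.size)

bins = np.linspace(0, 40, 21)
# sampling_size caps how many points enter the pairwise calculation
bin_center, gamma = gs.vario_estimate((x, y), field, bins, sampling_size=2000, sampling_seed=42)

# fit a covariance model to the estimated variogram
model = gs.Spherical(dim=2)
model.fit_variogram(bin_center, gamma)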

MuellerSeb commented 4 years ago

Since we will use the variogram estimation routines of GSTools in the future, we will discuss things like this in #136. Closing for now. Feel free to re-open or (better) continue the discussion in the linked issue.