I second this request. The option `n_closest_points` is available for ordinary kriging (ok.py), but not for universal kriging. It would be useful to be able to specify a cutoff distance for both ordinary and universal kriging, as in gstat.
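For context, a minimal sketch of how the existing option is used with ordinary kriging; the synthetic data and parameter values are made up for illustration, and `backend='loop'` follows the comments later in this thread:

```python
import numpy as np
from pykrige.ok import OrdinaryKriging

# Synthetic sample data (hypothetical values, for illustration only)
x = np.random.uniform(0.0, 10.0, 100)
y = np.random.uniform(0.0, 10.0, 100)
z = np.sin(x) + np.cos(y)

ok = OrdinaryKriging(x, y, z, variogram_model="spherical")

gridx = np.linspace(0.0, 10.0, 50)
gridy = np.linspace(0.0, 10.0, 50)

# Moving-window kriging: each estimate uses only the 10 nearest
# observations. At the time of this thread this was only wired up
# for ordinary kriging (ok.py), not universal kriging (uk.py).
zhat, ss = ok.execute("grid", gridx, gridy, backend="loop", n_closest_points=10)
```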
> One feature that I haven't found is the ability to specify the variogram cutoff, i.e. the distance up to which the variogram is calculated.
I believe that was discussed in issue https://github.com/bsmurphy/PyKrige/issues/41; @basaks might know more about this.
The option "n_closest_points" is available for ordinary kriging (ok.py), but not for universal kriging. It would be useful to be able to specify a cutoff distance for both ordinary kriging and universal kriging.
This should be possible once step 4 of the refactoring in issue https://github.com/bsmurphy/PyKrige/issues/31 is done. Unfortunately I am not able to work on this issue at the moment, and so unless somebody takes it over, I'm not sure when it would be done. In this particular case a simpler solution could be to add to uk.py the code relevant to `n_closest_points` (mostly this section, and possibly only the `backend='loop'` case) from ok.py. A pull request on this would be welcome, @tomvansteijn!
@stevenpawley, I'm not exactly sure what you mean by variogram cutoff... Do you mean localized variogram estimation as in issue #41, specifying the maximum lag distance used in estimating the variogram model, or kriging with a moving window (using only a certain number of nearest points, or points within a certain distance) as in @tomvansteijn's comment?
Either way, @tomvansteijn's suggestion definitely fits in with the long-term goals in issue #31, but I also won't have time to work on this (or any of the other refactors) for several more weeks (sorry for losing momentum on those, @rth). PRs are of course always welcome in the meantime!
Hi Benjamin,
By cutoff, I mean the maximum lag distance over which the variogram is calculated when auto-fitting the variogram function parameters. Even with weight=True, auto-fitting in some cases leads to a poor variogram model fit, because the fit is influenced by wild oscillations in the semivariance that some datasets exhibit at very large lag distances. The default cutoff in gstat is 1/3 of the diagonal distance of the dataset, and I think it is usually recommended not to calculate the semivariance of point pairs separated by more than 1/2 the maximum distance of the data, due to this phenomenon.
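To make this concrete, a small sketch of the gstat-style default described above; `default_cutoff` is a hypothetical helper name, not existing PyKrige API:

```python
import numpy as np

def default_cutoff(x, y):
    """gstat-style default cutoff: 1/3 of the diagonal of the
    bounding box that encloses the data points."""
    diagonal = np.hypot(np.ptp(x), np.ptp(y))
    return diagonal / 3.0
```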
I agree it would be useful to have more flexibility to tune the automatic variogram fitting routine. Currently, in the most recent version here (which is different from the most recent version on PyPI), the weighting forces the weights for lags greater than ~70% of the maximum lag to go to zero (see comments here). This is hard-coded at the moment, but it would be easy enough to include a kwarg to allow the user more control. And maybe it makes sense to also enable this kind of weighting by default, with that arbitrary 70% set to 30% or 50% (I hadn't heard the 1/2-the-max-distance rule of thumb before, but I have certainly seen datasets where an auto fit to all lags would be very bad). @stevenpawley, if you'd like to open a PR that'd be great, otherwise I'll add some tweaks as I get the chance in the coming weeks...
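A sketch of the kind of logistic down-weighting described, with the hard-coded 70% breakpoint exposed as a parameter; `breakpoint_fraction` and `steepness` are hypothetical kwargs, and the exact constants and functional form in core.py may differ:

```python
import numpy as np

def variogram_weights(lags, breakpoint_fraction=0.7, steepness=10.0):
    """Logistic weights that smoothly drive the weight of lags beyond
    `breakpoint_fraction` of the lag range toward zero, so long-range
    oscillations in the semivariance don't dominate the model fit."""
    lags = np.asarray(lags, dtype=float)
    drange = np.amax(lags) - np.amin(lags)
    x0 = np.amin(lags) + breakpoint_fraction * drange  # weight ~0.5 here
    k = steepness / drange                             # transition sharpness
    weights = 1.0 / (1.0 + np.exp(k * (lags - x0)))
    return weights / np.sum(weights)
```

These weights would then multiply the residuals inside the least-squares fit of the variogram model, rather than discard lags outright.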
Hi Benjamin,
Thanks for this. I'm away with work at the moment, but I can certainly take a look at this when I get back.
Steve
Has anybody found a workable solution for adding this functionality yet?
Edit: Is there actually more to this than simply modifying the initialization of `dmax` in the function `_initialize_variogram_model` in core.py? E.g.

```python
dmax = cutoff_distance if cutoff_distance else np.amax(d)
```
I am a bit out of my domain of expertise here, but from what I gather this seems to produce the expected results. The distances are then binned as follows:
```python
dmin = np.amin(d)
dd = (dmax - dmin) / nlags
bins = [dmin + n * dd for n in range(nlags)]
dmax += 0.001
bins.append(dmax)
```
Then, for each lag, the semivariance is updated:

```python
for n in range(nlags):
    # This 'if... else...' statement ensures that there are data
    # in the bin so that numpy can actually find the mean. If we
    # don't test this first, then Python kicks out an annoying warning
    # message when there is an empty bin and we try to calculate the mean.
    if d[(d >= bins[n]) & (d < bins[n + 1])].size > 0:
        lags[n] = np.mean(d[(d >= bins[n]) & (d < bins[n + 1])])
        semivariance[n] = np.mean(g[(d >= bins[n]) & (d < bins[n + 1])])
    else:
        lags[n] = np.nan
        semivariance[n] = np.nan
```
This means that the `g` values for which the corresponding `d > cutoff_distance` are ignored.
Are there any flaws in the logic described above?
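Putting the pieces together, a self-contained sketch of the proposed behaviour; `cutoff_distance` is the hypothetical new argument, and the binning mirrors the core.py excerpts quoted above:

```python
import numpy as np
from scipy.spatial.distance import pdist

def experimental_variogram(x, y, z, nlags=6, cutoff_distance=None):
    """Experimental variogram with an optional cutoff distance."""
    z = np.asarray(z, dtype=float)
    d = pdist(np.column_stack((x, y)))                 # pairwise distances
    g = 0.5 * pdist(z[:, None], metric="sqeuclidean")  # semivariances

    # Proposed change: cap the binning range at cutoff_distance so that
    # point pairs farther apart than the cutoff never enter any bin.
    dmax = cutoff_distance if cutoff_distance else np.amax(d)
    dmin = np.amin(d)
    dd = (dmax - dmin) / nlags
    bins = [dmin + n * dd for n in range(nlags)]
    bins.append(dmax + 0.001)

    lags = np.full(nlags, np.nan)
    semivariance = np.full(nlags, np.nan)
    for n in range(nlags):
        mask = (d >= bins[n]) & (d < bins[n + 1])
        if np.any(mask):  # avoid the empty-bin mean warning
            lags[n] = np.mean(d[mask])
            semivariance[n] = np.mean(g[mask])
    return lags, semivariance
```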
Related to #97.
Since we will use the variogram estimation routines of GSTools in the future, we will discuss things like this there: #136. At the moment we are refactoring the variogram estimation submodule: https://github.com/GeoStat-Framework/GSTools/issues/55. Closing for now. Feel free to re-open or (better) discuss in the linked issue.
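For reference, a sketch of how a cutoff falls out naturally from binned variogram estimation in GSTools, assuming the current `gs.vario_estimate` signature (the data here are made up):

```python
import numpy as np
import gstools as gs

x = np.random.uniform(0.0, 10.0, 100)
y = np.random.uniform(0.0, 10.0, 100)
field = np.sin(x) + np.cos(y)

# The last bin edge sets the cutoff directly: point pairs farther
# apart than `cutoff` are never counted. Here, gstat's 1/3-diagonal default.
cutoff = np.hypot(np.ptp(x), np.ptp(y)) / 3.0
bins = np.linspace(0.0, cutoff, 10)

bin_center, gamma = gs.vario_estimate((x, y), field, bins)
```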
Hello,
Thanks for your work on the very useful PyKrige. One feature that I haven't found is the ability to specify the variogram cutoff, i.e. the distance up to which the variogram is calculated. Currently, the variogram in PyKrige appears to be calculated across the full distance of the data, whereas the typical cutoff is 1/3 of the diagonal distance of the data (gstat's default), and being able to specify this is important for many datasets. Apologies if this already exists and I've missed it, but otherwise this would be a useful addition.