JeffreyRacine / R-Package-np

R package np (Nonparametric Kernel Smoothing Methods for Mixed Data Types)
https://socialsciences.mcmaster.ca/people/racinej
46 stars 18 forks source link

«Error: invalid bandwidth» when using variable bandwidth types for plotting errors. #32

Closed iburunat closed 3 years ago

iburunat commented 3 years ago

Thank you for this excellent complete package.

I am attempting to plot the bootstrapped standard errors using fit computed using an adaptive_nn and fixed bandwidths. I noticed that the plot does not succeed when using the former. It returns the following error:

Error in npreg.rbandwidth(txdat = xdat[indices, ], tydat = ydat[indices],  : 

                             ** Error: invalid bandwidth. 

Here a minimal, reproducible example:

library(np)
x <- c(0.36, 0.49, 0.36, 0.40, 0.10, 0.50, 0.01, 0.80, 0.20)
y <- c(1,2,5,3,4,6,5,0,1)

bw <- np::npregbw(formula = y ~ x, 
                  bwtype = "adaptive_nn",   # or "generalized_nn"; comment this line and it succeeds
                  regtype = "lc")
fit <- np::npreg(bw)

plot(fit, plot.errors.method = "bootstrap")

It would seem that there is something different about the coding of the fixed vs variable bandwidth types. Thanks for your help in advance.

JeffreyRacine commented 3 years ago

Hi,

This happens because you are using such an extremely small sample that some bootstrap samples have fewer than 7 (the k-nn value delivered by cross-validation) unique observations in them. The code is not set up to check for these situations unfortunately. The solution is to either a) use a more realistic sample size (nonparametric estimation with only 9 observations is clearly futile) or write a simple bootstrap procedure that tosses samples with fewer than 7 (k-nn) observations (the k-nn is on either side of points in the interior of the support). Realize the bootstrap is sampling with replacement so taking bootstrap resamples from 9 observations will often deliver samples with fewer than 7 unique observations.

The following will work for your tiny data set (manually use a smaller k-nn which avoids samples with too few unique observations)

model <- npreg(y~x,bwtype="adaptive_nn",bws=2) plot(model,plot.errors.method="bootstrap")

Sorry for the inconvenience, hope this helps!

Jeff

iburunat commented 3 years ago

Thanks, it was a bad choice for a reproducible, minimal example. With my data (n> 2000 points) it throws this error. I will have to check what fails in order to give you a reproducible example if it happens that my code is correct. Will come back to this.

iburunat commented 3 years ago

Ok, here we go again: x is supposed to reproduce ages of participants and y responses ranging from 1 to 5.

example 1: succeeds

set.seed(40)
x <- sample(11:90, 1000, replace=T)    
y <- sample(1:5, 1000, replace=T)    
bw1 <- np::npregbw(formula = y ~ x, bwtype = "adaptive_nn", regtype = "lc")
fit1 <- np::npreg(bws=bw1)
plot(fit1, plot.errors.method = "bootstrap")

example 2: fails

set.seed(44) 
x <- sample(11:90, 1000, replace=T)    
y <- sample(1:5, 1000, replace=T)    
bw1 <- np::npregbw(formula = y ~ x, bwtype = "adaptive_nn", regtype = "lc")
fit1 <- np::npreg(bws=bw1)
plot(fit1, plot.errors.method = "bootstrap")

My real data pattern must contain a similar one to example 2 which makes the algorithm fail.

JeffreyRacine commented 3 years ago

Perhaps avoid the bootstrap resample procedure and use the asymptotic standard errors instead? Or as I said earlier write a simple routine to discard the samples with too few unique observations (would not recommend nor use this personally). Sometimes when things fail it is nature telling you it is a bad idea!

Hope this helps!

JeffreyRacine commented 3 years ago

By the way, the reason your example(s) likely fails is

  1. because x and y are independent so the optimal bandwidth is infinity…
  2. your data are discrete not continuous but you are using a kernel for continuous datatypes...

The following runs fine... if it occasionally were to crap out that would be informative (i.e. I would pay attention not ignore the cause)

library(np) n <- 100 x <- rnorm(n) y <- x + rnorm(n)

fit <- npreg(y ~ x, bwtype = "adaptive_nn", # or "generalized_nn"; comment this line and it succeeds regtype = "lc")

plot(fit, plot.errors.method = "bootstrap")

iburunat commented 3 years ago

Thanks again for your replies :)

By the way, the reason your example(s) likely fails is

  1. because x and y are independent so the optimal bandwidth is infinity…
  2. your data are discrete not continuous but you are using a kernel for continuous datatypes...

Explanation (1) is intriguing. I checked by correlating the variables, but in both cases the correlation was around zero. As for (2), would using ordered() for the y variable (the 5 point scale) make the kernel suited for discrete datatypes? The x variable represents ages, so that is a continuous variable.

The following runs fine... if it occasionally were to crap out that would be informative (i.e. I would pay attention not ignore the cause)

As for your previous response:

Perhaps avoid the bootstrap resample procedure and use the asymptotic standard errors instead?

Could you refer me to a source where to read about the differences between these two methods?

Thanks again!

JeffreyRacine commented 3 years ago

Hi,

re: source on difference between asymptotic and bootstrap standard errors... the book Nonparametric Econometrics by Pagan & Ullah (1999), similar title by Li & Racine (2007), similar title by Racine (2019) (quick search ought to locate)... kindly see the research tab on my McMaster webpage for further details... (https://socialsciences.mcmaster.ca/people/racinej)

Best,

Jeff

iburunat commented 3 years ago

One last thing if I may:

your data are discrete not continuous but you are using a kernel for continuous datatypes...

How to use a kernel suited for discrete types? I used ordered(y) for the y variable (the 4 point scale) but results were identical as not using it. Not sure what else to change in the function. Thanks again!

JeffreyRacine commented 3 years ago

There is no kernel smoothing of the Y variable using kernel regression, only kernel smoothing of the X (hence why there is no difference). So

fit <- npreg(y~ordered(x))

will invoke an ordered kernel for X... to see do something like

summary(fit$bws)

and it will tell you which kernel function(s) were used for each of your predictors...

Hope this helps!

Jeff

iburunat commented 3 years ago

Thanks! I tried that for the sake of experimenting (although my x variable is continuous as it's the ages of participants), using adaptive_nn and results are similar when plotting the curve.

The bandwidth however is not an integer, which puzzles me since this was intended to be a k for nearest neighbors:

              ordered(x)
Bandwidth(s):  0.8895849
Regression Type: Local-Constant
Bandwidth Selection Method: Least Squares Cross-Validation
Formula: y ~ ordered(x)
Bandwidth Type: Adaptive Nearest Neighbour
Objective Function Value: 1.165879 (achieved on multistart 1)

Ordered Categorical Kernel Type: Li and Racine
No. Ordered Categorical Explanatory Vars.: 1

Let me know what you think.

PD: using unordered spitted Error in unordered(x) : could not find function "unordered"

JeffreyRacine commented 3 years ago
  1. NN bandwidths apply to continuous datatypes only, they make no sense for factors (the kernel functions for continuous and categorical datatypes operate totally differently), so the option is ignored if you have only 1 categorical X variable, otherwise with > 1 X variable, at least 1 of which is continuous, adaptive_nn and generalized_nn will work as expected. Perhaps see my recent textbook for an overview of discrete kernel functions if you wish... the bandwidth for ordered(x) reported lies in [0,1] and is real-valued and continuous.

  2. Datatypes in R for the np package are a) numeric(), b) factor(), and c) ordered() (b) and c) are unordered and ordered categorical datatypes, respectively). These are not specific to the np package but are part of the R language and data constructs (as they are for many other statistical computing languages as well). So the error has nothing to do with the np package, you need to use factor(). Perhaps the vignette for the np package might be of value to you...

vignette("np_faq",package="np") provides answers to frequently asked questions vignette("np",package="np") an overview vignette("entropy_np",package="np") an overview of entropy-based methods

Hope this helps!

Jeff