JeffreyRacine / R-Package-np

R package np (Nonparametric Kernel Smoothing Methods for Mixed Data Types)
https://socialsciences.mcmaster.ca/people/racinej

Reporting of bandwidth when using _nn variants #11

Closed JeffreyRacine closed 3 years ago

JeffreyRacine commented 8 years ago

A user reports

Another observation: Upon viewing summary(bw), the package reports the following:

Regression Data (5937 observations, 2 variable(s)):

Regression Type: Local-Constant
Bandwidth Selection Method: Expected Kullback-Leibler Cross-Validation
Bandwidth Type: Generalized Nearest Neighbour
Objective Function Value: -0.5980996 (achieved on multistart 1)

Exp. Var. Name: dat.batted_ball_velocity Bandwidth: 37
Exp. Var. Name: dat.angle Bandwidth: 38

Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 2
Estimation Time: 3535.85 seconds

The documentation tells me that the object "bw," if generated using generalized_nn, should contain the nearest neighbors rather than the bandwidths, but here it seems to report bandwidths (and it seems to be a constant bandwidth, even though this is supposedly a variable-bandwidth method).

To avoid confusion, we ought to change "Bandwidth: 37" in such cases to "Kth Nearest Neighbour: 37"

nickreich commented 7 years ago

@JeffreyRacine I ran into this issue earlier today when working with the np package with a few colleagues: @gcgibson, @jfk8889, and @elray1. Can you clarify exactly what the values in the "bandwidth" vector are referring to when bwtype is specified as a _nn option? In our example we had something like this:

> m <- npcdensbw(
                    xdat = x,
                    ydat = y,
                    nmulti = 2,
                    remin = FALSE,
                    bwtype = "adaptive_nn",
                    bwmethod = "cv.ml"
                )
> m$xbw
[1] 24 58
> m$ybw
[1] 17

Are the xbw referring to indices in the data for the Kth NN?

In case it is useful, here is the summary of m:

> summary(m)

Conditional density data (208 observations, 3 variable(s))
(1 dependent variable(s), and 2 explanatory variable(s))

No. Complete Observations:  208 
No. Incomplete (NA) Observations:  2 
Observations omitted or excluded:  1 2 

Bandwidth Selection Method: Maximum Likelihood Cross-Validation
Bandwidth Type: Adaptive Nearest Neighbour
Objective Function Value: -1208.155 (achieved on multistart 1)

Exp. Var. Name: L1R10 Bandwidth: 24 
Exp. Var. Name: L2R10 Bandwidth: 58  

Dep. Var. Name: 10    Bandwidth: 17 

Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 2
No. Continuous Dependent Vars.: 1
Estimation Time: 6.38 seconds

JeffreyRacine commented 7 years ago

Greetings,

The values are, as you suspect, the integers k for the kth nearest neighbours for each covariate: $xbw holds those for the predictors X, and $ybw those for the response Y.

Jeff
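For readers wondering how an integer k turns into a variable bandwidth, here is a minimal base-R sketch. It assumes the usual definition, in which the bandwidth at a point is the distance to its kth nearest sample point. The helper knn_distance is hypothetical (written for illustration, not an np function), and np's internal computation may differ in detail, for example in whether a point counts as its own neighbour.

```r
# Hypothetical helper, not part of np: distance from each evaluation point
# to its k-th nearest sample point (univariate case).
knn_distance <- function(x_eval, x_sample, k) {
  sapply(x_eval, function(x0) sort(abs(x_sample - x0))[k])
}

set.seed(42)
x <- rnorm(50)   # made-up sample for illustration
k <- 12          # a single integer, like the value reported as "Bandwidth"

# Generalized NN: the bandwidth changes with the point x at which we estimate.
grid <- seq(-2, 2, length.out = 5)
h_generalized <- knn_distance(grid, x, k)

# Adaptive NN: the bandwidth changes with each sample observation x_i.
# Here each x_i counts itself as its first neighbour (distance 0);
# whether np does the same is an assumption.
h_adaptive <- knn_distance(x, x, k)

print(round(h_generalized, 3))
print(range(h_adaptive))
```

Under this reading, the single reported value is constant in k, while the implied bandwidth still varies from point to point.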



AtomicNess123 commented 3 years ago

In my case, I obtain a bw=12. Does this mean that the algorithm is using 12 nearest neighbours as the optimal k to compute the variable bandwidth? Could the user select the k? Thanks.

JeffreyRacine commented 3 years ago

You mention only 1 bw, so I presume below you are doing univariate density estimation. Yes, as you appreciate the data-driven bandwidth is the 12th NN for each observation.

To specify the NN value manually simply add the options (i.e., append these to those you are already using)

bws=13,bandwidth.compute=FALSE

when you invoke your routine, which tells it to use the 13th NN (the option bandwidth.compute=FALSE is needed because otherwise it will use the value 13 as the starting point for the search; see ?npudensbw for details).

Hope this helps clarify!

Jeff
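As a concrete sketch of the options above for univariate density estimation: the example below uses the built-in faithful data purely as a stand-in for your own data, and it sets bwtype explicitly on the assumption that a nearest-neighbour bandwidth is wanted (further down the thread it is confirmed that bwtype still needs to be set).

```r
library(np)

# Stand-in data; replace with your own variable.
x <- faithful$eruptions

# Supply k = 13 directly and skip the data-driven search.
# Without bandwidth.compute = FALSE, 13 would only be a starting value.
bw <- npudensbw(dat = x,
                bws = 13,
                bandwidth.compute = FALSE,
                bwtype = "generalized_nn")

summary(bw)
bw$bw   # the stored value is k, not a distance
```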

AtomicNess123 commented 3 years ago

Thanks! That is clear now. I know that with this package it is possible to obtain confidence bands for the estimated coefficient based on bootstrap procedures. Is it possible to find out which points are statistically significant? I couldn't find this in the documentation.

JeffreyRacine commented 3 years ago

Presuming by "estimated coefficient" you mean the derivative of the regression estimate (the nonlinear "marginal effect", i.e., beta(x)=d g(y|x)/dx where g(y|x) is the conditional mean), there is no simple way to tell which estimates of beta(x) are significantly different from zero in a meaningful way. This is because the bootstrap and asymptotic confidence bounds are pointwise and you probably want simultaneous ones... sorry...

The following code will produce the bounds but the intervals are pointwise as noted...

library(np)
data(cps71)
attach(cps71)

ghat <- npreg(logwage~age,regtype="ll")
plot(ghat,gradients=TRUE,plot.errors.method="bootstrap")
abline(h=0)

You can extract the bootstrap standard errors following the examples in ?npplot but, again, probably not what you are looking for.

Hope this helps!

AtomicNess123 commented 3 years ago

Thanks for the prompt answer and solution, really appreciated.

In your example, for instance, would it be possible from the confidence intervals to see which differences between ages are significant?

I plotted this graph using this function to predict and compute confidence intervals for m(x). I think it does the same operation you mentioned in your response above:

[image: plot of the predicted m(x) with confidence intervals]

I also found the function npsigtest, but it gives the significance of the entire independent variable "Ages" (and not the significance of the individual points):

                  age
Bandwidth(s): 3.268425

Individual Significance Tests
P Value: 
age < 2.22e-16 ***
---

This paper explains how to calculate CI with bootstrapping. I suppose it's the same approach as the one used above.

JeffreyRacine commented 3 years ago

Yes, if you want to do proper inference for H_0:\beta(x)=0 a.e., this is the way to do it. You can't use pointwise intervals for the same reason you use a joint F-test rather than a bunch of t-tests (the multiple comparison problem). I guess you could use the pointwise values and apply some sort of Bonferroni-Hochberg correction, but that would be up to you and your ability to set this up properly.

Hope this helps.
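To illustrate the kind of correction mentioned above, here is a minimal base-R sketch using p.adjust. The p-values are made up for illustration and do not come from np or from the cps71 data. How to obtain valid pointwise p-values from a kernel regression is left open, and since estimates at nearby points are dependent, this is only a rough guide rather than a substitute for simultaneous bands.

```r
# Hypothetical pointwise p-values, one per evaluation point (made up).
p_pointwise <- c(0.001, 0.004, 0.012, 0.030, 0.045, 0.200, 0.600)

# Bonferroni multiplies each p-value by the number of tests;
# Hochberg is a step-up refinement that is never more conservative.
p_bonf <- p.adjust(p_pointwise, method = "bonferroni")
p_hoch <- p.adjust(p_pointwise, method = "hochberg")

print(data.frame(p_pointwise, p_bonf, p_hoch))

# Number of points flagged at the 5% level before and after adjustment.
n_unadjusted <- sum(p_pointwise < 0.05)
n_bonf <- sum(p_bonf < 0.05)
n_hoch <- sum(p_hoch < 0.05)
print(c(unadjusted = n_unadjusted, bonferroni = n_bonf, hochberg = n_hoch))
```

Fewer points are flagged after adjustment, which is the multiple comparison problem in miniature: pointwise intervals overstate how many individual points are significant.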

AtomicNess123 commented 3 years ago

So the way to go you mean is the npsigtest function, right? If it turns out significant it would mean that all ages are significantly different from zero but we wouldn't know which ones are significantly different from each other, right? For that we could maybe do ANOVAs?

AtomicNess123 commented 3 years ago

To specify the NN value manually simply use the options

bws=13,bandwidth.compute=FALSE

I am using npregbw. If I set bws to 13, and bandwidth.compute=FALSE, does the function understand 13 nearest neighbours or bandwidth length = 13? This is not clear to me (again, related to the main question of the thread).

Do I still need to set the bwtype?

JeffreyRacine commented 3 years ago

re: "we could maybe do ANOVAs"... I have tried to clarify why npsigtest is valid for a global hypothesis \beta(x)=0 a.e., and how a set of pointwise CIs cannot substitute due to a multiple comparison problem... something for a subset of known points could certainly be pursued, but this is outside of any functionality offered in the package so I have nothing further to add.

re: "So I still need to set the bwtype". Indeed! I have edited the comment to reflect this.

Hope this helps!
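A minimal sketch of the npregbw case discussed above, assuming the cps71 data from the earlier example as a stand-in. The same numeric value 13 is supplied twice, and only bwtype determines whether it is read as a fixed bandwidth of length 13 or as the 13th nearest neighbour.

```r
library(np)
data(cps71)

# bws = 13 read as a fixed bandwidth (a length on the scale of age).
bw_fixed <- npregbw(xdat = cps71$age, ydat = cps71$logwage,
                    bws = 13, bandwidth.compute = FALSE,
                    bwtype = "fixed")

# bws = 13 read as k, the 13th nearest neighbour.
bw_nn <- npregbw(xdat = cps71$age, ydat = cps71$logwage,
                 bws = 13, bandwidth.compute = FALSE,
                 bwtype = "generalized_nn")

# Both objects store the same number; the "Bandwidth Type" line in the
# summaries shows how it will be interpreted.
summary(bw_fixed)
summary(bw_nn)
```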