Closed JeffreyRacine closed 3 years ago
@JeffreyRacine I ran into this issue earlier today when working with the np package with a few colleagues: @gcgibson, @jfk8889, and @elray1. Can you clarify exactly what the values in the "bandwidth" vector are referring to when bwtype is specified as a _nn option? In our example we had something like this:
> m <- npcdensbw(
xdat = x,
ydat = y,
nmulti = 2,
remin = FALSE,
bwtype = "adaptive_nn",
bwmethod = "cv.ml"
)
> m$xbw
[1] 24 58
> m$ybw
[1] 17
Are the xbw referring to indices in the data for the Kth NN?
In case it is useful, here is the summary of m:
> summary(m)
Conditional density data (208 observations, 3 variable(s))
(1 dependent variable(s), and 2 explanatory variable(s))
No. Complete Observations: 208
No. Incomplete (NA) Observations: 2
Observations omitted or excluded: 1 2
Bandwidth Selection Method: Maximum Likelihood Cross-Validation
Bandwidth Type: Adaptive Nearest Neighbour
Objective Function Value: -1208.155 (achieved on multistart 1)
Exp. Var. Name: L1R10 Bandwidth: 24
Exp. Var. Name: L2R10 Bandwidth: 58
Dep. Var. Name: 10 Bandwidth: 17
Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 2
No. Continuous Dependent Vars.: 1
Estimation Time: 6.38 seconds
Greetings, the values are, as you suspect, the integers 'k' for the kth nearest neighbors for each covariate… the $xbw are for the predictors X, $ybw for the response Y… Jeff
From: Nicholas G Reich notifications@github.com Reply-To: JeffreyRacine/R-Package-np reply@reply.github.com Date: Wednesday, August 9, 2017 at 21:58 To: JeffreyRacine/R-Package-np R-Package-np@noreply.github.com Cc: "Racine, Jeffrey" racinej@mcmaster.ca, Mention mention@noreply.github.com Subject: Re: [JeffreyRacine/R-Package-np] Reporting of bandwidth when using _nn variants (#11)
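To make the meaning of the integer k concrete, here is a minimal base-R sketch of a kth-nearest-neighbour distance for a single continuous variable. It is an illustration only, not the np package's internal code: the helper name `knn_distance`, the use of absolute distance, and the choice to exclude each point's zero distance to itself are all assumptions.

```r
# Hypothetical helper: distance from each observation to its kth nearest
# neighbour (the observation itself is not counted as a neighbour here).
knn_distance <- function(x, k) {
  d <- as.matrix(dist(x))  # pairwise absolute distances for a numeric vector
  # After sorting, position 1 is the zero self-distance, so take position k + 1
  unname(apply(d, 1, function(row) sort(row)[k + 1]))
}

x <- c(0, 1, 3, 7, 15)
knn_distance(x, k = 2)
```

With a nearest-neighbour bwtype, a reported value such as 24 would then play the role of k in a calculation of this kind, so the effective smoothing width varies across the data even though a single integer is reported.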
In my case, I obtain a bw=12. Does this mean that the algorithm is using 12 nearest neighbours as the optimal k to compute the variable bandwidth? Could the user select the k? Thanks.
You mention only 1 bw, so I presume below you are doing univariate density estimation. Yes, as you appreciate the data-driven bandwidth is the 12th NN for each observation.
To specify the NN value manually simply add the options (i.e., append these to those you are already using)
bws=13,bandwidth.compute=FALSE
when you invoke your routine which tells the routine to use the 13th NN (the option bandwidth.compute=FALSE is needed as, otherwise, it will use the value 13 as the starting point for search - see ?npudensbw for details).
Hope this helps clarify!
Jeff
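As a rough sketch of the manual specification described above, the following fixes k = 13 for a univariate density estimate. The simulated data and the object names are hypothetical; `bwtype = "adaptive_nn"` is set explicitly so that the value is treated as a nearest-neighbour count rather than a bandwidth length.

```r
library(np)

set.seed(1)
dat <- data.frame(x = rnorm(200))  # hypothetical data, for illustration only

# bandwidth.compute = FALSE keeps 13 as given instead of using it
# as a starting value for the search
bw <- npudensbw(dat = dat,
                bws = 13,
                bandwidth.compute = FALSE,
                bwtype = "adaptive_nn")

# Density estimate using the manually specified k
fhat <- npudens(tdat = dat, bws = bw)
summary(bw)
```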
Thanks! That is clear now. I know that with this package it is possible to obtain confidence bands for the estimated coefficient based on bootstrap procedures. Is it possible to determine which points are statistically significant? I couldn't find this in the documentation.
Presuming by "estimated coefficient" you mean the derivative of the regression estimate (the nonlinear `marginal effect', i.e., beta(x)=d g(y|x)/dx where g(y|x) is the conditional mean), there is no simple way to tell which estimates of beta(x) are significantly different from zero in a meaningful way. This is because the bootstrap and asymptotic confidence bounds are pointwise and you probably want simultaneous ones... sorry...
The following code will produce the bounds but the intervals are pointwise as noted...
library(np)
data(cps71)
attach(cps71)
ghat <- npreg(logwage~age,regtype="ll")
plot(ghat,gradients=TRUE,plot.errors.method="bootstrap")
abline(h=0)
You can extract the bootstrap standard errors following the examples in ?npplot but, again, probably not what you are looking for.
Hope this helps!
Thanks for the prompt answer and solution, really appreciated.
In your example, for instance, would it be possible to see from the confidence intervals which differences between ages are significant?
I plotted this graph using this function to predict and compute confidence intervals for m(x). I think it does the same operation you mentioned in your response above:
I also found the function npsigtest, but it gives the significance of the entire independent variable "Ages" (and not the significance of the individual points):
age
Bandwidth(s): 3.268425
Individual Significance Tests
P Value:
age < 2.22e-16 ***
---
This paper explains how to calculate CI with bootstrapping. I suppose it's the same approach as the one used above.
Yes, if you want to do proper inference for H_0:\beta(x)=0 a.e. this is the way to do it. You can't use pointwise intervals for the same reason you use a joint F-test rather than a bunch of t-tests (the multiple comparison problem). I guess you could use the pointwise values and do some sort of Bonferroni-Hochberg correction, but that would be up to you and your ability to set this up properly.
Hope this helps.
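For anyone who wants to try the multiplicity correction mentioned above, here is one possible sketch, not something the package offers. It assumes the asymptotic gradient standard errors returned in `ghat$gerr` are adequate for a normal approximation, and it uses base R's `p.adjust`. The helper `adjust_pointwise` is hypothetical. Note that Hochberg's procedure relies on independence or positive dependence among the tests, and neighbouring kernel estimates are strongly correlated, so treat the output as exploratory.

```r
library(np)
data(cps71)

# Hypothetical helper: two-sided pointwise p-values from estimates and
# standard errors, followed by a multiplicity adjustment
adjust_pointwise <- function(est, se, method = "hochberg") {
  z <- est / se
  p_raw <- 2 * pnorm(-abs(z))      # normal approximation (an assumption)
  p.adjust(p_raw, method = method) # "bonferroni" is another option
}

# Local-linear fit with gradient estimates and their standard errors
ghat <- npreg(logwage ~ age, data = cps71, regtype = "ll", gradients = TRUE)

p_adj <- adjust_pointwise(as.numeric(ghat$grad), as.numeric(ghat$gerr))

# Ages where the adjusted p-value for beta(x) = 0 falls below 0.05
sort(unique(cps71$age[which(p_adj < 0.05)]))
```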
So the way to go you mean is the npsigtest function, right? If it turns out significant it would mean that all ages are significantly different from zero but we wouldn't know which ones are significantly different from each other, right? For that we could maybe do ANOVAs?
To specify the NN value manually simply use the options
bws=13,bandwidth.compute=FALSE
I am using npregbw. If I set bws to 13, and bandwidth.compute=FALSE, does the function understand 13 nearest neighbours or bandwidth length = 13? This is not clear to me (again, related to the main question of the thread).
Do I still need to set the bwtype?
re: "we could maybe do ANOVAs"... I have tried to clarify why npsigtest is valid for a global hypothesis \beta(x)=0 a.e., and how a set of pointwise CIs cannot substitute due to a multiple comparison problem... something for a subset of known points could certainly be pursued, but this is outside of any functionality offered in the package so I have nothing further to add.
re: "So I still need to set the bwtype". Indeed! I have edited the comment to reflect this.
Hope this helps!
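A short sketch of the point about bwtype, using the cps71 data from earlier in the thread (the object names are hypothetical). The same number 13 is passed twice: with `bwtype = "adaptive_nn"` it is read as k, while with the default `bwtype = "fixed"` it is read as a bandwidth length on the scale of age.

```r
library(np)
data(cps71)

xdat <- data.frame(age = cps71$age)

# 13 interpreted as the 13th nearest neighbour
bw_nn <- npregbw(xdat = xdat, ydat = cps71$logwage,
                 bws = 13, bandwidth.compute = FALSE,
                 bwtype = "adaptive_nn")

# bwtype omitted: the default is "fixed", so 13 is a bandwidth length
bw_fixed <- npregbw(xdat = xdat, ydat = cps71$logwage,
                    bws = 13, bandwidth.compute = FALSE)

# The "Bandwidth Type" line of each summary shows how 13 was interpreted
summary(bw_nn)
summary(bw_fixed)
```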
A user reports
Another observation: Upon viewing summary(bw), the package reports the following:
Regression Data (5937 observations, 2 variable(s)):
Regression Type: Local-Constant
Bandwidth Selection Method: Expected Kullback-Leibler Cross-Validation
Bandwidth Type: Generalized Nearest Neighbour
Objective Function Value: -0.5980996 (achieved on multistart 1)
Exp. Var. Name: dat.batted_ball_velocity Bandwidth: 37
Exp. Var. Name: dat.angle Bandwidth: 38
Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 2
Estimation Time: 3535.85 seconds
The documentation tells me that the object "bw," if generated using generalized_nn, should contain the nearest neighbors rather than the bandwidths, but here it seems to report bandwidths (and it seems to be a constant bandwidth, even though this is supposedly a variable-bandwidth method).
To avoid confusion, we ought to change "Bandwidth: 37" in such cases to "Kth Nearest Neighbour: 37"