JeffreyRacine / R-Package-np

R package np (Nonparametric Kernel Smoothing Methods for Mixed Data Types)
https://socialsciences.mcmaster.ca/people/racinej

choice of regtype and bwmethod for npreg #29

Closed by waynelapierre 3 years ago

waynelapierre commented 3 years ago

I am currently learning your great np package. I noticed that in all of the npreg examples in the np package manual, regtype is set to "ll" and bwmethod is set to "cv.aic". Is this choice generally preferable to the defaults, "lc" and "cv.ls"? If so, why not make "ll" and "cv.aic" the defaults? Thanks.

JeffreyRacine commented 3 years ago

Greetings Wayne,

Thanks for your interest in the methods. The answer is "it depends"...

  1. The LC (local constant) estimator can automatically remove irrelevant predictors by oversmoothing them when appropriate, i.e., when cross-validation assigns a "large" bandwidth (essentially infinity; if the "scale factor" in the bandwidth summary exceeds, say, 5 or so, that predictor has essentially been removed; see the sketch after this list). The local linear estimator cannot do this, since as the bandwidth gets large (oversmoothing) the local linear estimator becomes globally linear and is simply OLS.
  2. In general, cv.aic tends to undersmooth less often... but when we wrote the package the theoretical underpinnings of cv.aic were not yet available.
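
To make point 1 concrete, here is a minimal sketch (the variable names, sample size, and data generating process below are illustrative, not from the package documentation). With the default local constant estimator and least-squares cross-validation, an irrelevant predictor should receive a very large cross-validated bandwidth (scale factor), effectively smoothing it out of the fit:

    library(np)

    ## Illustrative data: x1 matters, x2 does not
    set.seed(42)
    n  <- 250
    x1 <- runif(n)                         # relevant predictor
    x2 <- runif(n)                         # irrelevant predictor (no effect on y)
    y  <- sin(2 * pi * x1) + rnorm(n, sd = 0.25)

    ## Default settings, spelled out here for clarity: regtype = "lc", bwmethod = "cv.ls".
    ## In summary(bw.lc), the scale factor for x2 should be "large" (well above 5),
    ## indicating that x2 has effectively been smoothed out of the regression.
    bw.lc <- npregbw(y ~ x1 + x2, regtype = "lc", bwmethod = "cv.ls")
    summary(bw.lc)

With regtype = "ll", by contrast, letting the bandwidth for x2 grow large only pushes the fit toward a globally linear (OLS) fit in that direction rather than removing it, as noted in point 1.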

So, this is why LC and cv.ls are the defaults (local constant with least squares cross-validated bandwidth selection).
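
For completeness, here is a sketch of how you would explicitly request the non-default choices used in the manual's examples, continuing the illustration above (the objects bw.lc, x1, x2 and y are the illustrative ones defined there):

    ## Local linear regression with AIC-based bandwidth selection
    bw.ll <- npregbw(y ~ x1 + x2, regtype = "ll", bwmethod = "cv.aic")

    ## Fit both specifications and compare the summaries
    model.lc <- npreg(bws = bw.lc)
    model.ll <- npreg(bws = bw.ll)
    summary(model.lc)
    summary(model.ll)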

If you really want to take it up a level, see the function npglpreg() in the R package crs... it uses cross-validation to choose both the order of the polynomial and the bandwidths, per Hall, P. and J.S. Racine (2015), “Infinite Order Cross-Validated Local Polynomial Regression,” Journal of Econometrics, 185, 510-525. To my way of thinking this removes your "LC or LL" choice and allows for any polynomial order, automatically adapting to the underlying DGP... here is a simple illustration...

    library(crs)

    set.seed(42)
    x <- runif(1000, -2, 2)
    y <- x^3 + rnorm(1000)
    model <- npglpreg(y ~ x)
    summary(model)

    Call:
    npglpreg.formula(formula = y ~ x)

    Generalized Local Polynomial Kernel Regression

    Polynomial type: Bernstein
    Using (local) Seifert & Gasser shrinkage for cross-validation
    There is 1 continuous predictor
    Bandwidth type: fixed
    Continuous kernel type: gaussian
    Continuous kernel order: 2
    Bandwidth for x: 0.9974018 (scale factor = 3.406242)
    Degree for x: 3
    Training observations: 1000
    Multiple R-squared: 0.9063
    Cross-validation score: 1.0055181
    Number of multistarts: 5
    Estimation time: 4.4 seconds

Note that the data generating process was in fact x^3, and cross-validation selected a polynomial order of 3 plus a "large" bandwidth with a scale factor of 3.4... this estimator will be more efficient than either the LC or LL estimator in this instance since it has fully "adapted" to the data generating process...

Hope this helps!

Jeff

waynelapierre commented 3 years ago

Thanks so much for the detailed explanation!