JeffreyRacine / R-Package-np

R package np (Nonparametric Kernel Smoothing Methods for Mixed Data Types)
https://socialsciences.mcmaster.ca/people/racinej

References for the data-driven bandwidth selection #33

Closed zdhjeff closed 2 years ago

zdhjeff commented 2 years ago

Hi Jeff,

Thank you for the wonderful package! I am wondering what methodologies lie behind the data-driven bandwidth selection methods. I noticed that there is a 'bwtype' argument that can be set to 'generalized_nn' (compute generalized nearest neighbors) or 'adaptive_nn' (compute adaptive nearest neighbors); how can this be done when there are categorical covariates? Could you mention some references related to this? Thanks!

JeffreyRacine commented 2 years ago

Greetings,

Thanks for the kind words!

re: adaptive/generalized nearest neighbor... these apply to the continuous covariates only... see the bandwidth selection routine man pages for details... for instance, ?npregbw (or ?npudensbw, etc.) provides

"Three classes of kernel estimators for the continuous data types are available: fixed, adaptive nearest-neighbor, and generalized nearest-neighbor. Adaptive nearest-neighbor bandwidths change with each sample realization in the set, x[i], when estimating the density at the point x. Generalized nearest-neighbor bandwidths change with the point at which the density is estimated, x. Fixed bandwidths are constant over the support of x."

When there is a combination of categorical and continuous variables, the categorical bandwidth (say, λ ∈ [0, 1]) is determined via numerical optimization, and the continuous bandwidth is determined in the same way. If you use either of the nearest-neighbor methods, the optimization instead selects the integer k that determines the k-th nearest neighbor, per the details above.
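For what it's worth, here is a minimal sketch in R (simulated data; the variable names and data-generating process are purely illustrative) of selecting nearest-neighbor bandwidths by cross-validation when one covariate is continuous and one is categorical:

```r
library(np)

set.seed(42)
n <- 250
x <- runif(n)                              # continuous covariate
z <- factor(rbinom(n, 1, 0.5))             # categorical (unordered factor) covariate
y <- sin(2 * pi * x) + 0.5 * (z == "1") + rnorm(n, sd = 0.25)

## generalized nearest-neighbor bandwidths, least-squares cross-validation
bw <- npregbw(y ~ x + z, regtype = "ll", bwmethod = "cv.ls",
              bwtype = "generalized_nn")
summary(bw)   # reports k for x and lambda in [0, 1] for z
```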

For references perhaps start with those listed in ?npregbw (e.g., Racine, J.S. and Q. Li (2004), “Nonparametric estimation of regression functions with both categorical and continuous data,” Journal of Econometrics, 119, 99-130.)... the books provide further details (Princeton (2007) and Cambridge (2019), see my web page for supplementary material)...

Hope this helps!

zdhjeff commented 2 years ago

Thank you for the kind reply! Just to confirm: by saying "...When there is a combination of categorical and continuous variables... If you use either of the nearest neighbor methods it selects the integer k that determines the k-th nearest neighbor per the details above", do you mean that it is still possible to obtain a data-driven bandwidth (i.e., using the adaptive/generalized nearest-neighbor methods) when there is a combination of categorical and continuous variables?

JeffreyRacine commented 2 years ago

Yes, of course. To clarify, it is not "still possible", it is the default... that is, when you use cross-validation to select smoothing parameters, cross-validation is used to select all smoothing parameters, treating each data type appropriately. It would be strange if anything else happened, would it not? Hope this helps!
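To illustrate (again a sketch with simulated data and illustrative names): the plain default call already cross-validates every smoothing parameter, one bandwidth per continuous covariate and one λ per categorical covariate.

```r
library(np)

set.seed(42)
n <- 250
x <- runif(n)
z <- factor(rbinom(n, 1, 0.5))
y <- sin(2 * pi * x) + 0.5 * (z == "1") + rnorm(n, sd = 0.25)

bw <- npregbw(y ~ x + z)   # defaults: bwmethod = "cv.ls", bwtype = "fixed"
summary(bw)                # cross-validated bandwidth for x, lambda for z
```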

zdhjeff commented 2 years ago

Yes, I tried a few examples and they went well. The references are also quite helpful. Thanks!

zdhjeff commented 2 years ago

I guess I am still a little confused about the meaning of "data-driven bandwidth" here. If it is a "data-driven bandwidth", will the calculated bandwidth also depend on the new data point that is to be predicted (rather than just on the observed data points)?

JeffreyRacine commented 2 years ago

Hi,

“data-driven” simply means using the sample at hand to assign a value to each bandwidth, as opposed to using some ad hoc value(s). If you select cross-validation, the sample is used to numerically minimize (or maximize, as appropriate) the cross-validation function. Since, by definition, a “new” data point is not part of the sample used to generate the model, it is not used to compute the bandwidths.
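To make the point concrete, a minimal sketch (simulated data, illustrative names, assuming the usual npreg/predict workflow): the bandwidths are cross-validated on the estimation sample only, and a new observation enters only at prediction time, re-using those bandwidths.

```r
library(np)

set.seed(42)
n <- 250
dat <- data.frame(x = runif(n), z = factor(rbinom(n, 1, 0.5)))
dat$y <- sin(2 * pi * dat$x) + 0.5 * (dat$z == "1") + rnorm(n, sd = 0.25)

bw    <- npregbw(y ~ x + z, data = dat)   # bandwidths chosen from the sample only
model <- npreg(bws = bw)                  # fit using those bandwidths

## a "new" point plays no role in bandwidth selection; it is used only at
## prediction time, with the already-computed bandwidths
newdat <- data.frame(x = 0.3, z = factor("1", levels = levels(dat$z)))
predict(model, newdata = newdat)
```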

Hope this clarifies!

zdhjeff commented 2 years ago

Yeah, that is an excellent explanation. Thanks!