JeffreyRacine / R-Package-np

R package np (Nonparametric Kernel Smoothing Methods for Mixed Data Types)
https://socialsciences.mcmaster.ca/people/racinej

npsigtest cannot test when dependent variable is discrete #19

Closed lina0920 closed 5 years ago

lina0920 commented 5 years ago

Thanks in advance to anyone who can answer this question.

Would it be possible to implement the np significance test when the dependent variable is binary or discrete? If I run npsigtest after npreg with a binary dependent variable, it reports the error

"Error in npsigtest.rbandwidth(xdat = xdat, ydat = ydat, bws = bws, ...) : dependent variable must be continuous."

If I run npsigtest after npconmode with a binary dependent variable, it reports the error

"Error in toFrame(xdat) : xdat must be a data frame, matrix, vector, or factor".

Thank you!

JeffreyRacine commented 5 years ago

Can you kindly post code/data to replicate? Presume your dependent variable is of type numeric/integer (needs to be)...

lina0920 commented 5 years ago

Dear Professor Racine,

Thank you so much for answering me. I actually want to apply two of your papers to a nonparametric binary choice model: "Nonparametric estimation of regression functions in the presence of irrelevant regressors" (2007) RES, and "Testing the Significance of Categorical Predictor Variables in Nonparametric Regression Models" (2006) ER.

I start from a very simple DGP: D is a binary dependent variable, Z is binary with Pr(Z=1)=0.5, and X is N(0,1). The true DGP is D=1(aZ+bX+e>0), where e ~ N(0,1) and b=-1. Here I set a=0, so Z is irrelevant.
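For readers without the attached spreadsheet, one replication of this DGP could be simulated roughly as follows. This is only a sketch in base R: the seed and n = 100 are arbitrary assumptions (n matches the "100 training points" in the output further down), and p.true is a name introduced here for the implied Pr(D=1|Z,X).

```r
# One replication of D = 1(a*Z + b*X + e > 0), base R only.
# Assumptions for illustration: the seed and n are arbitrary.
set.seed(42)
n <- 100
a <- 0    # a = 0 makes Z irrelevant
b <- -1

Z <- rbinom(n, size = 1, prob = 0.5)     # binary regressor, Pr(Z = 1) = 0.5
X <- rnorm(n)                            # continuous regressor, N(0, 1)
e <- rnorm(n)                            # error term, N(0, 1)
D <- as.numeric(a * Z + b * X + e > 0)   # binary outcome stored as numeric 0/1

# Since e is standard normal, Pr(D = 1 | Z, X) = pnorm(a * Z + b * X),
# which does not depend on Z when a = 0.
p.true <- pnorm(a * Z + b * X)

table(D, Z)
```

Repeating this with different seeds would give the replications needed to check the size of a test.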

My goals are: (1) Apply your (2007) RES paper to consistently estimate E(D|Z,X)=Pr(D=1|Z,X)=Pr(D=1|X), selecting the bandwidth of Z so that the irrelevant Z is smoothed out when there is a continuous X. (2) Test the significance of the binary variable Z using your (2006) ER paper, and hopefully get the correct size of the test (with, say, 1000 replications).

My questions are: (1) If Z is irrelevant, its bandwidth should be close to 1 to smooth it out. However, both npregbw and npconmode below report a bandwidth for Z close to 0.5 in most of the replications, and I didn't find a bandwidth selection function in the np package that corresponds exactly to the 2007 RES paper. Which function should I use to select the bandwidth when D is binary and there are irrelevant variables?

(2) npsigtest works well after npreg with a continuous dependent variable. But how can I implement a significance test using npsigtest (or other functions) when D is binary?

(3) When D is binary, npreg and npconmode usually give quite different results. The paper "Nonparametric Econometrics: The np Package", JSS (2008), uses npconmode when D is binary. Does this mean npreg is only for continuous dependent variables?
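On question (1), part of the "0.5 versus 1" puzzle may come from which categorical kernel is in use. The output in this post reports the Aitchison and Aitken kernel for Z. As far as I know, np bounds that kernel's bandwidth at (c-1)/c for a variable with c categories, whereas the kernel with weight 1 on the same cell and lambda otherwise (ukertype="liracine" in np) is bounded at 1. The base R sketch below shows that both upper bounds give equal weight to every category, i.e. the variable is smoothed out. The helpers aa.kernel and lr.kernel are hypothetical names written for this illustration, not np functions.

```r
# Aitchison-Aitken weight: 1 - lambda on the same category,
# lambda / (c - 1) on each other category.
# (aa.kernel is a hypothetical helper, not part of np.)
aa.kernel <- function(z.i, z, lambda, c.levels) {
  ifelse(z.i == z, 1 - lambda, lambda / (c.levels - 1))
}

# Li-Racine style weight: 1 on the same category, lambda otherwise.
# (lr.kernel is likewise a hypothetical helper.)
lr.kernel <- function(z.i, z, lambda) {
  ifelse(z.i == z, 1, lambda)
}

z.obs <- c(0, 1)   # the two categories of a binary Z, evaluated at z = 0

w.aa.none <- aa.kernel(z.obs, z = 0, lambda = 0,   c.levels = 2)  # 1.0 0.0
w.aa.full <- aa.kernel(z.obs, z = 0, lambda = 0.5, c.levels = 2)  # 0.5 0.5
w.lr.full <- lr.kernel(z.obs, z = 0, lambda = 1)                  # 1.0 1.0

print(rbind(w.aa.none, w.aa.full, w.lr.full))
```

Under this reading, a bandwidth near 0.5 for a binary Z with the Aitchison and Aitken kernel would play the same role as a bandwidth near 1 in the parameterization the question has in mind.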

Thank you again and here is my data and code for one replication, results and errors. data1.xlsx

Code as follows.

data1=data.frame(D=factor(D),Z=factor(Z),X)
bw.all=npregbw(D~Z+X,regtype="ll",bwmethod="cv.aic",data=data1)
model.np=npreg(bws=bw.all)
summary(model.np)
npsigtest(model.np)

Results:

Regression Data: 100 training points, in 2 variable(s)
                      Z        X
Bandwidth(s): 0.1940525 2.056779

Kernel Regression Estimator: Local-Linear
Bandwidth Type: Fixed
Residual standard error: 0.4074307
R-squared: 0.3265794

Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 1

Unordered Categorical Kernel Type: Aitchison and Aitken
No. Unordered Categorical Explanatory Vars.: 1

Errors:

npsigtest(model.np)
Error in npsigtest.rbandwidth(xdat = xdat, ydat = ydat, bws = bws, ...) : dependent variable must be continuous.

If I do the following code:

model.np1=npconmode(D~Z+X,data=data1)
summary(model.np1)
npsigtest(model.np1)

Results:

Conditional Mode data: 100 training points, in 3 variable(s)
(1 dependent variable(s), and 2 explanatory variable(s))

                                   D
Dep. Var. Bandwidth(s): 1.084088e-07
                                Z         X
Exp. Var. Bandwidth(s): 0.1956927 0.3897006

Bandwidth Type: Fixed

Confusion Matrix
      Predicted
Actual  0  1
     0 50  6
     1 16 28

Overall Correct Classification Ratio: 0.78
Correct Classification Ratio By Outcome:
        0         1
0.8928571 0.6363636

McFadden-Puig-Kerschner performance measure: 0.7508

Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 1

Unordered Categorical Kernel Type: Aitchison and Aitken
No. Unordered Categorical Explanatory Vars.: 1
No. Unordered Categorical Dependent Vars.: 1

Errors:

npsigtest(model.np1)
Error in toFrame(xdat) : xdat must be a data frame, matrix, vector, or factor.

Hope to hear from you soon. Thank you a lot.

JeffreyRacine commented 5 years ago

Apologies for the delay. Note that npreg() and npsigtest() need Y to be numeric (you cast it as factor()). Both npreg() and npconmode() can estimate Pr(Y=1|X=x)... for npreg() the fitted values are Pr(Y=1|X=x) when Y is 0/1... for npconmode it coaxes Pr(Y=1|X=x) from the estimated f(y|x)...
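A minimal sketch of the fix described above, assuming the np package is installed. The data are a simulated stand-in for data1.xlsx (the seed, n = 100, and boot.num = 99 are arbitrary choices for illustration); the only substantive change from the code in the question is that D is left numeric rather than wrapped in factor().

```r
library(np)                     # assumes the np package is installed
options(np.messages = FALSE)    # suppress progress messages

# Simulated stand-in for data1.xlsx (hypothetical; same DGP as in the question)
set.seed(42)
n <- 100
Z <- rbinom(n, 1, 0.5)
X <- rnorm(n)
D <- as.numeric(-X + rnorm(n) > 0)

# Key change: D stays numeric (0/1); only the regressor Z is a factor.
data1 <- data.frame(D = D, Z = factor(Z), X = X)

bw.all <- npregbw(D ~ Z + X, regtype = "ll", bwmethod = "cv.aic", data = data1)
model.np <- npreg(bws = bw.all)
summary(model.np)

# With a 0/1 outcome the fitted conditional mean estimates Pr(D = 1 | Z, X).
p.hat <- fitted(model.np)

# boot.num is reduced here only to keep the sketch quick to run.
sig <- npsigtest(model.np, boot.num = 99)
summary(sig)
```

With regtype = "ll" the fitted values are not constrained to [0, 1], so comparing against a local-constant fit (regtype = "lc") may be worthwhile if estimates outside that range are a concern.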

Hope this helps!

lina0920 commented 5 years ago

Dear Prof. Racine,

Yes it helps a lot. Thank you for your time!

AtomicNess123 commented 3 years ago

Hi, I have a question of a more basic nature. I have recently started using nonparametric regression. When using kernel regression to predict y (a behavioural variable) from an x variable (e.g., age) separately for two groups (males and females), is there a way to statistically compare the two kernel regression curves and say whether they are significantly different?

JeffreyRacine commented 3 years ago

Sure,

Kindly see ?npsigtest... this conducts a significance test for each predictor by default, or for one predictor if you prefer (you provide the `index' of the variable, e.g., if your first regressor is age and second sex then use the option index=2).

If the null of irrelevance of the predictor `sex' is not rejected, conclude there is no statistically significant difference between the conditional mean E(y|x1) and E(y|x1,x2) (i.e., x2 is "irrelevant").
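As an illustration of the index option, here is a sketch on hypothetical simulated data (not from this thread), assuming the np package is installed. The variable names, the shape of the age effect, the shift for males, and boot.num = 99 are all assumptions made for the example.

```r
library(np)                     # assumes the np package is installed
options(np.messages = FALSE)

# Hypothetical data: y depends smoothly on age, with a level shift for males.
set.seed(1)
n <- 200
age <- runif(n, 20, 60)
sex <- factor(sample(c("female", "male"), n, replace = TRUE))
y <- sin(age / 10) + 0.5 * (sex == "male") + rnorm(n, sd = 0.3)
dat <- data.frame(y = y, age = age, sex = sex)

# Regress y on both predictors; age is the first regressor and sex the second.
bw <- npregbw(y ~ age + sex, data = dat)
model <- npreg(bws = bw)

# index = 2 restricts the test to the second regressor (sex).
# boot.num is reduced here only to keep the sketch quick to run.
sig.sex <- npsigtest(model, index = 2, boot.num = 99)
summary(sig.sex)
```

A small p-value for sex would reject the null of irrelevance, which is evidence that the regression of y on age differs between the two groups.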

Hope this helps.

AtomicNess123 commented 3 years ago

Thanks for your time and patience, and apologies for a follow-up question:

"If the null of irrelevance of the predictor `sex' is not rejected, conclude there is no statistically significant difference between the conditional mean E(y|x1) and E(y|x1,x2) (i.e., x2 is "irrelevant")."

Does this mean there is no difference between the mean of y given x1 (male) and the mean of y given both male and female? I can't grasp this. What I am aiming to test is whether there is a significant difference between the male and female curves for y.

JeffreyRacine commented 3 years ago

It is the counterpart to a test of significance in parametric regression, that's all. You simply test the significance of the predictor `sex'.

Hope this helps.