Closed AlJacq closed 3 years ago
Greetings,
Thanks for your interest in the package.
First, regarding "not so stable"... When you change optic.reltol to .1 expect very unstable results (default is 1.490116e-08). Try nmulti=1, 10, etc. Also, not sure what you mean by "stable" versus "not so stable"... numerical optimization with local minima can lead to instabilities which is why multi starting is highly recommended... below I use nmulti=1,5, 10,100, all produce the same results (up to optimization noise)
R version 3.6.2 (2019-12-12) -- "Dark and Stormy Night"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin19.3.0/x86_64 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> rm(list=ls())
> library(np)
Nonparametric Kernel Methods for Mixed Datatypes (version 0.60-10)
[vignette("np_faq",package="np") provides answers to frequently asked questions]
[vignette("np",package="np") an overview]
[vignette("entropy_np",package="np") an overview of entropy-based methods]
> options(np.messages=FALSE)
> data(wage1)
> attach(wage1)
>
>
> ## Fit the model on wage
>
> model <- npindex(wage ~ female + married + educ + exper + tenure, nmulti=1)
> coef(model)
female married educ exper tenure
1.00000000 -0.32447740 0.35694446 0.01409685 0.05058387
> summary(model)
Single Index Model
Regression Data: 526 training points, in 5 variable(s)
female married educ exper tenure
Beta: 1 -0.3244774 0.3569445 0.01409685 0.05058387
Bandwidth: 0.382716
Kernel Regression Estimator: Local-Constant
Residual standard error: 2.833315
R-squared: 0.4148631
Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 1
>
> model <- npindex(wage ~ female + married + educ + exper + tenure, nmulti=5)
> coef(model)
female married educ exper tenure
1.00000000 -0.32404705 0.35658793 0.01404960 0.05047562
> summary(model)
Single Index Model
Regression Data: 526 training points, in 5 variable(s)
female married educ exper tenure
Beta: 1 -0.324047 0.3565879 0.0140496 0.05047562
Bandwidth: 0.3826656
Kernel Regression Estimator: Local-Constant
Residual standard error: 2.833412
R-squared: 0.4148412
Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 1
>
> model <- npindex(wage ~ female + married + educ + exper + tenure, nmulti=10)
> coef(model)
female married educ exper tenure
1.00000000 -0.32404705 0.35658793 0.01404960 0.05047562
> summary(model)
Single Index Model
Regression Data: 526 training points, in 5 variable(s)
female married educ exper tenure
Beta: 1 -0.324047 0.3565879 0.0140496 0.05047562
Bandwidth: 0.3826656
Kernel Regression Estimator: Local-Constant
Residual standard error: 2.833412
R-squared: 0.4148412
Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 1
>
> model <- npindex(wage ~ female + married + educ + exper + tenure, nmulti=100)
> coef(model)
female married educ exper tenure
1.00000000 -0.32425708 0.35657164 0.01404367 0.05047304
> summary(model)
Single Index Model
Regression Data: 526 training points, in 5 variable(s)
female married educ exper tenure
Beta: 1 -0.3242571 0.3565716 0.01404367 0.05047304
Bandwidth: 0.3826423
Kernel Regression Estimator: Local-Constant
Residual standard error: 2.833411
R-squared: 0.4148411
Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 1
>
> proc.time()
user system elapsed
582.190 2.588 586.219
Second, both the wage and log wage specifications provide comparable fits to the data (R^2 of 40-45%)... so I guess I am not sure what you were expecting...
R version 3.6.2 (2019-12-12) -- "Dark and Stormy Night"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin19.3.0/x86_64 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> rm(list=ls())
> library(np)
Nonparametric Kernel Methods for Mixed Datatypes (version 0.60-10)
[vignette("np_faq",package="np") provides answers to frequently asked questions]
[vignette("np",package="np") an overview]
[vignette("entropy_np",package="np") an overview of entropy-based methods]
> options(np.messages=FALSE)
> data(wage1)
> attach(wage1)
>
> ## Function for robust standard error used to recover "scale factor"
> ## (i.e., constant c in h=c*sd*n^{-1/5} where h is the bandwidth, sd
> ## robust standard error, n the sample size
>
> EssDee <- function(y){
+ if(any(dim(as.matrix(y)) == 0))
+ return(0)
+ sd.vec <- apply(as.matrix(y),2,sd)
+ IQR.vec <- apply(as.matrix(y),2,IQR)/1.349
+ mad.vec <- apply(as.matrix(y),2,mad)
+ a <- apply(cbind(sd.vec,IQR.vec,mad.vec),1, function(x) max(x))
+ if(any(a<=0)) warning(paste("variable ",which(a<=0)," appears to be constant",sep=""))
+ a <- apply(cbind(sd.vec,IQR.vec,mad.vec),1, function(x) min(x[x>0]))
+ return(a)
+ }
>
> ## Fit the model on logwage
>
> lmodel <- npindex(lwage ~ female + married + educ + exper + tenure)
> coef(lmodel)
female married educ exper tenure
1.000000000 -0.105668111 0.059893676 0.001208391 0.014349667
> summary(lmodel)
Single Index Model
Regression Data: 526 training points, in 5 variable(s)
female married educ exper tenure
Beta: 1 -0.1056681 0.05989368 0.001208391 0.01434967
Bandwidth: 0.07319441
Kernel Regression Estimator: Local-Constant
Residual standard error: 0.4011912
R-squared: 0.431404
Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 1
> ## Back out the scale factor
> lsf <- lmodel$bw/(EssDee(lmodel$index)*NROW(wage1)^{-1/5})
> lsf
0.4410832
>
> ## Fit the model on wage
>
> model <- npindex(wage ~ female + married + educ + exper + tenure)
> coef(model)
female married educ exper tenure
1.00000000 -0.32404705 0.35658793 0.01404960 0.05047562
> summary(model)
Single Index Model
Regression Data: 526 training points, in 5 variable(s)
female married educ exper tenure
Beta: 1 -0.324047 0.3565879 0.0140496 0.05047562
Bandwidth: 0.3826656
Kernel Regression Estimator: Local-Constant
Residual standard error: 2.833412
R-squared: 0.4148412
Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 1
> ## Back out the scale factor
> sf <- model$bw/(EssDee(model$index)*NROW(wage1)^{-1/5})
> sf
1.087715
>
> ## Some plots
>
> par(mfrow=c(2,2))
>
> hist(lwage)
> rug(lwage)
>
> hist(wage)
> rug(wage)
>
> plot(model$index,log(fitted(model)))
> plot(lmodel$index,fitted(lmodel),col=2)
>
> proc.time()
user system elapsed
50.828 0.424 51.416
Dear Jeffrey,
Thank you very much for your quick answer. Oddly enough, I had different results than these. It appears that we are not using exactly the same dataset. The data are the same but the nature of the variables is different. In wooldridge
package, the female
and married
variables are coded as numeric. In np
, they are factors.
I thought that the problem came from the fact that these binary variables were coded as numeric (and not factors) but it's not. Actually, the difference is
wooldridge
, the category of reference for female
is 0
(hence, male)np
, the category of reference for female
is Female
.
Because of this coding difference, it's normal to observe different theta vectors between the two (since the first variable is coded differently and its coefficient is set to 1). However, the amplitude of the thetas in my case (see model2 in the example below) is very odd. In the following example, I run two regressions
female
. The amplitude of the obtained thetas is extremely large. library(np)
data(wage1)
model1 <- npindex(wage ~ female + married + educ + exper + tenure,
data = wage1,
nmulti=1)
model1$beta
Data <- data.frame(
wage = wage1$wage,
female = factor(wage1$female,levels=c("Male","Female")),
married = wage1$married,
educ = wage1$educ,
exper = wage1$exper,
tenure = wage1$tenure
)
model2 <- npindex(wage ~ female + married + educ + exper + tenure,
data = Data,
nmulti=1)
model2$beta
Results :
model1$beta xdat2 xdat3 xdat4 xdat5 1.00000000 -0.32447740 0.35694446 0.01409685 0.05058387 model2$beta xdat2 xdat3 xdat4 xdat5 1.000 -873609.401 424638.436 5740.778 78160.615
Have you already encountered this kind of issue ?
Thank you once again !
Edit : Interestingly, it seems that scaling the response (dividing by its standard deviation) seems to solve the problem (meaning that the estimated thetas are back to "more likely" amplitudes).
model9 <- npindex(wage/sd(wage) ~ female + married + educ + exper + tenure,
data = wage1,
nmulti=1)
model9$beta ; model1$beta
model10 <- npindex(wage/sd(wage) ~ female + married + educ + exper + tenure,
data = Data,
nmulti=1)
model2$beta ; model10$beta
Results:
model9$beta ; model1$beta xdat2 xdat3 xdat4 xdat5 1.00000000 -0.32440232 0.35656960 0.01403641 0.05047819 xdat2 xdat3 xdat4 xdat5 1.00000000 -0.32447740 0.35694446 0.01409685 0.05058387 model2$beta ; model10$beta xdat2 xdat3 xdat4 xdat5 1.000 -873609.401 424638.436 5740.778 78160.615 xdat2 xdat3 xdat4 xdat5 1.00000000 -0.10161381 0.20144148 0.01680353 0.09288232
Alexandre
Greetings,
Most welcome and no worries...
You appear to be violating the data in an unnatural way... if you get odd results doing unnatural things, my advice would be the same as someone telling a doctor "it hurts when I do xxx"... the doctor may reply "then don't do xxx"...
Seriously, in R you code a factor via Xf <- factor(X) or simply X~factor(X)... so I would advise simply doing the standard thing and leave it at that (I don't have the energy or time to go digging into the nuances of overriding levels but feel free to dig deeper and report back)
re: scaling and magnitude... the thetas are only identified up to a constant so the norming by theta[1] <- 1 divides all other coefficients by the initial value of theta[1] and indeed if this value changes when you reassign levels and is small in one instance or the other this can have the effects you mention...
You are focusing on a coefficient vector... I would suggest instead looking at the model fit...
But personally I don't know why people use these models... it is schizophrenic to be agnostic about one part of the model yet maintain correct specification of the other... in other words, to me the coefficients are simply nuisance parameters... I want to know how the model predicts the data and new draws from the same DGP... what you are focusing on to my way of thinking is smoke and mirrors... but that is my cynical pre-coffee morning perspective, it will get better after I take my meds ;-)
Hello!
I am trying to fit a single-index model on the wage1 dataset (Wooldridge 2003). While the vignette (p27) chooses the log-wages as response variable, I choose to work on the (non-transformed) wages. However, this produces strange and unstable results. See the code below
Single-index regression on wages
Let's increase optim.reltol and allow for more multistart
Let's use this bandwidth as starting point
Change the random seed
Am I doing anything wrong ? What could be a sensible way to solve this issue ? By the way, I was wondering whether
bwtype
was supported in functionnpindexbw
. Setting for examplebwtype="adaptive_nn"
doesn't seem to change the obtained results.Thanks in advance !