Unstable results using npindexbw

Hello!

I am trying to fit a single-index model on the wage1 dataset (Wooldridge 2003). While the vignette (p27) chooses the log-wages as response variable, I choose to work on the (non-transformed) wages. However, this produces strange and unstable results. See the code below

library(wooldridge)
library(np)
data(wage1)

Single-index regression on wages

model <- npindexbw(wage ~ female + married + educ + exper + tenure,
                   data = wage1)
model$beta # Produces strange values
model$bw # This is extremely large

Let's increase optim.reltol and allow for more multistart

model.0 <- npindexbw(wage ~ female + married + educ + exper + tenure,
                     optim.reltol = 0.1,
                     nmulti = 200,
                     data = wage1)
model.0$beta # More likely
model.0$bw # same

Let's use this bandwidth as starting point

model.1 <- npindexbw(wage ~ female + married + educ + exper + tenure,
                     bws = model.0,
                     data = wage1)
model.1$beta 
model.1$bw

Change the random seed

model.2 <- npindexbw(wage ~ female + married + educ + exper + tenure,
                     optim.reltol = 0.1,
                     nmulti = 200,
                     data = wage1,
                     random.seed = 1000)
model.2$beta ; model.0$beta # Quite stable
model.2$bw ; model.0$bw # Same

model.3 <- npindexbw(wage ~ female + married + educ + exper + tenure,
                     bws = model.2,
                     data = wage1,
                     random.seed = 1000)
model.3$beta ; model.1$beta # Not so stable
model.3$bw ; model.1$bw # Same

Am I doing anything wrong ? What could be a sensible way to solve this issue ? By the way, I was wondering whether bwtype was supported in function npindexbw. Setting for example bwtype="adaptive_nn" doesn't seem to change the obtained results.

Thanks in advance !

Greetings,

Thanks for your interest in the package.

First, regarding "not so stable"... When you change optic.reltol to .1 expect very unstable results (default is 1.490116e-08). Try nmulti=1, 10, etc. Also, not sure what you mean by "stable" versus "not so stable"... numerical optimization with local minima can lead to instabilities which is why multi starting is highly recommended... below I use nmulti=1,5, 10,100, all produce the same results (up to optimization noise)

R version 3.6.2 (2019-12-12) -- "Dark and Stormy Night"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin19.3.0/x86_64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> rm(list=ls())
> library(np)
Nonparametric Kernel Methods for Mixed Datatypes (version 0.60-10)
[vignette("np_faq",package="np") provides answers to frequently asked questions]
[vignette("np",package="np") an overview]
[vignette("entropy_np",package="np") an overview of entropy-based methods]
> options(np.messages=FALSE)
> data(wage1)
> attach(wage1)
> 
> 
> ## Fit the model on wage
> 
> model <- npindex(wage ~ female + married + educ + exper + tenure, nmulti=1)
> coef(model)
     female     married        educ       exper      tenure 
 1.00000000 -0.32447740  0.35694446  0.01409685  0.05058387 
> summary(model)

Single Index Model
Regression Data: 526 training points, in 5 variable(s)

      female    married      educ      exper     tenure
Beta:      1 -0.3244774 0.3569445 0.01409685 0.05058387
Bandwidth: 0.382716
Kernel Regression Estimator: Local-Constant

Residual standard error: 2.833315
R-squared: 0.4148631

Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 1

> 
> model <- npindex(wage ~ female + married + educ + exper + tenure, nmulti=5)
> coef(model)
     female     married        educ       exper      tenure 
 1.00000000 -0.32404705  0.35658793  0.01404960  0.05047562 
> summary(model)

Single Index Model
Regression Data: 526 training points, in 5 variable(s)

      female   married      educ     exper     tenure
Beta:      1 -0.324047 0.3565879 0.0140496 0.05047562
Bandwidth: 0.3826656
Kernel Regression Estimator: Local-Constant

Residual standard error: 2.833412
R-squared: 0.4148412

Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 1

> 
> model <- npindex(wage ~ female + married + educ + exper + tenure, nmulti=10)
> coef(model)
     female     married        educ       exper      tenure 
 1.00000000 -0.32404705  0.35658793  0.01404960  0.05047562 
> summary(model)

Single Index Model
Regression Data: 526 training points, in 5 variable(s)

      female   married      educ     exper     tenure
Beta:      1 -0.324047 0.3565879 0.0140496 0.05047562
Bandwidth: 0.3826656
Kernel Regression Estimator: Local-Constant

Residual standard error: 2.833412
R-squared: 0.4148412

Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 1

> 
> model <- npindex(wage ~ female + married + educ + exper + tenure, nmulti=100)
> coef(model)
     female     married        educ       exper      tenure 
 1.00000000 -0.32425708  0.35657164  0.01404367  0.05047304 
> summary(model)

Single Index Model
Regression Data: 526 training points, in 5 variable(s)

      female    married      educ      exper     tenure
Beta:      1 -0.3242571 0.3565716 0.01404367 0.05047304
Bandwidth: 0.3826423
Kernel Regression Estimator: Local-Constant

Residual standard error: 2.833411
R-squared: 0.4148411

Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 1

> 
> proc.time()
   user  system elapsed 
582.190   2.588 586.219

Second, both the wage and log wage specifications provide comparable fits to the data (R^2 of 40-45%)... so I guess I am not sure what you were expecting...

R version 3.6.2 (2019-12-12) -- "Dark and Stormy Night"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin19.3.0/x86_64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> rm(list=ls())
> library(np)
Nonparametric Kernel Methods for Mixed Datatypes (version 0.60-10)
[vignette("np_faq",package="np") provides answers to frequently asked questions]
[vignette("np",package="np") an overview]
[vignette("entropy_np",package="np") an overview of entropy-based methods]
> options(np.messages=FALSE)
> data(wage1)
> attach(wage1)
> 
> ## Function for robust standard error used to recover "scale factor"
> ## (i.e., constant c in h=c*sd*n^{-1/5} where h is the bandwidth, sd
> ## robust standard error, n the sample size
> 
> EssDee <- function(y){
+   if(any(dim(as.matrix(y)) == 0))
+     return(0)
+   sd.vec <- apply(as.matrix(y),2,sd)
+   IQR.vec <- apply(as.matrix(y),2,IQR)/1.349
+   mad.vec <- apply(as.matrix(y),2,mad)
+   a <- apply(cbind(sd.vec,IQR.vec,mad.vec),1, function(x) max(x))
+   if(any(a<=0)) warning(paste("variable ",which(a<=0)," appears to be constant",sep=""))
+   a <- apply(cbind(sd.vec,IQR.vec,mad.vec),1, function(x) min(x[x>0]))  
+   return(a)
+ }
> 
> ## Fit the model on logwage
> 
> lmodel <- npindex(lwage ~ female + married + educ + exper + tenure)
> coef(lmodel)
      female      married         educ        exper       tenure 
 1.000000000 -0.105668111  0.059893676  0.001208391  0.014349667 
> summary(lmodel)

Single Index Model
Regression Data: 526 training points, in 5 variable(s)

      female    married       educ       exper     tenure
Beta:      1 -0.1056681 0.05989368 0.001208391 0.01434967
Bandwidth: 0.07319441
Kernel Regression Estimator: Local-Constant

Residual standard error: 0.4011912
R-squared: 0.431404

Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 1

> ## Back out the scale factor
> lsf <- lmodel$bw/(EssDee(lmodel$index)*NROW(wage1)^{-1/5})
> lsf

0.4410832 
> 
> ## Fit the model on wage
> 
> model <- npindex(wage ~ female + married + educ + exper + tenure)
> coef(model)
     female     married        educ       exper      tenure 
 1.00000000 -0.32404705  0.35658793  0.01404960  0.05047562 
> summary(model)

Single Index Model
Regression Data: 526 training points, in 5 variable(s)

      female   married      educ     exper     tenure
Beta:      1 -0.324047 0.3565879 0.0140496 0.05047562
Bandwidth: 0.3826656
Kernel Regression Estimator: Local-Constant

Residual standard error: 2.833412
R-squared: 0.4148412

Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 1

> ## Back out the scale factor
> sf <- model$bw/(EssDee(model$index)*NROW(wage1)^{-1/5})
> sf

1.087715 
> 
> ## Some plots
> 
> par(mfrow=c(2,2))
> 
> hist(lwage)
> rug(lwage)
> 
> hist(wage)
> rug(wage)
> 
> plot(model$index,log(fitted(model)))
> plot(lmodel$index,fitted(lmodel),col=2)
> 
> proc.time()
   user  system elapsed 
 50.828   0.424  51.416

Dear Jeffrey,

Thank you very much for your quick answer. Oddly enough, I had different results than these. It appears that we are not using exactly the same dataset. The data are the same but the nature of the variables is different. In wooldridge package, the female and married variables are coded as numeric. In np, they are factors.

I thought that the problem came from the fact that these binary variables were coded as numeric (and not factors) but it's not. Actually, the difference is

in wooldridge, the category of reference for female is 0 (hence, male)
in np, the category of reference for female is Female. Because of this coding difference, it's normal to observe different theta vectors between the two (since the first variable is coded differently and its coefficient is set to 1). However, the amplitude of the thetas in my case (see model2 in the example below) is very odd.

In the following example, I run two regressions

The first is exactly the same as the first one in your answer. I do obtain the same results as you.
In the second, I change only one thing : the order of the levels for variable female. The amplitude of the obtained thetas is extremely large.

library(np)
data(wage1)

model1 <- npindex(wage ~ female + married + educ + exper + tenure,
                  data = wage1,
                  nmulti=1)
model1$beta

Data <- data.frame(
  wage = wage1$wage,
  female = factor(wage1$female,levels=c("Male","Female")),
  married = wage1$married,
  educ = wage1$educ,
  exper = wage1$exper,
  tenure = wage1$tenure
)

model2 <- npindex(wage ~ female + married + educ + exper + tenure,
                  data = Data,
                  nmulti=1)
model2$beta

Results :

model1$beta xdat2 xdat3 xdat4 xdat5 1.00000000 -0.32447740 0.35694446 0.01409685 0.05058387 model2$beta xdat2 xdat3 xdat4 xdat5 1.000 -873609.401 424638.436 5740.778 78160.615

Have you already encountered this kind of issue ?

Thank you once again !

Edit : Interestingly, it seems that scaling the response (dividing by its standard deviation) seems to solve the problem (meaning that the estimated thetas are back to "more likely" amplitudes).

model9 <- npindex(wage/sd(wage) ~ female + married + educ + exper + tenure,
                  data = wage1,
                  nmulti=1)
model9$beta ; model1$beta

model10 <- npindex(wage/sd(wage) ~ female + married + educ + exper + tenure,
                  data = Data,
                  nmulti=1)
model2$beta ; model10$beta

Results:

model9$beta ; model1$beta xdat2 xdat3 xdat4 xdat5 1.00000000 -0.32440232 0.35656960 0.01403641 0.05047819 xdat2 xdat3 xdat4 xdat5 1.00000000 -0.32447740 0.35694446 0.01409685 0.05058387 model2$beta ; model10$beta xdat2 xdat3 xdat4 xdat5 1.000 -873609.401 424638.436 5740.778 78160.615 xdat2 xdat3 xdat4 xdat5 1.00000000 -0.10161381 0.20144148 0.01680353 0.09288232

Alexandre

Greetings,

Most welcome and no worries...

You appear to be violating the data in an unnatural way... if you get odd results doing unnatural things, my advice would be the same as someone telling a doctor "it hurts when I do xxx"... the doctor may reply "then don't do xxx"...

Seriously, in R you code a factor via Xf <- factor(X) or simply X~factor(X)... so I would advise simply doing the standard thing and leave it at that (I don't have the energy or time to go digging into the nuances of overriding levels but feel free to dig deeper and report back)

re: scaling and magnitude... the thetas are only identified up to a constant so the norming by theta[1] <- 1 divides all other coefficients by the initial value of theta[1] and indeed if this value changes when you reassign levels and is small in one instance or the other this can have the effects you mention...

You are focusing on a coefficient vector... I would suggest instead looking at the model fit...

But personally I don't know why people use these models... it is schizophrenic to be agnostic about one part of the model yet maintain correct specification of the other... in other words, to me the coefficients are simply nuisance parameters... I want to know how the model predicts the data and new draws from the same DGP... what you are focusing on to my way of thinking is smoke and mirrors... but that is my cynical pre-coffee morning perspective, it will get better after I take my meds ;-)

JeffreyRacine / R-Package-np