MartinSpindler / hdm


[documentation / examples] - numeric values of examples using growth data have slightly changed with latest versions #9

Closed by timothee-clemessy 1 year ago

timothee-clemessy commented 1 year ago

While attempting to replicate the examples in Python, I realized that the code for the GrowthData example on p. 17 of the arXiv paper does not return the same values with the latest versions of the package.

With version 0.1.0, the results match those displayed in the paper:

library('devtools')
install_version("hdm", version = "0.1.0", repos = "http://cran.us.r-project.org")
library('hdm')
data(GrowthData)

y = GrowthData[, 1, drop = F]
d = GrowthData[, 3, drop = F]
X = as.matrix(GrowthData)[, -c(1, 2, 3)]
varnames = colnames(GrowthData)
xnames = varnames[-c(1, 2, 3)] # names of X variables
dandxnames = varnames[-c(1, 2)] # names of D and X variables
# create formulas by pasting names (this saves typing times)
fmla = as.formula(paste("Outcome ~ ", paste(dandxnames, collapse = "+")))
ls.effect = lm(fmla, data = GrowthData)

dX = as.matrix(cbind(d, X))
lasso.effect = rlassoEffect(x = X, y = y, d=d, method = "partialling out")
summary(lasso.effect)

dX = as.matrix(cbind(d, X))
doublesel.effect = rlassoEffect(x = X, y = y, d = d, method = "double selection")
summary(doublesel.effect)

library(xtable)
table = rbind(summary(ls.effect)$coef["gdpsh465", 1:2], summary(lasso.effect)$coef[,1:2], summary(doublesel.effect)$coef[, 1:2])
colnames(table) = c("Estimate", "Std. Error")
rownames(table) = c("full reg via ols", "partial reg via post-lasso", "partial reg via double selection")
print(table)

With the newest version, they differ:

install.packages('hdm')
library('hdm')
packageVersion("hdm")
data(GrowthData)

y = GrowthData[, 1, drop = F]
d = GrowthData[, 3, drop = F]
X = as.matrix(GrowthData)[, -c(1, 2, 3)]
varnames = colnames(GrowthData)
xnames = varnames[-c(1, 2, 3)] # names of X variables
dandxnames = varnames[-c(1, 2)] # names of D and X variables
# create formulas by pasting names (this saves typing times)
fmla = as.formula(paste("Outcome ~ ", paste(dandxnames, collapse = "+")))
ls.effect = lm(fmla, data = GrowthData)

dX = as.matrix(cbind(d, X))
lasso.effect = rlassoEffect(x = X, y = y, d=d, method = "partialling out")
summary(lasso.effect)

dX = as.matrix(cbind(d, X))
doublesel.effect = rlassoEffect(x = X, y = y, d = d, method = "double selection")
summary(doublesel.effect)

library(xtable)
table = rbind(summary(ls.effect)$coef["gdpsh465", 1:2], summary(lasso.effect)$coef[,1:2], summary(doublesel.effect)$coef[, 1:2])
colnames(table) = c("Estimate", "Std. Error")
rownames(table) = c("full reg via ols", "partial reg via post-lasso", "partial reg via double selection")
print(table)

Results :

[1] "Estimates and significance testing of the effect of target variables"
         Estimate. Std. Error t value Pr(>|t|)   
gdpsh465  -0.05001    0.01579  -3.167  0.00154 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                                     Estimate Std. Error
full reg via ols                 -0.009377989 0.02988773
partial regvia post-lasso        -0.049811465 0.01393636
partial reg via double selection -0.050005855 0.01579138

So 0.045 has become 0.050.

I've quickly checked the change logs of the latest versions, and a lot has changed in the code base since the very first version, so the difference likely comes from an improvement in the default parameters and/or the package, not from a bug.
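For anyone trying to narrow this down, one way to probe the default-parameter hypothesis would be to pass explicit penalty settings to rlassoEffect() and see which configuration reproduces the old numbers. This is only a hedged sketch: the names inside the penalty list follow the current rlasso() documentation and may well have changed between versions, and I have not verified which setting matches version 0.1.0.

```r
library(hdm)
data(GrowthData)

y = GrowthData[, 1, drop = FALSE]
d = GrowthData[, 3, drop = FALSE]
X = as.matrix(GrowthData)[, -c(1, 2, 3)]

# Override selected penalty defaults; the list entries below follow the
# documented penalty argument of rlasso() and are illustrative only --
# older versions may have used different names or default values.
eff = rlassoEffect(
  x = X, y = y, d = d,
  method = "double selection",
  penalty = list(homoscedastic = FALSE, c = 1.1)
)
summary(eff)
```

Varying c (and the other penalty components) one at a time against the pinned 0.1.0 results should reveal which default changed.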

Though not really an issue per se, I thought I'd let you know and post it here in case someone somewhere on the Internet is scratching their head over this, or in case you were wondering whether it might be useful to update the documentation examples.

Best regards and thanks a lot for the package,

Timothée

PhilippBach commented 1 year ago

Dear @timothee-clemessy ,

thanks a lot for letting us know! I agree that it's probably due to changes in the package. The papers available from CRAN already show the updated results. I guess we have to update the arXiv version.

One question: did you try to re-implement hdm in Python yourself, or did you play around with DoubleML for Python?

Best,

Philipp

timothee-clemessy commented 1 year ago
PhilippBach commented 1 year ago

Thanks @timothee-clemessy for your response.

I think you are right on this. Maybe one thing to add:

In general, it should be possible to use the rlasso() learner inside DoubleML. Of course, the algorithm would differ a bit from the double selection lasso algorithm, but I believe the difference would be rather small. The missing piece here is the integration of rlasso() as a learner in the mlr3extralearners package: https://mlr3extralearners.mlr-org.com/. This is something we have had on our list since we started DoubleML but unfortunately haven't implemented yet. It should be feasible with manageable effort, I guess. Using rlasso() would then make it possible to estimate the causal parameter without sample splitting (because of its theoretical properties).
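A minimal sketch of what such an integration could look like, assuming the current mlr3 custom-learner conventions (R6 class inheriting from LearnerRegr with private .train/.predict methods). The class and method names here follow the mlr3 documentation; this is an illustrative wrapper, not the actual mlr3extralearners implementation:

```r
library(mlr3)
library(R6)

# Hypothetical mlr3 regression learner wrapping hdm::rlasso().
LearnerRegrRlasso = R6Class("LearnerRegrRlasso",
  inherit = LearnerRegr,
  public = list(
    initialize = function() {
      super$initialize(
        id = "regr.rlasso",
        feature_types = c("numeric", "integer"),
        predict_types = "response",
        packages = "hdm"
      )
    }
  ),
  private = list(
    .train = function(task) {
      # Fit rlasso on the task's feature matrix and target vector.
      X = as.matrix(task$data(cols = task$feature_names))
      y = task$data(cols = task$target_names)[[1]]
      hdm::rlasso(x = X, y = y)
    },
    .predict = function(task) {
      # Predict with the stored rlasso model on new features.
      X = as.matrix(task$data(cols = task$feature_names))
      list(response = as.numeric(predict(self$model, newdata = X)))
    }
  )
)
```

Such a learner could then in principle be plugged into a DoubleML model like any other mlr3 learner, though the rigorous-penalty theory that makes sample splitting unnecessary applies to the rlasso-based algorithms themselves, not automatically to the cross-fitted DoubleML procedure.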

As I think there is nothing left to be done in the package, I'll close this issue.