MartinSpindler / hdm


[documentation / examples] - numeric values of examples using growth data have slightly changed with latest versions #9

Closed by timothee-clemessy 1 year ago

timothee-clemessy commented 1 year ago

While attempting to replicate the examples in Python, I realized that the code for the GrowthData example on p. 17 of the arXiv paper does not return the same values with the latest versions of the package.

With version 0.1.0, the results match those displayed in the paper:

library('devtools')
install_version("hdm", version = "0.1.0", repos = "http://cran.us.r-project.org")
library('hdm')
data(GrowthData)

y = GrowthData[, 1, drop = F]
d = GrowthData[, 3, drop = F]
X = as.matrix(GrowthData)[, -c(1, 2, 3)]
varnames = colnames(GrowthData)
xnames = varnames[-c(1, 2, 3)] # names of X variables
dandxnames = varnames[-c(1, 2)] # names of D and X variables
# create formulas by pasting names (this saves typing times)
fmla = as.formula(paste("Outcome ~ ", paste(dandxnames, collapse = "+")))
ls.effect = lm(fmla, data = GrowthData)

dX = as.matrix(cbind(d, X))
lasso.effect = rlassoEffect(x = X, y = y, d=d, method = "partialling out")
summary(lasso.effect)

dX = as.matrix(cbind(d, X))
doublesel.effect = rlassoEffect(x = X, y = y, d = d, method = "double selection")
summary(doublesel.effect)

library(xtable)
table = rbind(summary(ls.effect)$coef["gdpsh465", 1:2], summary(lasso.effect)$coef[,1:2], summary(doublesel.effect)$coef[, 1:2])
colnames(table) = c("Estimate", "Std. Error")
rownames(table) = c("full reg via ols", "partial reg via post-lasso", "partial reg via double selection")
print(table)

With the newest version, they differ:

install.packages('hdm')
library('hdm')
packageVersion("hdm")
data(GrowthData)

y = GrowthData[, 1, drop = F]
d = GrowthData[, 3, drop = F]
X = as.matrix(GrowthData)[, -c(1, 2, 3)]
varnames = colnames(GrowthData)
xnames = varnames[-c(1, 2, 3)] # names of X variables
dandxnames = varnames[-c(1, 2)] # names of D and X variables
# create formulas by pasting names (this saves typing times)
fmla = as.formula(paste("Outcome ~ ", paste(dandxnames, collapse = "+")))
ls.effect = lm(fmla, data = GrowthData)

dX = as.matrix(cbind(d, X))
lasso.effect = rlassoEffect(x = X, y = y, d=d, method = "partialling out")
summary(lasso.effect)

dX = as.matrix(cbind(d, X))
doublesel.effect = rlassoEffect(x = X, y = y, d = d, method = "double selection")
summary(doublesel.effect)

library(xtable)
table = rbind(summary(ls.effect)$coef["gdpsh465", 1:2], summary(lasso.effect)$coef[,1:2], summary(doublesel.effect)$coef[, 1:2])
colnames(table) = c("Estimate", "Std. Error")
rownames(table) = c("full reg via ols", "partial reg via post-lasso", "partial reg via double selection")
print(table)

Results :

[1] "Estimates and significance testing of the effect of target variables"
         Estimate. Std. Error t value Pr(>|t|)   
gdpsh465  -0.05001    0.01579  -3.167  0.00154 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                                     Estimate Std. Error
full reg via ols                 -0.009377989 0.02988773
partial regvia post-lasso        -0.049811465 0.01393636
partial reg via double selection -0.050005855 0.01579138

So 0.045 has become 0.050.

I've quickly checked the change logs of the latest versions, and a lot has changed in the code base since the very first version, so the difference likely comes from an improvement in the default parameters and/or the package, not from a bug.
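For anyone trying to narrow this down, one way to probe the default-parameter hypothesis would be to pass explicit penalty settings to rlassoEffect() and see which configuration reproduces the old numbers. This is only a hedged sketch: the names inside the penalty list follow the current rlasso() documentation and may well have changed between versions, and I have not verified which setting matches version 0.1.0.

```r
library(hdm)
data(GrowthData)

y = GrowthData[, 1, drop = FALSE]
d = GrowthData[, 3, drop = FALSE]
X = as.matrix(GrowthData)[, -c(1, 2, 3)]

# Override selected penalty defaults; the list entries below follow the
# documented penalty argument of rlasso() and are illustrative only --
# older versions may have used different names or default values.
eff = rlassoEffect(
  x = X, y = y, d = d,
  method = "double selection",
  penalty = list(homoscedastic = FALSE, c = 1.1)
)
summary(eff)
```

Varying c (and the other penalty components) one at a time against the pinned 0.1.0 results should reveal which default changed.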

Though not really an issue per se, I thought I'd let you know and post it here in case someone somewhere on the Internet is scratching their head over this, or in case you were wondering whether it might be useful to update the documentation examples.

Best regards and thanks a lot for the package,

Timothée

PhilippBach commented 1 year ago

Dear @timothee-clemessy ,

thanks a lot for letting us know! I agree that it's probably due to changes in the package. The papers available from CRAN already show the updated results. I guess we have to update the arXiv version.

One question: did you try to re-implement hdm in Python yourself, or did you play around with DoubleML for Python?

Best,

Philipp

timothee-clemessy commented 1 year ago
PhilippBach commented 1 year ago

Thanks @timothee-clemessy for your response.

I think you are right on this. Maybe one thing to add:

In general, it should be possible to use the rlasso() learner inside DoubleML. Of course, the algorithm would differ a bit from the double selection lasso algorithm, but I believe the difference would be rather small. The missing piece here is the integration of rlasso() as a learner in the mlr3extralearners package: https://mlr3extralearners.mlr-org.com/. This is something we have had on our list since we started DoubleML but unfortunately haven't implemented yet. It should be feasible with manageable effort, I guess. Using rlasso() would then make it possible to estimate the causal parameter without sample splitting (because of its theoretical properties).
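A minimal sketch of what such an integration could look like, assuming the current mlr3 custom-learner conventions (R6 class inheriting from LearnerRegr with private .train/.predict methods). The class and method names here follow the mlr3 documentation; this is an illustrative wrapper, not the actual mlr3extralearners implementation:

```r
library(mlr3)
library(R6)

# Hypothetical mlr3 regression learner wrapping hdm::rlasso().
LearnerRegrRlasso = R6Class("LearnerRegrRlasso",
  inherit = LearnerRegr,
  public = list(
    initialize = function() {
      super$initialize(
        id = "regr.rlasso",
        feature_types = c("numeric", "integer"),
        predict_types = "response",
        packages = "hdm"
      )
    }
  ),
  private = list(
    .train = function(task) {
      # Fit rlasso on the task's feature matrix and target vector.
      X = as.matrix(task$data(cols = task$feature_names))
      y = task$data(cols = task$target_names)[[1]]
      hdm::rlasso(x = X, y = y)
    },
    .predict = function(task) {
      # Predict with the stored rlasso model on new features.
      X = as.matrix(task$data(cols = task$feature_names))
      list(response = as.numeric(predict(self$model, newdata = X)))
    }
  )
)
```

Such a learner could then in principle be plugged into a DoubleML model like any other mlr3 learner, though the rigorous-penalty theory that makes sample splitting unnecessary applies to the rlasso-based algorithms themselves, not automatically to the cross-fitted DoubleML procedure.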

As I think there is nothing left to be done in the package, I'll close this issue.