DoubleML / doubleml-for-r

DoubleML - Double Machine Learning in R
https://docs.doubleml.org

[Bug]: #207

Open SantiagoD999 opened 1 week ago

SantiagoD999 commented 1 week ago

Describe the bug

I have been using the excellent DoubleML package for R, and while running some simulations I noticed that the "same" parametrization of a random forest model in R and Python produces different results. I am running a Monte Carlo simulation comparing both implementations, and the R results seem more unstable than the Python ones.

I would greatly appreciate any comment or advice.

Thank you very much.

Minimum reproducible code snippet

```r
library(reticulate)
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
lgr::get_logger("mlr3")$set_threshold("warn")

py_code <- "
import numpy as np
import doubleml as dml
from doubleml.datasets import make_irm_data
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

ml_g = RandomForestRegressor(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)
ml_m = RandomForestClassifier(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)

data = make_irm_data(theta=0.5, n_obs=500, dim_x=10, return_type='DataFrame')
obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)
dml_irm_obj.fit()
"

ml_g = lrn("regr.ranger", num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5)
ml_m = lrn("classif.ranger", num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5)

m <- 100  # number of Monte Carlo replications (was undefined in the original snippet)
coef_py <- NULL
coef_r <- NULL
for (i in 1:m) {
  # Python: draw a new data set and fit DoubleMLIRM with scikit-learn forests
  py_run_string(py_code)
  coef_py[i] <- py$dml_irm_obj$coef
  print(i)

  # R: fit DoubleMLIRM with ranger forests on the same data
  obj_dml_data = DoubleMLData$new(py$data, y_col = "y", d_cols = "d")
  dml_irm_obj = DoubleMLIRM$new(obj_dml_data, ml_g, ml_m)
  coef_r[i] <- dml_irm_obj$fit()$coef
}
hist(coef_py)
hist(coef_r)
```

Expected Result

I would have expected an estimate close to the true value (theta = 0.5), which is what the Python implementation delivers.

Actual Result

The two implementations produce clearly different histograms; perhaps this is due to the parametrization of the random forest in ranger versus scikit-learn. In R the estimate of theta is sometimes extreme (on the order of millions).

[Histograms of the Python and R estimates]

Versions

```r
> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.1.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] reticulate_1.40.0  data.table_1.16.2  mlr3learners_0.8.0 mlr3_0.21.1
[5] DoubleML_1.0.1     ranger_0.17.0

loaded via a namespace (and not attached):
 [1] utf8_1.2.4              future_1.34.0           readstata13_0.10.1
 [4] shape_1.4.6.1           lattice_0.22-6          listenv_0.9.1
 [7] digest_0.6.37           evaluate_1.0.1          grid_4.4.1
[10] iterators_1.0.14        mvtnorm_1.3-2           fastmap_1.2.0
[13] foreach_1.5.2           jsonlite_1.8.9          Matrix_1.7-1
[16] glmnet_4.1-8            backports_1.5.0         survival_3.7-0
[19] fansi_1.0.6             gets_0.38               mlr3tuning_1.2.0
[22] codetools_0.2-20        mlr3measures_1.0.0      palmerpenguins_0.1.1
[25] cli_3.6.3               rlang_1.1.4             crayon_1.5.3
[28] parallelly_1.39.0       future.apply_1.11.3     splines_4.4.1
[31] yaml_2.3.10             tools_4.4.1             parallel_4.4.1
[34] uuid_1.2-1              checkmate_2.3.2         globals_0.16.3
[37] bbotk_1.3.0             vctrs_0.6.5             R6_2.5.1
[40] png_0.1-8               zoo_1.8-12              lifecycle_1.0.4
[43] MASS_7.3-61             mlr3misc_0.15.1         pillar_1.9.0
[46] glue_1.8.0              Rcpp_1.0.13-1           clusterGeneration_1.3.8
[49] lgr_0.4.4               paradox_1.0.1           xfun_0.49
[52] rstudioapi_0.17.1       knitr_1.49              htmltools_0.5.8.1
[55] rmarkdown_2.29          compiler_4.4.1

> packageVersion('DoubleML')
[1] ‘1.0.1’
> packageVersion('mlr3')
[1] ‘0.21.1’
```

PhilippBach commented 4 days ago

Hi @SantiagoD999

thanks!

We have found that for random forests it's basically impossible to get numerically identical results in Python and R because of the built-in randomness (random sample splitting and random draws of observations and features). As far as I remember, the differences are usually small enough to ignore.
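The point about built-in randomness can be illustrated without either library: the bootstrap resamples drawn inside a random forest depend on the RNG state, so two seeds yield different trees even with identical hyperparameters. A minimal pure-Python sketch (hand-rolled for illustration, not ranger or scikit-learn code):

```python
import random

def bootstrap_indices(n, seed):
    """Indices of one bootstrap resample, as a random forest would draw
    before growing a tree."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]

# Identical "hyperparameters" (n = 500), different RNG seeds:
sample_a = bootstrap_indices(500, seed=1)
sample_b = bootstrap_indices(500, seed=2)
print(sample_a == sample_b)  # False: each forest sees different resamples
```

Since R and Python use entirely different RNG streams, the two implementations can never be seeded to see the same resamples, only compared in distribution.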

Could you run a few checks to see whether the differences are actually severe?

Thank you!

SantiagoD999 commented 4 days ago

Thank you very much for your timely reply, Professor Bach. The issue with the R implementation is that the coefficient is sometimes extreme, while those extreme coefficients never appear in Python. I increased the number of observations, but the R implementation still produces somewhat more outliers. Here is an example of the extreme coefficients I obtain with R but not with Python (on the same data):

```r
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
library(reticulate)
lgr::get_logger("mlr3")$set_threshold("warn")

py_code <- "
import numpy as np
import doubleml as dml
from doubleml.datasets import make_irm_data
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

ml_g = RandomForestRegressor(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)
ml_m = RandomForestClassifier(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)

obj_dml_data = dml.DoubleMLData(r.data, 'y', 'd')
dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)
dml_irm_obj.fit()
"

m <- 100
n <- 500
coef <- c()
py_coef <- c()
set.seed(1)
for (i in 1:m) {
  ml_g = lrn("regr.ranger", num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5)
  ml_m = lrn("classif.ranger", num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5)

  # R: draw the data and fit DoubleMLIRM with ranger
  data = make_irm_data(theta = 0.5, n_obs = n, dim_x = 10, return_type = "data.table")
  obj_dml_data = DoubleMLData$new(data, y_col = "y", d_cols = "d")
  dml_irm_obj = DoubleMLIRM$new(obj_dml_data, ml_g, ml_m)
  coef[i] <- dml_irm_obj$fit()$coef

  # Python: fit DoubleMLIRM with scikit-learn on the same data (via r.data)
  py_run_string(py_code)
  py_coef[i] <- py$dml_irm_obj$coef
}
hist(coef)
hist(py_coef, add = TRUE, col = "blue")
```

```r
> min(coef)
[1] -2459643217   # extremely low
> min(py_coef)
[1] -0.3364994
```

Thank you very much,

PhilippBach commented 3 days ago

Yes, this looks like a numerical instability... Would you mind adding the histogram and maybe repeating the simulations with a larger sample size or a higher number of folds for cross-fitting (like 10 or 20)? For the IRM, low propensity scores can make the ATE estimator "explode". I don't know whether there is a particular reason this only happens in R; it could also be due to random draws. Did you try varying the random seed?
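To see why low propensity scores are the likely culprit: the IRM score for the ATE contains a term of the form D (Y - g(1, X)) / m(X), so an estimated propensity m(X) near zero inflates a single observation's contribution by orders of magnitude. Below is a minimal hand-rolled sketch of this mechanism (illustration only, not DoubleML code); if I recall correctly, both the Python and R DoubleMLIRM classes also accept a `trimming_threshold` argument that implements exactly this kind of truncation:

```python
def ipw_term(y, g1, d, m_hat):
    """One observation's inverse-propensity-weighted contribution,
    d * (y - g1) / m_hat, as it appears in the IRM (ATE) score."""
    return d * (y - g1) / m_hat

def trim(m_hat, threshold=0.01):
    """'Truncate' trimming rule: clip the propensity away from 0 and 1."""
    return min(max(m_hat, threshold), 1.0 - threshold)

# A treated unit (d = 1) with outcome residual y - g1 = 0.3:
print(ipw_term(1.3, 1.0, 1, 0.5))         # well-behaved propensity
print(ipw_term(1.3, 1.0, 1, 1e-7))        # near-zero propensity: term explodes
print(ipw_term(1.3, 1.0, 1, trim(1e-7)))  # trimmed at 0.01: bounded again
```

Under this reading, a larger sample or more folds helps because the propensity estimates stabilize, while trimming bounds the damage any single near-zero m(X) can do.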

SantiagoD999 commented 3 days ago

Thank you very much for your reply. I increased the sample size from 500 to 1000 and the coefficient no longer explodes; here is the histogram.

[Histogram of the estimates with n = 1000]

Best.

PhilippBach commented 2 days ago

Thanks. It might be better to align the number of bins etc. across the two histograms.

Did you try changing only the seed? I'm wondering whether it's just bad luck in drawing/splitting the samples.

SantiagoD999 commented 2 days ago

Changing the seed while keeping n = 500, I sometimes still see the coefficient explode in R but not in Python.

```r
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
library(reticulate)
lgr::get_logger("mlr3")$set_threshold("warn")

py_code <- "
import numpy as np
import doubleml as dml
from doubleml.datasets import make_irm_data
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

ml_g = RandomForestRegressor(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)
ml_m = RandomForestClassifier(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)

obj_dml_data = dml.DoubleMLData(r.data, 'y', 'd')
dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)
dml_irm_obj.fit()
"

m <- 100
n <- 500
coef <- c()
py_coef <- c()
set.seed(3)
for (i in 1:m) {
  ml_g = lrn("regr.ranger", num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5)
  ml_m = lrn("classif.ranger", num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5)

  # R: draw the data and fit DoubleMLIRM with ranger
  data = make_irm_data(theta = 0.5, n_obs = n, dim_x = 10, return_type = "data.table")
  obj_dml_data = DoubleMLData$new(data, y_col = "y", d_cols = "d")
  dml_irm_obj = DoubleMLIRM$new(obj_dml_data, ml_g, ml_m)
  coef[i] <- dml_irm_obj$fit()$coef

  # Python: fit DoubleMLIRM with scikit-learn on the same data (via r.data)
  py_run_string(py_code)
  py_coef[i] <- py$dml_irm_obj$coef
  print(i)
}
```

[Histograms of the R and Python estimates with seed 3]

```r
> max(coef)
[1] 4532077637
> max(py_coef)
[1] 0.9980934
```

Thank you very much.