SantiagoD999 opened 1 week ago
Hi @SantiagoD999
thanks!
We realized that for random forests it's basically impossible to get numerically identical results in Python and R due to built-in randomness (sample splitting and the random draws made when growing the trees). As far as I remember, the differences are usually rather small and ignorable.
Could you try some of these things to see whether the differences are severe? For example, increasing the number of repetitions m (I think you are running around 100 reps?). Thank you!
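One thing worth trying first is pinning the RNG on both sides. This won't make R and Python agree with each other, but it makes each run reproducible so a single bad draw can be re-examined. A minimal sketch, assuming ranger's `seed` parameter is exposed through the mlr3 learner and that scikit-learn's `random_state` is respected inside DoubleML:

```r
# Reproducibility sketch: fix the RNG on each side independently.
set.seed(123)  # drives DoubleML's sample splitting (and data generation) in R
ml_g = lrn("regr.ranger", num.trees = 100, mtry = 10, min.node.size = 2,
           max.depth = 5, seed = 123)  # ranger's own RNG seed

py_code <- "
import numpy as np
np.random.seed(123)  # drives sample splitting on the Python side
from sklearn.ensemble import RandomForestRegressor
ml_g = RandomForestRegressor(n_estimators=100, max_features=10, max_depth=5,
                             min_samples_leaf=2, random_state=123)
"
```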
Thank you very much for your timely reply, Professor Bach. The issue with the R implementation is that sometimes the coefficient is extreme, while those extreme coefficients never appear in Python. I increased the number of observations, but the R implementation still produces somewhat more outliers. Here I post an example of the extreme coefficients that I obtain with R but not with Python (on the same data):
```r
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
library(reticulate)

lgr::get_logger("mlr3")$set_threshold("warn")

m <- 100
n <- 500
coef <- c()
py_coef <- c()
set.seed(1)

for (i in 1:m) {
  ml_g = lrn("regr.ranger", num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5)
  ml_m = lrn("classif.ranger", num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5)
  data = make_irm_data(theta = 0.5, n_obs = n, dim_x = 10, return_type = "data.table")
  obj_dml_data = DoubleMLData$new(data, y_col = "y", d_cols = "d")
  dml_irm_obj = DoubleMLIRM$new(obj_dml_data, ml_g, ml_m)
  coef[i] <- dml_irm_obj$fit()$coef

  py_code <- "
import numpy as np
import doubleml as dml
from doubleml.datasets import make_irm_data
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
ml_g = RandomForestRegressor(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)
ml_m = RandomForestClassifier(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)
obj_dml_data = dml.DoubleMLData(r.data, 'y', 'd')
dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)
dml_irm_obj.fit()
"
  py_run_string(py_code)
  py_coef[i] <- py$dml_irm_obj$coef
}

hist(coef)
hist(py_coef, add = TRUE, col = "blue")
```
```r
min(coef)
# [1] -2459643217   (EXTREMELY LOW)
min(py_coef)
# [1] -0.3364994
```
Thank you very much,
Yes, looks like a numerical instability... Would you mind adding the histogram and maybe repeating the simulations with a larger sample size or a higher number of folds for cross-fitting (like 10 or 20)? For the IRM, it could be that low propensity scores let the ATE estimator "explode". I don't know if there's a particular reason this only happens in R, but it could also be due to some random drawing. Did you try to vary the random seed?
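A sketch of both suggestions, assuming the R DoubleMLIRM constructor exposes `n_folds`, `trimming_rule` and `trimming_threshold`, and that `fit()` accepts `store_predictions`, as documented for DoubleML 1.0.x:

```r
# More folds plus explicit propensity trimming, then a quick diagnostic
# on the fitted propensity scores.
dml_irm_obj = DoubleMLIRM$new(obj_dml_data, ml_g, ml_m,
                              n_folds = 10,                # more cross-fitting folds
                              trimming_rule = "truncate",  # clip propensities away from 0 and 1
                              trimming_threshold = 0.01)
dml_irm_obj$fit(store_predictions = TRUE)
m_hat <- dml_irm_obj$predictions$ml_m   # cross-fitted propensity scores
range(m_hat)  # values very near 0 or 1 are what lets the ATE estimate explode
```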
Thank you very much for your reply. I increased the sample size from 500 to 1000 and the explosion of the coefficient is gone; here I post the histogram.
Best.
Thanks. Maybe it's better to align the number of bins etc. in the two histograms so they are directly comparable.
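Something like this (plain base R) puts both samples on a common set of breaks:

```r
# Shared breaks so the two histograms are directly comparable
breaks <- seq(min(coef, py_coef), max(coef, py_coef), length.out = 30)
hist(coef, breaks = breaks, col = rgb(1, 0, 0, 0.5),
     main = "R (red) vs Python (blue)", xlab = "coefficient")
hist(py_coef, breaks = breaks, col = rgb(0, 0, 1, 0.5), add = TRUE)
```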
Did you try changing the seed (only)? I'm wondering if it's just bad luck from drawing/splitting the samples.
Changing the seed while keeping n = 500, I sometimes still see the explosion of the coefficient in R but not in Python.
```r
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
library(reticulate)

lgr::get_logger("mlr3")$set_threshold("warn")

m <- 100
n <- 500
coef <- c()
py_coef <- c()
set.seed(3)

for (i in 1:m) {
  ml_g = lrn("regr.ranger", num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5)
  ml_m = lrn("classif.ranger", num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5)
  data = make_irm_data(theta = 0.5, n_obs = n, dim_x = 10, return_type = "data.table")
  obj_dml_data = DoubleMLData$new(data, y_col = "y", d_cols = "d")
  dml_irm_obj = DoubleMLIRM$new(obj_dml_data, ml_g, ml_m)
  coef[i] <- dml_irm_obj$fit()$coef

  py_code <- "
import numpy as np
import doubleml as dml
from doubleml.datasets import make_irm_data
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
ml_g = RandomForestRegressor(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)
ml_m = RandomForestClassifier(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)
obj_dml_data = dml.DoubleMLData(r.data, 'y', 'd')
dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)
dml_irm_obj.fit()
"
  py_run_string(py_code)
  py_coef[i] <- py$dml_irm_obj$coef
  print(i)
}

max(coef)
# [1] 4532077637
max(py_coef)
# [1] 0.9980934
```
Thank you very much.
Describe the bug
I have been using the excellent DoubleML package for R, and while running some simulations I have noticed that the "same" parametrization of a random forest model in R and Python produces different results. I am running a Monte Carlo simulation comparing both methods, and the R results seem more unstable than the Python ones.
I would greatly appreciate any comment or advice.
Thank you very much.
Minimum reproducible code snippet
```r
library(reticulate)
library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)

lgr::get_logger("mlr3")$set_threshold("warn")

py_code <- "
import numpy as np
import doubleml as dml
from doubleml.datasets import make_irm_data
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
ml_g = RandomForestRegressor(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)
ml_m = RandomForestClassifier(n_estimators=100, max_features=10, max_depth=5, min_samples_leaf=2)
data = make_irm_data(theta=0.5, n_obs=500, dim_x=10, return_type='DataFrame')
obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
dml_irm_obj = dml.DoubleMLIRM(obj_dml_data, ml_g, ml_m)
dml_irm_obj.fit()
"

ml_g = lrn("regr.ranger", num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5)
ml_m = lrn("classif.ranger", num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5)

m <- 100  # number of Monte Carlo repetitions
coef_py <- NULL
coef_r <- NULL
for (i in 1:m) {
  # Draw the data and estimate in Python, then reuse the same data in R
  py_run_string(py_code)
  coef_py[i] <- py$dml_irm_obj$coef
  print(i)

  obj_dml_data = DoubleMLData$new(py$data, y_col = "y", d_cols = "d")
  dml_irm_obj = DoubleMLIRM$new(obj_dml_data, ml_g, ml_m)
  coef_r[i] <- dml_irm_obj$fit()$coef
}
hist(coef_py)
hist(coef_r)
```
Expected Result
I would have expected an estimate close to the true value (theta = 0.5), which is what I obtain in Python.
Actual Result
The two implementations produce different histograms; perhaps this is due to the parametrization of the random forest in ranger versus scikit-learn. In R the estimate of theta is sometimes extremely large (even on the order of billions).
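As a side note on the parametrization hypothesis, here is an approximate mapping based on my reading of the ranger and scikit-learn documentation; treat it as a sketch rather than an exact equivalence:

```r
# Rough correspondence between the two learners (not an exact equivalence):
#   num.trees = 100    <-> n_estimators = 100
#   mtry = 10          <-> max_features = 10   (all features, since dim_x = 10)
#   max.depth = 5      <-> max_depth = 5
#   min.node.size = 2  <-> min_samples_leaf = 2  (related but not identical
#                          stopping rules: ranger stops splitting small nodes,
#                          sklearn requires each child leaf to keep >= 2 samples)
# Bootstrapping defaults should already match (ranger: replace = TRUE,
# sample.fraction = 1; sklearn: bootstrap=True, max_samples=None), but making
# them explicit rules them out as a source of the discrepancy:
ml_m = lrn("classif.ranger",
  num.trees = 100, mtry = 10, min.node.size = 2, max.depth = 5,
  replace = TRUE, sample.fraction = 1
)
```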
Versions
```
Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] reticulate_1.40.0  data.table_1.16.2  mlr3learners_0.8.0 mlr3_0.21.1
[5] DoubleML_1.0.1     ranger_0.17.0

loaded via a namespace (and not attached):
 [1] utf8_1.2.4              future_1.34.0           readstata13_0.10.1
 [4] shape_1.4.6.1           lattice_0.22-6          listenv_0.9.1
 [7] digest_0.6.37           evaluate_1.0.1          grid_4.4.1
[10] iterators_1.0.14        mvtnorm_1.3-2           fastmap_1.2.0
[13] foreach_1.5.2           jsonlite_1.8.9          Matrix_1.7-1
[16] glmnet_4.1-8            backports_1.5.0         survival_3.7-0
[19] fansi_1.0.6             gets_0.38               mlr3tuning_1.2.0
[22] codetools_0.2-20        mlr3measures_1.0.0      palmerpenguins_0.1.1
[25] cli_3.6.3               rlang_1.1.4             crayon_1.5.3
[28] parallelly_1.39.0       future.apply_1.11.3     splines_4.4.1
[31] yaml_2.3.10             tools_4.4.1             parallel_4.4.1
[34] uuid_1.2-1              checkmate_2.3.2         globals_0.16.3
[37] bbotk_1.3.0             vctrs_0.6.5             R6_2.5.1
[40] png_0.1-8               zoo_1.8-12              lifecycle_1.0.4
[43] MASS_7.3-61             mlr3misc_0.15.1         pillar_1.9.0
[46] glue_1.8.0              Rcpp_1.0.13-1           clusterGeneration_1.3.8
[49] lgr_0.4.4               paradox_1.0.1           xfun_0.49
[52] rstudioapi_0.17.1       knitr_1.49              htmltools_0.5.8.1
[55] rmarkdown_2.29          compiler_4.4.1
```