DoubleML / doubleml-for-r

DoubleML - Double Machine Learning in R
https://docs.doubleml.org

Misleading entries in evaluated score functions & predictions in case of estimation without cross-fitting (`apply_cross_fitting = FALSE`) #96

Closed MalteKurz closed 3 years ago

MalteKurz commented 3 years ago

Description

When a DoubleML model is estimated with `apply_cross_fitting = FALSE` and `n_folds = 2`, the evaluated score functions and the exported predictions contain misleading entries. For all indices in the test set, the entries are correct, and only these are used for estimating the causal parameter(s), etc. For all indices that are not part of the test set, however, the predictions are filled with zeros, and these zero predictions are later used when evaluating the score functions. The resulting entries in `psi`, `psi_a` and `psi_b` are never used, but in my view they are still misleading. For this case, I would propose filling the predictions and the evaluated score function values with NA instead of zeros and non-meaningful values, respectively.
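To make the proposal concrete, here is a minimal base-R sketch of the NA-filling idea. It is not the package's implementation: `fill_predictions` is a hypothetical helper, the test fold is made up, and the score components assume the partialling-out form `psi_a = -(d - m_hat)^2` and `psi_b = (d - m_hat) * (y - g_hat)`.

```r
# Hypothetical helper (not part of DoubleML): place test-set predictions into
# a full-length vector and leave every other position as NA instead of 0.
fill_predictions = function(n_obs, test_ids, preds_test) {
  stopifnot(length(test_ids) == length(preds_test))
  preds = rep(NA_real_, n_obs)
  preds[test_ids] = preds_test
  preds
}

set.seed(1)
n_obs = 10
y = rnorm(n_obs)
d = rnorm(n_obs)
test_ids = c(2, 3, 4, 6, 7, 10)  # assumed test fold, for illustration only

# Stand-ins for the nuisance predictions on the test fold
g_hat = fill_predictions(n_obs, test_ids, rnorm(length(test_ids)))
m_hat = fill_predictions(n_obs, test_ids, rnorm(length(test_ids)))

# Assumed partialling-out score components; NA propagates automatically
psi_a = -(d - m_hat)^2
psi_b = (d - m_hat) * (y - g_hat)

# Only test-set observations contribute to the estimate
theta_hat = -mean(psi_b, na.rm = TRUE) / mean(psi_a, na.rm = TRUE)
```

With this layout, anyone inspecting `psi_a`, `psi_b` or the predictions sees at a glance which observations were never evaluated, and the estimate is unchanged because those positions were not used anyway.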

Example

> library(DoubleML)
> library(mlr3)
> library(mlr3learners)
> ml_g = lrn("regr.ranger", num.trees = 10, max.depth = 2)
> ml_m = ml_g$clone()
> obj_dml_data = make_plr_CCDDHNR2018(alpha = 0.5)
> dml_plr_obj = DoubleMLPLR$new(obj_dml_data, ml_g, ml_m,
+                               n_folds=2, apply_cross_fitting = FALSE)
> dml_plr_obj$fit(store_predictions = TRUE)
> dml_plr_obj$predictions$ml_g[1:10,,]
 [1]  0.0000000  0.5718869  0.7672342  0.6698870  0.0000000  1.5471172  1.1006015  0.0000000  0.0000000
[10] -0.2258972
> dml_plr_obj$psi[1:10]
 [1] -0.5875342  0.8229460 -0.3105735  0.6203550  0.2614734  0.7999844  1.1656477  0.3464782 -0.6397427
[10]  0.7832788
> obj_dml_data$data$y[1:10]*obj_dml_data$data$d[1:10] == dml_plr_obj$psi_b[1:10]
 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
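
The `TRUE` entries in the last comparison line up with the zero-filled predictions (positions 1, 5, 8 and 9). A small base-R sketch of why that would happen, assuming the partialling-out form `psi_b = (d - m_hat) * (y - g_hat)`; the data here are synthetic and the variable names are illustrative only:

```r
set.seed(42)
n_obs = 10
y = rnorm(n_obs)
d = rnorm(n_obs)
zero_filled = c(1, 5, 8, 9)  # positions outside the test set in the output above

# Synthetic nuisance predictions, set to 0 outside the (assumed) test set
g_hat = rnorm(n_obs)
m_hat = rnorm(n_obs)
g_hat[zero_filled] = 0
m_hat[zero_filled] = 0

# Assumed partialling-out score component
psi_b = (d - m_hat) * (y - g_hat)

# With both predictions equal to 0, psi_b collapses to the raw product y * d
is_raw_product = (psi_b == y * d)
which(is_raw_product)
```

So the zero-filled positions do not look obviously wrong in `psi_b`; they just hold `y * d`, which is why NA would be a clearer marker.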

PhilippBach commented 3 years ago

Thanks for finding this bug. I agree, using NA instead of 0 is better here!