DoubleML / doubleml-for-py

DoubleML - Double Machine Learning in Python
https://docs.doubleml.org
BSD 3-Clause "New" or "Revised" License

Attributes of nuisance functions #119

Closed prateekgv closed 1 year ago

prateekgv commented 2 years ago

Is it possible to access the attributes of the nuisance functions? For example, if the nuisance function is a RandomForestRegressor, then the sklearn package allows one to access the attributes such as estimators_, feature_importances_ etc. Attributes like feature_importances_ can perhaps help identify the confounding variables in the model.

PhilippBach commented 2 years ago

Dear @prateekgv ,

Thanks for your interest in our package. We agree that accessing the model attributes is a relevant feature for users who want additional diagnostics. The current version of DoubleML doesn't support exporting these attributes, though. We will discuss this feature request at the next occasion and let you know about any changes. We'll leave this issue open until we have agreed on an implementation. If you make some changes yourself, we would appreciate a PR!

Once more, thank you!

Best,

Philipp

SvenKlaassen commented 1 year ago

The current version of DoubleML can store the models trained during cross-fitting. This small example shows how to access them.

import numpy as np
import doubleml as dml
from doubleml.datasets import make_plr_CCDDHNR2018
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import clone

learner = RandomForestRegressor(n_estimators=100, max_features=20, max_depth=5, min_samples_leaf=2)

ml_l = clone(learner)
ml_m = clone(learner)

np.random.seed(42)
data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=500, dim_x=20, return_type='DataFrame')

obj_dml_data = dml.DoubleMLData(data, 'y', 'd')
dml_plr_obj = dml.DoubleMLPLR(obj_dml_data, ml_l, ml_m)

dml_plr_obj.fit(store_models=True)

dml_plr_obj.models

This results in the following output:

{'ml_l': {'d': [[RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2),
    RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2),
    RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2),
    RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2),
    RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)]]},
 'ml_m': {'d': [[RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2),
    RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2),
    RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2),
    RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2),
    RandomForestRegressor(max_depth=5, max_features=20, min_samples_leaf=2)]]}}
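The nesting in this output is learner, then treatment variable, then repetition, then fold. As a minimal sketch of how one might walk that structure: `iter_fitted_models` is a hypothetical helper (not part of DoubleML), and the stand-in dictionary below only mimics the shape of the output above, with strings in place of fitted estimators.

```python
def iter_fitted_models(models):
    """Yield (learner, treatment, rep, fold, model) for every stored model.

    Assumes the nesting seen in the output above:
    models[learner][treatment][rep][fold] -> fitted estimator.
    """
    for learner, by_treatment in models.items():
        for treatment, reps in by_treatment.items():
            for i_rep, folds in enumerate(reps):
                for i_fold, model in enumerate(folds):
                    yield learner, treatment, i_rep, i_fold, model


# Stand-in with the same shape as dml_plr_obj.models
# (strings instead of fitted estimators, two folds instead of five).
demo_models = {
    "ml_l": {"d": [["l_fold0", "l_fold1"]]},
    "ml_m": {"d": [["m_fold0", "m_fold1"]]},
}

rows = list(iter_fitted_models(demo_models))
for learner, treatment, i_rep, i_fold, model in rows:
    print(learner, treatment, i_rep, i_fold, model)
```

With the real object, one would pass `dml_plr_obj.models` after calling `fit(store_models=True)`.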

Note that the estimation involves many fitted models (one per learner, treatment, repetition and fold), so e.g. the feature_importances_ of a single model can be accessed via

dml_plr_obj.models['ml_l']['d'][0][0].feature_importances_

where one has to specify the learner (here ml_l), the treatment variable (here d), the repetition index (only relevant if n_rep is greater than 1) and the fold index.

Output:

array([0.61409976, 0.02077143, 0.04488945, 0.01676835, 0.01991063,
       0.03105536, 0.02633215, 0.02430967, 0.01446739, 0.01645629,
       0.01145071, 0.02729037, 0.01306299, 0.02018805, 0.02620404,
       0.01579891, 0.01091846, 0.01715312, 0.01666732, 0.01220554])
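These importances come from a single fold. If a summary across folds is wanted, one option is to average over all repetitions and folds. The following is only a sketch of that idea, not a DoubleML feature: `mean_feature_importances` is a hypothetical helper, it assumes the models[learner][treatment][rep][fold] nesting, and the stand-in objects with made-up numbers replace fitted forests to show the mechanics.

```python
from types import SimpleNamespace

import numpy as np


def mean_feature_importances(models, learner, treatment):
    """Average feature_importances_ over all repetitions and folds.

    Assumes models[learner][treatment][rep][fold] is a fitted estimator
    exposing a feature_importances_ array (e.g. a random forest).
    """
    stacked = np.vstack([
        model.feature_importances_
        for folds in models[learner][treatment]
        for model in folds
    ])
    # One row per fitted model, one column per feature.
    return stacked.mean(axis=0)


# Stand-in objects with made-up importances (illustration only).
demo_models = {
    "ml_l": {"d": [[
        SimpleNamespace(feature_importances_=np.array([0.6, 0.4])),
        SimpleNamespace(feature_importances_=np.array([0.8, 0.2])),
    ]]}
}

avg = mean_feature_importances(demo_models, "ml_l", "d")
print(avg)
```

With the fitted object from the example, the call would be `mean_feature_importances(dml_plr_obj.models, 'ml_l', 'd')`.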

I hope this clarifies how to access attributes of the fitted models.