DoubleML / doubleml-for-py

DoubleML - Double Machine Learning in Python
https://docs.doubleml.org
BSD 3-Clause "New" or "Revised" License

[Bug]: type casting outcome_variable and treatment_variable(s) #232

Open hjk612 opened 6 months ago

hjk612 commented 6 months ago

Describe the bug

This is more of a nitpick :) There seems to be an implicit assumption that the outcome_variable and treatment_variable(s) are of type float. If we pass DoubleMLData a dataframe in which those variables are of type Decimal, the partialling out step fails with the error shown below. This comes up especially when reading parquet files.

TypeError                                 Traceback (most recent call last)
Cell In[36], line 1
----> 1 dml_plr.fit(n_jobs_cv = -1)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/doubleml/double_ml.py:605, in DoubleML.fit(self, n_jobs_cv, store_predictions, external_predictions, store_models)
    602         ext_prediction_dict[learner] = None
    604 # ml estimation of nuisance models and computation of score elements
--> 605 score_elements, preds = self._nuisance_est(self.__smpls, n_jobs_cv,
    606                                            external_predictions=ext_prediction_dict,
    607                                            return_models=store_models)
    609 self._set_score_elements(score_elements, self._i_rep, self._i_treat)
    611 # calculate rmses and store predictions and targets of the nuisance models

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/doubleml/double_ml_plr.py:231, in DoubleMLPLR._nuisance_est(self, smpls, n_jobs_cv, external_predictions, return_models)
    226     g_hat = {'preds': external_predictions['ml_g'],
    227              'targets': None,
    228              'models': None}
    229 else:
    230     # get an initial estimate for theta using the partialling out score
--> 231     psi_a = -np.multiply(d - m_hat['preds'], d - m_hat['preds'])
    232     psi_b = np.multiply(d - m_hat['preds'], y - l_hat['preds'])
    233     theta_initial = -np.nanmean(psi_b) / np.nanmean(psi_a)

TypeError: unsupported operand type(s) for -: 'decimal.Decimal' and 'float'
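
The failing subtraction can be isolated without DoubleML. The following is a minimal sketch, assuming the Decimal column reaches NumPy as an object-dtype array of `decimal.Decimal` values; the names `d` and `m_hat_preds` are illustrative stand-ins for the treatment values and `m_hat['preds']` in the traceback, not the library's actual objects.

```python
from decimal import Decimal

import numpy as np

# Decimal values end up in a NumPy array with dtype=object
# (assumption: this mirrors a Decimal-typed dataframe column).
d = np.array([Decimal("1.0"), Decimal("0.0"), Decimal("1.0")])

# Stand-in for m_hat['preds']: scikit-learn predictions are float arrays.
m_hat_preds = np.array([0.7, 0.2, 0.9])

try:
    d - m_hat_preds  # element-wise Decimal - float is not defined
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Casting to float first makes the subtraction well defined.
residuals = d.astype(float) - m_hat_preds
print(residuals.dtype)
```

Python's `Decimal` deliberately refuses arithmetic with `float`, so the error arises as soon as float predictions meet Decimal targets, independently of which learner produced the predictions.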

Minimum reproducible code snippet

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from doubleml import DoubleMLData, DoubleMLPLR

# Parquet file whose outcome and treatment columns are of type Decimal
df = pd.read_parquet("/...")

x_cols = [x for x in df.columns if "pre_" in x]
d_col = "event_action"
y_col = "post_outcome"

dml_data = DoubleMLData(df, y_col=y_col, d_cols=d_col, x_cols=x_cols)

learner = RandomForestRegressor(n_jobs=-1)
lasso = LassoCV()
dml_plr = DoubleMLPLR(dml_data, ml_l=learner, ml_g=learner, ml_m=lasso, score="IV-type", n_folds=2)
dml_plr.fit(n_jobs_cv=-1)  # raises the TypeError shown above

Expected Result

Model should fit successfully.

Actual Result

Fitting fails with the same traceback as shown under "Describe the bug", ending in:

TypeError: unsupported operand type(s) for -: 'decimal.Decimal' and 'float'

Versions

Linux-5.10.205-195.807.amzn2.x86_64-x86_64-with-glibc2.26
Python 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0]
DoubleML 0.7.1
Scikit-Learn 1.3.2

SvenKlaassen commented 6 months ago

Thank you for highlighting this. The predictions created by sklearn are of type float, so subtracting them from Decimal-typed values in the partialling out step fails. I will try to add casting of the outcome and treatment variables.
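
Until such casting exists in the library, a user-side workaround is to convert the affected columns before building the data object. This is a sketch only: `cast_to_float` is a hypothetical helper, not part of DoubleML, and the small dataframe is made-up data that reuses the column names from the report.

```python
from decimal import Decimal

import pandas as pd


def cast_to_float(df, cols):
    """Return a copy of df with the given columns cast to float64.

    Hypothetical helper for illustration; not a DoubleML function.
    """
    return df.astype({col: "float64" for col in cols})


# Made-up data: Decimal-typed outcome and treatment, float covariate.
df = pd.DataFrame({
    "post_outcome": [Decimal("1.25"), Decimal("3.50"), Decimal("2.00")],
    "event_action": [Decimal("1"), Decimal("0"), Decimal("1")],
    "pre_feature": [0.1, 0.2, 0.3],
})

df_float = cast_to_float(df, ["post_outcome", "event_action"])
print(df_float.dtypes)
```

The converted frame can then be passed to DoubleMLData in place of the original one. Note that converting Decimal to float may lose precision for values that are not exactly representable in binary floating point.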