DoubleML / doubleml-for-py

DoubleML - Double Machine Learning in Python
https://docs.doubleml.org
BSD 3-Clause "New" or "Revised" License

Estimators enforcing no null confounder values #120

Closed morelandjs closed 2 years ago

morelandjs commented 2 years ago

I tried using the package with XGBoost to estimate the ml_g and ml_m terms. Missing values are no problem for XGBoost, which empirically learns the correct branch direction for nulls at each split. Indeed, XGBoost is commonly used to estimate propensity scores when the confounding features are potentially missing/null.

Unfortunately, the doubleml package calls sklearn.utils.check_X_y, which by default throws an error if there are missing values in the confounders X. This occurs several times in https://github.com/DoubleML/doubleml-for-py/blob/master/doubleml/double_ml_plr.py

It seems the fix is as simple as allowing users to pass the check_X_y kwarg force_all_finite=False; see https://scikit-learn.org/stable/modules/generated/sklearn.utils.check_X_y.html. Once this is changed, I've found that an XGBoost model with missing values runs without issue. Naturally, the response variable y still cannot have null values.
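A minimal sketch of the proposed relaxation (note that scikit-learn 1.6+ deprecates force_all_finite in favor of ensure_all_finite, so the exact kwarg depends on the installed version):

```python
import numpy as np
from sklearn.utils import check_X_y

X = np.array([[1.0, np.nan], [2.0, 3.0], [4.0, 5.0]])
y = np.array([0.0, 1.0, 0.0])

# Default check: missing values in X raise a ValueError.
try:
    check_X_y(X, y)
    default_passes = True
except ValueError:
    default_passes = False

# Relaxed check: NaN/inf are allowed in the confounders X.
X_checked, y_checked = check_X_y(X, y, force_all_finite=False)

print(default_passes)                   # False
print(bool(np.isnan(X_checked).any()))  # True
```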

MalteKurz commented 2 years ago

Hi @morelandjs,

Thanks for your feedback. I agree that the checks are too strict in this regard, and it would obviously be great to allow XGBoost learners in the scenario you describe.

To be honest, I am struggling a bit with passing arguments through to the check_X_y() method. Over the course of the double machine learning algorithm we call several utility functions, and I am a bit afraid that the API quickly becomes messy if we add such pass-through options for all of them.

My proposal would be the following:

morelandjs commented 2 years ago

> As proposed, we could call check_X_y() with force_all_finite=False. Note that this only allows NA's and infinite values for the confounders X but not for the target variables (in the DoubleML case y and d): https://github.com/scikit-learn/scikit-learn/blob/2beed55847ee70d363bdbfe14ee4401438fba057/sklearn/utils/validation.py#L814.

This seems reasonable to me. If an estimator cannot process null values, the estimator itself should fail or throw a warning. It does not appear to be a requirement of DoubleML itself, although I am no expert.

> Important for the DoubleML use case are finite values for the evaluated score functions. This usually means that we need finite values for the response variable y, the target variable d and the predictions obtained from the nuisance estimation. --> I will add checks accordingly, but need to go through all the different settings for sample splitting.

This may already be the behavior for force_all_finite=False. I think it only relaxes the constraint on X. From the source code docstring:

> force_all_finite : bool or 'allow-nan', default=True
> Whether to raise an error on np.inf, np.nan, pd.NA in X. This parameter does not influence whether y can have np.inf, np.nan, pd.NA values.
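The docstring's claim is easy to verify directly: even with force_all_finite=False, a NaN in y is still rejected (a small sketch):

```python
import numpy as np
from sklearn.utils import check_X_y

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y_bad = np.array([0.0, np.nan, 1.0])

# force_all_finite only relaxes the check on X; y is still required
# to be finite.
try:
    check_X_y(X, y_bad, force_all_finite=False)
    nan_in_y_allowed = True
except ValueError:
    nan_in_y_allowed = False

print(nan_in_y_allowed)  # False
```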

> I am still thinking about whether we may want to check somewhere whether the confounders are finite / not missing and emit a warning stating that this may be a problem for some learners. Presumably such a check is best placed in the data backend DoubleMLData, and we could add a force_all_X_finite flag there so that it is possible to deactivate the check.

I think a warning would be good.

Thanks for thinking about this and responding so quickly! Love the package, it's well made!

MalteKurz commented 2 years ago

Hi @morelandjs, this is now fixed. See #122 for details. Would be great if you could give it a try (just install the current dev version from GitHub master) and let me know whether it now works for your case. Thanks, Malte

morelandjs commented 2 years ago

Awesome! Thank you. I'll check it out.