Quantco / metalearners

MetaLearners for CATE estimation
https://metalearners.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Allow for passing of 'fixed' propensity scores #54

Open · kklein opened this issue 2 months ago

kklein commented 2 months ago

Several MetaLearners, such as the R-Learner or DR-Learner, have propensity base models.

As of now, they are trained -- just like all other base models -- on the data passed to the MetaLearner's fit call.

In particular in the case of non-observational data (e.g. a randomized experiment where the assignment mechanism is known by design), it might be interesting to pass 'fixed' propensity scores instead of trying to infer the propensities from the experiment data.

Next steps:

  • Assess different implementation options and their design implications (e.g. does creating a wrapped 'model' predicting on the end-user side do the trick? Is it a reasonable suggestion to provide no features to the propensity model? If not, should the scores be provided in __init__, fit or predict?) A rough sketch of the wrapped-model option follows below.
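For illustration, the 'wrapped model' option could look roughly like the sklearn-style stub below. This is only a sketch: the class name is hypothetical, and how such an object would be passed through the MetaLearner's factory machinery is exactly the design question to assess.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class FixedPropensityClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical sklearn-style 'model' that ignores the covariates and
    simply returns user-provided propensity scores for the training rows."""

    def __init__(self, propensity_scores):
        self.propensity_scores = propensity_scores

    def fit(self, X, y):
        # Nothing is learned; we only record the treatment classes.
        self.classes_ = np.unique(y)
        return self

    def predict_proba(self, X):
        # Binary-treatment sketch: column 0 is P(control), column 1 is P(treatment).
        p = np.asarray(self.propensity_scores)
        if len(p) != len(X):
            # This is the out-of-sample problem discussed further down in the
            # thread: fixed scores only exist for the rows they were provided for.
            raise ValueError("No fixed propensity scores available for these rows.")
        return np.column_stack([1 - p, p])

    def predict(self, X):
        return (np.asarray(self.propensity_scores) > 0.5).astype(int)
```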

FrancescMartiEscofetQC commented 2 months ago

Some ideas:

kklein commented 1 month ago

@ArseniyZvyagintsevQC Moving the discussion from the already closed PR to this open issue

https://github.com/Quantco/metalearners/pull/72#issuecomment-2267884739

I believe I understand your motivation for not wanting to pass only floats. A practical complication, though, is that this will cause problems at inference time for out-of-sample data. Granted, the predict methods of e.g. the R-Learner and DR-Learner don't require the propensity model. Yet, if someone were to call predict_nuisance against the propensity model with is_oos=True and actual out-of-sample data, I imagine that this would lead to a nasty error that's difficult to understand for someone who isn't too familiar with the codebase.

What did you have in mind regarding the concern I'm raising?

Are you confident that learning the propensity model based on covariates won't effectively lead to a recovery of the fixed propensities?

ArseniyZvyagintsevQC commented 1 month ago

@kklein I do understand your concerns. I have two ideas in mind for how we could pass fixed propensity scores without breaking the design or raising nasty errors:

1) When fitting, pass a vector prop_scores as an additional argument. If is_oos=False, the propensity model returns the corresponding values of the prop_scores vector. If is_oos=True, the propensity model returns 1/2. Note that this should not matter, as we do not actually use prop_scores when making out-of-sample predictions (the one exception is the X-Learner, but 1/2 would also work there; in the original paper the authors do not insist on using the propensity score as the coefficient for merging the D0 and D1 models). In the end, even if we call predict_nuisance, this does not break: the propensity model returns either the true propensity scores or 1/2.

2) Similarly to (1), pass prop_scores as a vector when fitting the metalearner, but proceed differently: transform the prop_scores to log-odds, create an additional column "prop_scores" in the X dataframe, update the feature_set such that the propensity model uses the "prop_scores" column only, and change the propensity_model_factory to logistic regression. Why? Because now, when we fit the model, we actually have a propensity model that estimates propensity scores (in a very straightforward way, since the predicted log-odds equal prop_scores). Whenever the predict method is called, we just add the "prop_scores" column appropriately (with the real prop_scores if is_oos=False, or zeros if is_oos=True). No nasty error is raised.
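To make idea (2) a bit more concrete, here is a minimal, self-contained sketch in plain scikit-learn (no metalearners API involved) showing that a logistic regression whose single feature is the log-odds of the known scores essentially reproduces those scores:

```python
import numpy as np
from scipy.special import logit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Known (design) propensity scores, e.g. from a stratified experiment.
prop_scores = rng.uniform(0.2, 0.8, size=100_000)
w = rng.binomial(1, prop_scores)  # realized treatment assignment

# Idea (2): the *only* feature of the propensity model is the log-odds
# of the known scores.
X_prop = logit(prop_scores).reshape(-1, 1)
model = LogisticRegression().fit(X_prop, w)
recovered = model.predict_proba(X_prop)[:, 1]

# With enough data the fit is close to the identity on the log-odds scale
# (coefficient ~ 1, intercept ~ 0), so the 'estimated' propensities
# essentially reproduce the fixed ones.
print(model.coef_, model.intercept_)          # roughly [[1.]] and [0.]
print(np.abs(recovered - prop_scores).max())  # small
```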

Note that users can already do something like (2) under the current implementation. In my project I simply added a column called prop_scores to the data and specified that the propensity model use this column only. It worked out nicely.
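For reference, the data-side part of that workaround could look roughly as follows. The metalearner wiring is only indicated in comments, since the exact feature_set / propensity_model_factory signature should be taken from the library documentation rather than from this sketch:

```python
import numpy as np
import pandas as pd
from scipy.special import logit

# Hypothetical experiment data in which the assignment probabilities are
# known per row by design.
rng = np.random.default_rng(1)
n = 5_000
df = pd.DataFrame({"x0": rng.normal(size=n), "x1": rng.normal(size=n)})
design_propensity = np.where(df["x0"] > 0, 0.7, 0.3)
df["w"] = rng.binomial(1, design_propensity)
df["y"] = df["x0"] + 0.5 * df["w"] + rng.normal(size=n)

# The workaround: expose the known scores as an extra feature column ...
df["prop_scores"] = logit(design_propensity)

# ... and configure the metalearner so that its propensity model uses only
# this column, along the lines of (parameter names as discussed in this
# thread; check the docs for the exact signature):
#
#   DRLearner(
#       ...,
#       propensity_model_factory=LogisticRegression,
#       feature_set={..., "propensity_model": ["prop_scores"]},
#   )
```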