dscolby / CausalELM.jl

Taking causal inference to the extreme!
https://dscolby.github.io/CausalELM.jl/
MIT License

Add the marginal effect or the constant marginal effect #76

Open juandavidgutier opened 1 week ago

juandavidgutier commented 1 week ago

Hi @dscolby, I work on causal learning in eco-epidemiology, and I recently discovered CausalELM. The package has important features, such as G-computation and the E-value. However, it would be amazing if you could add the marginal effect (or the constant marginal effect) along with its confidence interval.

dscolby commented 1 week ago

Hi @juandavidgutier, thanks for the feedback. In theory the confidence intervals should be straightforward to calculate via randomization inference, like we do with the p-values. However, this procedure is slow: there is a matrix inversion every time a new model is estimated under a permutation of the treatment vector. So I think it would have to be disabled by default and enabled by setting the inference argument to true, as is the case with p-values. But I would also be interested if you have any faster ways of computing p-values and confidence intervals in a nonparametric, generic way that would work for all the models. For example, to get p-values (and confidence intervals) we would do this:

using CausalELM

g_computer = GComputation(x, t, y)
estimate_causal_effect!(g_computer)
summarize(g_computer, inference=true)

I think this would definitely be feasible for the next release, though.
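In generic terms, the randomization-inference procedure described above looks roughly like the following minimal Python sketch. The difference-in-means `ate` here is only a stand-in for refitting one of the actual estimators under each permutation, and `permutation_p_value` is an illustrative name, not part of CausalELM's API:

```python
import random
import statistics

def ate(treatment, outcome):
    """Stand-in estimator: difference in mean outcomes, treated vs. control."""
    treated = [y for t, y in zip(treatment, outcome) if t == 1]
    control = [y for t, y in zip(treatment, outcome) if t == 0]
    return statistics.mean(treated) - statistics.mean(control)

def permutation_p_value(treatment, outcome, n_permutations=1000, seed=0):
    """Proportion of permuted treatment assignments whose effect is at
    least as extreme as the observed one (two-sided)."""
    rng = random.Random(seed)
    observed = abs(ate(treatment, outcome))
    count = 0
    for _ in range(n_permutations):
        shuffled = treatment[:]      # permute the treatment vector...
        rng.shuffle(shuffled)
        if abs(ate(shuffled, outcome)) >= observed:  # ...and re-estimate
            count += 1
    return count / n_permutations
```

The cost is exactly what's described above: one full re-estimation per permutation, which is why it should stay opt-in.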

As for marginal effects, I know this is very straightforward for something like a logistic regression, but I'm not exactly sure how you would do it with multiple models, e.g. when using double machine learning or a metalearner. Do you have any references for this? I'm definitely open to it and I think it would be good to calculate marginal effects in the summarize method.

juandavidgutier commented 1 week ago

Hi @dscolby,

Unfortunately, I am not an expert in programming in Julia, but one option for estimating confidence intervals could be to follow the documentation of the Python package EconML. As I understand it, the procedure would be as follows:

For Confidence Intervals (details at: https://econml.azurewebsites.net/_modules/econml/inference/_bootstrap.html#BootstrapEstimator)

1. Bootstrap estimation: generate multiple estimates of the model using different bootstrap samples of the original data and store the results for later analysis.
   1.1. Bootstrap sample generation. Sampling with replacement: multiple samples (resamples) are generated from the original data set by randomly selecting observations with replacement. Number of resamples (`n_bootstrap_samples`): a specific number of bootstrap samples is defined depending on the desired accuracy and the available computational resources.
   1.2. Model fitting on each bootstrap sample. Independent training: the estimator (model) of interest is fitted on each bootstrap sample. The fitted model's parameter estimates are extracted from each sample and stored to build the empirical distribution of the estimators.
   1.3. Handling additional parameters. Parallelization: fitting on multiple bootstrap samples is computationally intensive, so EconML uses joblib to improve efficiency.
2. Statistical inference from bootstrap estimates: use the stored estimates to derive point estimates, standard errors, and confidence intervals for the model parameters.
   2.1. Point estimates and standard errors. Point estimate: the mean of the bootstrap estimates. Standard error: the standard deviation of the bootstrap estimates, which captures the variability of the estimator.
   2.2. Constructing confidence intervals. Percentile method: confidence intervals are constructed by taking the corresponding percentiles of the bootstrap distribution. Confidence levels: a desired confidence level is defined and the associated percentiles are calculated (e.g., 2.5% and 97.5%).
   2.3. Summary and presentation of results. Point estimates, standard errors, and confidence intervals are combined into a consistent data structure and usually presented in tabular form to facilitate interpretation.
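The steps above can be sketched generically like this. It is only a minimal stdlib Python sketch of the percentile method, with a hypothetical `bootstrap_ci` name and an `estimator` callback standing in for refitting a real model:

```python
import random
import statistics

def bootstrap_ci(treatment, outcome, estimator, n_bootstrap_samples=1000,
                 alpha=0.05, seed=0):
    """Percentile-method bootstrap: point estimate, standard error, and CI.

    `estimator` is any function mapping (treatment, outcome) to a scalar
    effect estimate; this is a generic sketch, not EconML's actual API.
    """
    rng = random.Random(seed)
    n = len(treatment)
    estimates = []
    for _ in range(n_bootstrap_samples):
        # 1.1: resample units with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        t_star = [treatment[i] for i in idx]
        y_star = [outcome[i] for i in idx]
        # 1.2: refit the estimator on the resample and store the estimate
        estimates.append(estimator(t_star, y_star))
    estimates.sort()
    # 2.1: point estimate and standard error from the bootstrap distribution
    point = statistics.mean(estimates)
    se = statistics.stdev(estimates)
    # 2.2: percentile confidence interval (e.g., 2.5% and 97.5% for alpha=0.05)
    lo = estimates[int((alpha / 2) * n_bootstrap_samples)]
    hi = estimates[int((1 - alpha / 2) * n_bootstrap_samples) - 1]
    return point, se, (lo, hi)
```

Each iteration refits the full model, which is the same per-iteration cost as randomization inference; EconML parallelizes this loop with joblib.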

For the marginal effect, you could see the following EconML documentation:

  - Marginal effect: https://econml.azurewebsites.net/_autosummary/econml._cate_estimator.html?highlight=marginal%20effect#econml._cate_estimator.BaseCateEstimator.marginal_effect
  - Constant marginal effect: https://econml.azurewebsites.net/_autosummary/econml.dml.DML.html?highlight=const_marginal_effect#econml.dml.DML.const_marginal_effect

dscolby commented 1 week ago

@juandavidgutier Bootstrapping runs into the same performance issues as randomization/permutation inference. I considered bootstrap inference but ultimately went with randomization inference because it answers a slightly different question. Bootstrapping tells us the probability of seeing an effect at least as extreme as the estimated effect under a theoretical (normal) sampling distribution, while randomization inference tells us the proportion of times we would see an effect at least as extreme under different treatment assignment mechanisms. Either way, I think I'll need to work on getting it parallelized, which is what EconML does. But getting the p-value is definitely feasible, so I'll work on that as I have time.

For the marginal effect, I was originally thinking about taking derivatives with estimators like R-learners that have multiple models, which would be tough, especially since each estimator is different. But it seems like other packages that use simpler estimators just use the finite difference approximation, which should be pretty straightforward to implement. So I'll also get to work on this for the next release, but it will probably be slow going because I have a lot on my plate right now.
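For reference, the finite-difference idea is estimator-agnostic: treat the fitted model as a black box and numerically differentiate its predicted effect with respect to the treatment. A minimal sketch, where `effect_fn` is a hypothetical callback for any fitted model's prediction as a function of the treatment level (not CausalELM's API):

```python
def finite_difference_marginal_effect(effect_fn, treatment_value, h=1e-4):
    """Central finite-difference approximation of the marginal effect,
    d(effect)/d(treatment), evaluated at `treatment_value`.

    `effect_fn` stands in for any fitted model's prediction as a function
    of the treatment level, so this works even when the estimator is a
    composite of several models (e.g., double machine learning).
    """
    # symmetric step around the evaluation point for O(h^2) accuracy
    return (effect_fn(treatment_value + h)
            - effect_fn(treatment_value - h)) / (2 * h)
```

Because it only requires predictions, the same approximation applies uniformly whether the underlying estimator is a single model or a metalearner.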

But again, thanks for the suggestions and references.

juandavidgutier commented 1 week ago

Hi @dscolby, you are right: randomization inference and bootstrapping have the same performance issues. Thanks for listening to the suggestion.