akelleh / causality

Tools for causal analysis
MIT License

Difference between analysis and estimation #62

Closed fedorzh closed 2 years ago

fedorzh commented 6 years ago

I am a bit confused: what is the difference between causal analysis of the dataframe ("This method lets you control for a set of variables, z, when you're trying to estimate the effect") and the causality.estimation module, which "contains tools for estimating causal effects"? Is it that one uses the Robins G-formula and the other Propensity Score Matching, or do these methods serve different purposes?
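For concreteness, here's how I currently understand the two entry points (argument names are my best reading of the README, so please correct me if I'm misusing either one):

```python
import numpy as np

from causality.analysis.dataframe import CausalDataFrame
from causality.estimation.parametric import PropensityScoreMatching

# toy data: z confounds both the (binary) treatment x and the outcome y
np.random.seed(0)
N = 1000
z = np.random.normal(size=N)
x = (np.random.normal(size=N) + z > 0).astype(int)
y = x + z + np.random.normal(size=N)
df = CausalDataFrame({'x': x, 'y': y, 'z': z})

# "causal analysis of the dataframe": control for z directly
df.zmean(x='x', y='y', z=['z'], z_types={'z': 'c'})

# causality.estimation: instantiate a dedicated estimator object
matcher = PropensityScoreMatching()
matcher.estimate_ATE(df, 'x', 'y', {'z': 'c'})
```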

akelleh commented 6 years ago

Really good question! I'm not sure that I should maintain the difference, but maybe you have some opinions and we can sort that out.

Originally, the estimation module was meant as a place for different causal effect estimators to live. The analysis module would hold tools that expose those estimators in a way that fits a typical data science workflow (e.g. manipulating pandas dataframes).

In practice, I ended up implementing the Robins G-formula estimators in the analysis package, but that was really a weekend's worth of tech debt. To regain the original spirit of the division, I think I should remove it from the analysis module, put it back in the estimation module, and add interfaces to the other estimators through the analysis module.

One alternative might be to have more "data science" methods like g-formula estimation with machine learning estimators in the analysis package, and more "research" methods like PSM, Weighted OLS, etc. in the estimation package.

I think I like the former (all estimators living in the estimation module, with interfaces in the analysis module) a little better, for a couple of reasons. First, the software abstraction is much nicer. Second, it doesn't artificially draw a line between approaches to causal effect estimation.
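To make that concrete, here's a minimal sketch of the division I'm picturing (the names and the g-formula implementation are placeholders for illustration, not the actual package layout):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


# would live in causality.estimation: a pure estimator, no dataframe sugar
class GFormulaEstimator(object):
    """Minimal Robins g-formula for a binary treatment: fit E[Y | X, Z],
    then average the predictions over the empirical distribution of Z."""
    def estimate_ATE(self, df, x, y, z):
        model = RandomForestRegressor(n_estimators=100)
        model.fit(df[[x] + z], df[y])

        def mean_outcome_at(value):
            counterfactual = df[[x] + z].copy()
            counterfactual[x] = value
            return model.predict(counterfactual).mean()

        return mean_outcome_at(1) - mean_outcome_at(0)


# would live in causality.analysis: a thin, dataframe-friendly interface
class CausalDataFrame(pd.DataFrame):
    def zmean_ate(self, x, y, z):  # hypothetical method name
        # the interface layer just adapts the dataframe workflow to
        # whichever estimator from the estimation module gets selected
        return GFormulaEstimator().estimate_ATE(self, x, y, z)
```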

Any thoughts?

fedorzh commented 6 years ago

Thanks for answering. I am only a beginner in the field, just starting to learn about the various approaches to causality, so my opinion may not be grounded in experience.

But since you ask, I think the first approach makes much more sense: you have a set of core models/methods and a set of interfaces (through functions, or through CausalDataFrame). If you want to distinguish the methods, I'd rather distinguish them 1) by purpose (e.g. treatment effect estimation, DAG determination, etc.), and 2) possibly by keeping the "research" (or "experimental") ones in a separate subpackage, but not behind separate interfaces, if you want to mark off a set of "reliable" methods from those that are "for advanced users, and might not work in many cases".

akelleh commented 6 years ago

Thanks for the feedback! I'll pay the tech debt and implement the first approach next time I get a little time to work on the package. And don't worry about being newer to causal inference -- that's exactly the audience the package is for!

For "reliable" vs. "advanced", i figured I could implement the reliable methods as defaults, and advanced as optional through the same methods, but with extra args... the user-defined models in the causaldataframe.zplot method is a good example of the approach I'm proposing. Any opinion there?

fedorzh commented 6 years ago

"I could implement the reliable methods as defaults, and advanced as optional through the same methods, but with extra arg" This is exactly what I do in my packages, however, this only allows for one "reliable" - the default one. Not sure if that's what you want or not.

akelleh commented 6 years ago

Good point!

I'm not sure I like the idea of adding lots of separate methods, since that can be kind of daunting to the user. I like what pandas.DataFrame.plot does with its different plot types: there's a ton of flexibility available through a large number of kwargs.

I think there's a good compromise. For example, the zplot method has three levels of difficulty: (1) it defaults to a random forest regression; (2) with a single string kwarg (model_type='kernel') you can switch to a (slower, but often better) kernel density regression without needing to know what that means (I'd regard this as "alternative reliable defaults"); (3) with the more advanced kwargs (model=<trained model object with a predict method>), advanced users can drop in a trained model for maximum flexibility.
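In call form, the three levels look something like this (df is a CausalDataFrame as in the example above; treat the exact kwargs as approximate):

```python
from sklearn.linear_model import LinearRegression

# (1) reliable default: random forest regression under the hood
df.zplot(x='x', y='y', z=['z'], z_types={'z': 'c'}, kind='line')

# (2) alternative reliable default, switched by a single string kwarg
df.zplot(x='x', y='y', z=['z'], z_types={'z': 'c'}, kind='line',
         model_type='kernel')

# (3) maximum flexibility: drop in any pre-trained model with a .predict method
fitted = LinearRegression().fit(df[['x', 'z']], df['y'])
df.zplot(x='x', y='y', z=['z'], z_types={'z': 'c'}, kind='line',
         model=fitted)
```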

We could do something similar where we switch between different effect inference methods.
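Something like this, say (the method kwarg and its values are made up; nothing like this exists yet):

```python
# hypothetical: switch effect-inference methods the same way zplot
# switches regression models
df.zmean(x='x', y='y', z=['z'], z_types={'z': 'c'})                     # reliable default
df.zmean(x='x', y='y', z=['z'], z_types={'z': 'c'}, method='matching')  # opt-in alternative
```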