acoppock / Green-Lab-SOP

Standard Operating Procedures for Don Green's Lab at Columbia

Can we include more than M/20 covariates if we use Lasso? #3

Open acoppock opened 9 years ago

acoppock commented 9 years ago

Jas's paper on this seems like a nice way to include arbitrarily many covariates without having to worry about degrees of freedom. It would also mean worrying less about choosing, on the basis of "principle," the few covariates you do include.

http://sekhon.berkeley.edu/papers/lasso.pdf

linstonwin commented 9 years ago

Thanks, Alex. I've read an earlier draft of this paper and definitely want to read the new version when I have time.

For now, we aren't including lasso or other automated model selection methods in our SOP. But, as always, PAPs are free to deviate from the SOP (and please feel free to let me know if you want to discuss anything as you write PAPs).

We'll consider adding such methods to the SOP in the future. In the Safety Net essay, we wrote:

"We note in closing that we do not regard all of the defaults in our SOP as clearly superior to the alternatives. For example, in the section on covariate adjustment, we recommend that covariates be pre-specified 'on the basis of their expected ability to help predict outcomes,' give rules of thumb for the maximum number of covariates, and suggest how a jury can be used in exceptional cases (e.g., when a new source of baseline data becomes available after random assignment). We considered the alternative of adopting automated model selection methods, but would like to see more evidence that (1) valid confidence intervals can be constructed when such methods are used and (2) the benefits of such methods (possible improvements in precision) outweigh the costs (increased computing time, possible loss of transparency to non-expert readers). This is just one example of a topic where, as the literature advances and evolves, our SOP may evolve as well."

(Increased computing time might become a nontrivial issue when lasso selection of covariates is combined with other computation-intensive methods, such as permutation tests or resampling-based multiple-comparisons corrections.)
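To make the computing-time concern concrete, here is a rough back-of-the-envelope sketch in Python. It only counts model fits; it does not run lasso. The function name, the parameter names, and the example numbers (10,000 permutations, 10 cross-validation folds, 100 candidate penalty values) are all illustrative assumptions, not anything specified in the SOP or in the paper linked above. It also assumes that covariate selection would be redone for every permuted assignment, which may or may not be required depending on how the selection step uses the treatment indicator.

```python
def lasso_fits_needed(n_permutations, n_cv_folds, n_lambdas,
                      reselect_each_permutation=True):
    """Roughly count the lasso fits behind one permutation test.

    Hypothetical helper for illustration only. Assumes the penalty is tuned
    by k-fold cross-validation over a grid of candidate penalty values, so
    one round of covariate selection costs n_cv_folds * n_lambdas fits.
    """
    fits_per_selection = n_cv_folds * n_lambdas
    if reselect_each_permutation:
        # One selection on the observed assignment plus one per permutation.
        n_selections = 1 + n_permutations
    else:
        # Selection done once, then reused for every permuted assignment.
        n_selections = 1
    return fits_per_selection * n_selections


# Illustrative numbers only (assumptions, not recommendations).
redo = lasso_fits_needed(10_000, 10, 100)
once = lasso_fits_needed(10_000, 10, 100, reselect_each_permutation=False)
print(redo)  # 10001000
print(once)  # 1000
```

Under these assumptions, redoing the selection inside each permutation multiplies the number of fits by roughly the number of permutations, which is why the combination could become expensive even when a single lasso fit is fast.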

So far I'm not inclined to make lasso our default method for covariate selection, but I'm interested in reading more and seeing more evidence on these issues, and I'd be happy to discuss this more sometime.