Allow declarations from `declare_ra()` to inform estimators about study design

lukesonnet commented 6 years ago

We should allow users to pass a declaration created by `declare_ra()` so that the estimator can learn whether the design uses simple or complete random assignment, and whether it is blocked, clustered, or blocked and clustered. Furthermore, if the probability of treatment varies across the sample, our estimators should default to using inverse probability weighting (IPW) in the analysis.
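To make the IPW default concrete, here is a minimal, language-neutral sketch in Python. It is not estimatr code: the function names (`needs_ipw`, `ipw_weights`, `horvitz_thompson_ate`) are hypothetical, and it assumes a two-arm design where each unit's probability of treatment is known from the declaration.

```python
def needs_ipw(prob_treated, tol=1e-12):
    """True when the probability of treatment varies across units."""
    return max(prob_treated) - min(prob_treated) > tol


def ipw_weights(z, prob_treated):
    """Inverse probability weights: 1/p for treated units, 1/(1 - p) for controls."""
    weights = []
    for zi, pi in zip(z, prob_treated):
        if not 0 < pi < 1:
            raise ValueError("treatment probabilities must lie strictly between 0 and 1")
        weights.append(1 / pi if zi == 1 else 1 / (1 - pi))
    return weights


def horvitz_thompson_ate(y, z, prob_treated):
    """Horvitz-Thompson estimate of the average treatment effect (hypothetical helper)."""
    n = len(y)
    w = ipw_weights(z, prob_treated)
    # Weighted outcome totals in each arm, both scaled by the full sample size.
    treated_total = sum(yi * wi for yi, zi, wi in zip(y, z, w) if zi == 1)
    control_total = sum(yi * wi for yi, zi, wi in zip(y, z, w) if zi == 0)
    return (treated_total - control_total) / n


# Toy data: the last two units have a lower probability of treatment.
y = [2.0, 4.0, 1.0, 3.0]
z = [1, 0, 1, 0]
p = [0.5, 0.5, 0.25, 0.25]

if needs_ipw(p):
    estimate = horvitz_thompson_ate(y, z, p)
```

When every unit has the same probability of treatment, `needs_ipw()` returns `False` and an unweighted estimator would be the natural default.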

A few choices:

We might consider adding a `declaration_options` argument that takes a list. To begin with, it should accept the following elements:
- `ipw = TRUE`, governing whether to take the probability of treatment and set the weights to the inverse propensity weights for `difference_in_means` or `lm_robust`.
- `treatment_condition = 2`, governing which of the arms to take as the treatment arm (both the column position in the declaration and the condition name should work). The next argument would do the same for the control condition. This means that if someone declares a three-armed trial, they just have to tell us which arm to take as the treatment and which as the control. I think the default should be `NULL`, so that it errors with a message saying that you have a multi-armed trial and should specify the two arms to compare. Another consideration is whether we should allow this and the next argument to be numeric vectors. In that case, if `treatment_condition = c(2, 3)`, we could treat conditions 2 and 3 as the same condition and group them together. Of course, it would be nice to have a way to validate that the `Z` we are passed matches `treatment_condition`, but I can't see one that works across common use cases.
- `control_condition = 1`
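The arm-selection rules above could look roughly like the following Python sketch. Everything here is hypothetical (`resolve_arm`, `resolve_contrast`, and the example condition names are illustrative, not part of estimatr or randomizr). It assumes condition names are strings, that positions are 1-based as in R, and, as a further assumption not stated in the proposal, that an unspecified two-arm design defaults to the first condition as control and the second as treatment.

```python
def resolve_arm(spec, condition_names):
    """Map a position, a condition name, or a list of either to a set of condition names."""
    items = spec if isinstance(spec, (list, tuple)) else [spec]
    resolved = set()
    for item in items:
        if isinstance(item, int) and not isinstance(item, bool):
            # Positions are 1-based, mirroring column positions in the declaration.
            if not 1 <= item <= len(condition_names):
                raise ValueError(f"condition position {item} is out of range")
            resolved.add(condition_names[item - 1])
        elif item in condition_names:
            resolved.add(item)
        else:
            raise ValueError(f"unknown condition: {item!r}")
    return resolved


def resolve_contrast(condition_names, treatment_condition=None, control_condition=None):
    """Return (treatment arms, control arms) as sets of condition names."""
    if treatment_condition is None or control_condition is None:
        if len(condition_names) == 2:
            # Assumed default for two-arm designs: first is control, second is treatment.
            return {condition_names[1]}, {condition_names[0]}
        raise ValueError(
            "multi-armed trial: please specify the two arms to compare "
            "via treatment_condition and control_condition"
        )
    treatment = resolve_arm(treatment_condition, condition_names)
    control = resolve_arm(control_condition, condition_names)
    if treatment & control:
        raise ValueError("a condition cannot be both treatment and control")
    return treatment, control


# Three-armed trial: pool arms 2 and 3 as treatment, compare against arm 1.
arms = ["control", "low_dose", "high_dose"]
treatment, control = resolve_contrast(arms, treatment_condition=[2, 3], control_condition=1)
```

Passing a vector pools the listed arms into one group, which corresponds to the `treatment_condition = c(2, 3)` case discussed above.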

The main hiccup I see with this is handling missing data and subsetting. I think that if, before subsetting, `nrow(data)` differs from `length(declaration$block_var)` or `length(declaration$clust_var)`, we should error; when the lengths match, we should go ahead and apply the same subsetting to the declaration variables.

Another question: if someone passes a `declare_ra()` declaration, the implications are clear for `horvitz_thompson` and `difference_in_means` but not for `lm_robust`. We cluster SEs at the cluster level, but do we add dummy variables for the blocks if they aren't there? I think that's a little too paternalistic.

graemeblair commented 6 years ago

> We might consider adding a declaration_options argument that can take a list. It should take the following arguments to begin with:

We've tried to avoid lists like this everywhere in the packages, so I hope we can avoid one here and instead have options that are appropriately named for each possibility (like `declaration_ipw`, or something similar but less ugly).

> treatment_condition = 2 governing which of the arms (column position in the declaration as well as condition name should work) to take as the treatment arm. The next argument would do the same for the control condition. This means that if someone declares a three-armed trial, they just have to tell us which arm to take as the treatment and which as the control. I think default behavior should be NULL so that it errors and says you have a multi-armed trial to please specify the two arms to compare. Another consideration is whether we should allow this and the next argument to be numeric vectors. In that case, if treatment_condition = c(2, 3) then we could treat both treatment conditions 2 and 3 as the same condition and group them together. Of course, it would be nice if there was a way to validate that the Z we are passed was the same as treatment_condition but there isn't a way that works across common use cases that I can see.
>
> control_condition = 1

Not following why we need this, can you clarify? Why do we need to know?

> The main hiccup I see with this is handling missing data and subsetting. I think if pre-subsetting nrow(data) is different from length(declaration$block_var) or length(declaration$clust_var) that we should error, and we should go ahead and apply subsetting to the declaration variables.

Definitely should error in this case, agree.

> Another question: if someone passes declare_ra, it is clear what the implications are for horvitz_thompson and difference_in_means but not lm_robust. We cluster SEs at the cluster level, do we add dummy variables for the blocks if they aren't there? I think that's a little too paternalistic.

Agreed!

graemeblair commented 6 years ago

Following discussion with @macartan, I'm open to adding the dummies. Inferring best analysis practices from the RA declaration is not the default behavior, and it may be better to be consistently paternalistic (in this case by adding the dummies) across different design choices.

lukesonnet commented 6 years ago

> > treatment_condition = 2 governing which of the arms (column position in the declaration as well as condition name should work) to take as the treatment arm. The next argument would do the same for the control condition. This means that if someone declares a three-armed trial, they just have to tell us which arm to take as the treatment and which as the control. I think default behavior should be NULL so that it errors and says you have a multi-armed trial to please specify the two arms to compare. Another consideration is whether we should allow this and the next argument to be numeric vectors. In that case, if treatment_condition = c(2, 3) then we could treat both treatment conditions 2 and 3 as the same condition and group them together. Of course, it would be nice if there was a way to validate that the Z we are passed was the same as treatment_condition but there isn't a way that works across common use cases that I can see.
> >
> > control_condition = 1
>
> Not following why we need this, can you clarify? Why do we need to know?

We need to know because some declarations may have more than 2 treatment conditions, and you are trying to use the estimator to test the difference between 2 of the 3+ treatment conditions.

> Following discussion with @macartan I'm open to adding the dummies. This inferring best analysis practices from the RA declaration is not the default behavior and maybe better to be consistently paternalistic (in this case by adding the dummies) across different design choices.

Okay, I'll see if there's a good way to implement this (meaning checking if the dummies are already included or whether to throw a message or not).
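One possible shape for that check, as a rough Python sketch rather than anything implemented in estimatr: look for the block variable among the right-hand-side terms by name, and append it with a message only when it is missing. The function name `add_block_fixed_effects`, the term strings, and the `factor(...)` spelling are all assumptions for illustration. A name-based check like this would miss dummies that were built by hand under different column names.

```python
import warnings


def add_block_fixed_effects(rhs_terms, block_var, notify=True):
    """Return the model terms, appending block indicators unless already present."""
    normalized = {term.replace(" ", "") for term in rhs_terms}
    # Spellings under which the block variable might already enter the model.
    already_present = {block_var, f"factor({block_var})", f"as.factor({block_var})"}
    if normalized & already_present:
        return list(rhs_terms)
    if notify:
        warnings.warn(
            f"adding indicators for '{block_var}' because the declaration is blocked"
        )
    return list(rhs_terms) + [f"factor({block_var})"]


# The block variable is missing here, so it gets appended.
terms = add_block_fixed_effects(["z", "x"], "block", notify=False)
```

Whether to emit a message, a warning, or nothing when the dummies are added is exactly the open question above; the `notify` flag just makes that choice explicit.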

lukesonnet commented 6 years ago

We've decided that:

- If the length of the clusters, blocks, or condition probability matrix returned by `declare_ra()` differs from the number of rows in the data supplied in `data`, we error. I think this is correct: people should pass their full data matrix if they are using a declaration. If they want to subset, they can use the `subset` argument, which should work appropriately.
- To deal with listwise deletion, blocks and clusters passed from a `declare_ra()` declaration should be parsed by `clean_model_data()` to take advantage of the existing infrastructure and informative warnings.
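The two decisions above can be illustrated with a small Python sketch. It does not reproduce `clean_model_data()`; the function `align_design_with_data` and its arguments are hypothetical, rows are plain dicts, and `None` stands in for a missing value.

```python
import warnings


def align_design_with_data(rows, blocks=None, clusters=None, subset=None):
    """Validate design-variable lengths, then apply subsetting and listwise deletion jointly."""
    n = len(rows)
    for name, var in (("blocks", blocks), ("clusters", clusters)):
        if var is not None and len(var) != n:
            raise ValueError(
                f"{name} has length {len(var)} but data has {n} rows; "
                "pass the full data and use `subset` instead"
            )

    keep = []
    dropped_for_missingness = 0
    for i, row in enumerate(rows):
        if subset is not None and not subset(row):
            continue  # excluded by the user's subset, not by missingness
        design_values = [var[i] for var in (blocks, clusters) if var is not None]
        if any(value is None for value in list(row.values()) + design_values):
            dropped_for_missingness += 1
            continue
        keep.append(i)

    if dropped_for_missingness:
        warnings.warn(f"{dropped_for_missingness} row(s) dropped due to missing values")

    def pick(var):
        return None if var is None else [var[i] for i in keep]

    return [rows[i] for i in keep], pick(blocks), pick(clusters)


data = [
    {"y": 1.0, "z": 1},
    {"y": None, "z": 0},  # dropped by listwise deletion
    {"y": 3.0, "z": 1},
    {"y": 2.0, "z": 0},
]
kept_rows, kept_blocks, _ = align_design_with_data(data, blocks=["a", "a", "b", "b"])
```

Because the same row indices are kept for the data and for the design variables, the blocks and clusters stay aligned with the rows that actually enter the estimator.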

lukesonnet commented 6 years ago

Currently this is only done for horvitz_thompson(). I may add it to difference_in_means() if we have time, but I'm unlikely to do so before CRAN. Shifting this to post-CRAN.

DeclareDesign / estimatr, issue #29