DeclareDesign / fabricatr

fabricatr: Imagine Your Data Before You Collect It
https://declaredesign.org/r/fabricatr

Total SD, take 2 #133

Open aaronrudkin opened 6 years ago

aaronrudkin commented 6 years ago

In #111, @chadhazlett proposed letting users specify a total standard deviation (or variance) for draw_normal_icc. I implemented this in early May. In that implementation, the user supplies an ICC and a total_sd; we generate the ICC variable stochastically, fixing one of the sds at 1 and deriving the other from the ICC; then total_sd is used to rescale the variable at the end.
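To make the rescaling approach concrete, here is a minimal base-R sketch of the steps described above. It is not fabricatr's actual code: the function name draw_icc_rescaled and its arguments are hypothetical, and it assumes the within-cluster sd is the one fixed at 1 and that ICC is defined as between variance over total variance.

```r
# Hypothetical sketch of the rescaling approach; not fabricatr's implementation.
draw_icc_rescaled <- function(clusters, ICC, total_sd) {
  stopifnot(ICC >= 0, ICC < 1, total_sd > 0)
  ids <- as.integer(factor(clusters))

  # Assumption: fix the within-cluster sd at 1 and derive the between-cluster
  # sd from ICC = between^2 / (between^2 + within^2).
  sd_within <- 1
  sd_between <- sqrt(ICC / (1 - ICC)) * sd_within

  # One random effect per cluster, plus individual-level noise.
  cluster_effect <- rnorm(max(ids), mean = 0, sd = sd_between)
  x <- cluster_effect[ids] + rnorm(length(ids), mean = 0, sd = sd_within)

  # Post-hoc rescaling: force the sample sd to equal total_sd.
  x * total_sd / sd(x)
}

set.seed(1)
clusters <- rep(1:20, each = 10)
x <- draw_icc_rescaled(clusters, ICC = 0.3, total_sd = 5)
sd(x)  # equals total_sd up to floating-point error, on every draw
```

Because the last line divides by the realized sd(x), the total is exact by construction, while the within and between components are only whatever the random draw happened to produce, multiplied by the same factor.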

An advantage of this approach, which is the one I think Chad suggested, is that the sample's total standard deviation equals total_sd exactly on every draw.

A disadvantage of this approach, as Neal mentioned, is that the rescaling may distort the between-group differences. Neal instead proposed relying on the identity total = within + between (on the variance scale). So, rather than a post-hoc scaling, you'd specify any two of the four quantities (ICC, within, between, total) and get the other two. I'd have to work out the math, but this would leave us with two kinds of constraint: the ICC and one of within/between determine the other of within/between and the total; the total and one of within/between determine the other of within/between and the ICC. I would have to think a bit about which combinations of arguments we would allow.
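As a starting point for that math, here is a rough sketch of how two supplied quantities could determine the other two. The helper resolve_sds and its argument names are hypothetical, not an existing or proposed fabricatr interface. It assumes the components are independent, so that total variance = within variance + between variance, and that ICC = between variance / total variance.

```r
# Hypothetical helper: name and interface are assumptions for illustration.
resolve_sds <- function(ICC = NULL, sd_within = NULL,
                        sd_between = NULL, total_sd = NULL) {
  given <- c(ICC = !is.null(ICC), within = !is.null(sd_within),
             between = !is.null(sd_between), total = !is.null(total_sd))
  if (sum(given) != 2) stop("Supply exactly two of the four arguments.")

  # Work on the variance scale, where total = within + between.
  vw <- if (given[["within"]]) sd_within^2 else NULL
  vb <- if (given[["between"]]) sd_between^2 else NULL
  vt <- if (given[["total"]]) total_sd^2 else NULL

  if (given[["ICC"]]) {
    if (ICC < 0 || ICC >= 1) stop("ICC must be in [0, 1).")
    if (!is.null(vw)) {
      vt <- vw / (1 - ICC)
    } else if (!is.null(vb)) {
      if (ICC == 0) stop("ICC = 0 with sd_between does not identify the total.")
      vt <- vb / ICC
    }
    vb <- ICC * vt
    vw <- vt - vb
  } else {
    # Two of the three variances are known; fill in the missing one.
    if (is.null(vt)) vt <- vw + vb
    if (is.null(vb)) vb <- vt - vw
    if (is.null(vw)) vw <- vt - vb
    if (vw < 0 || vb < 0) stop("total_sd cannot be smaller than a component sd.")
    ICC <- vb / vt
  }
  list(ICC = ICC, sd_within = sqrt(vw), sd_between = sqrt(vb),
       total_sd = sqrt(vt))
}

resolve_sds(ICC = 0.25, total_sd = 2)        # sd_between = 1, sd_within = sqrt(3)
resolve_sds(sd_within = 3, sd_between = 4)   # total_sd = 5, ICC = 0.64
```

Under this scheme all four quantities are population targets for the draw, so none of them is hit exactly in a given sample. Which pairs to accept as arguments is still an open design question.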

I agree the solution I came up with is imperfect: the other three quantities are targets that the stochastic draw hits only approximately, while total_sd is an exact mechanical consequence of the scaling.

Opening this issue so that there can be some discussion.

nfultz commented 6 years ago

I think rescaling is generally not what people would expect. For example, people don't usually expect

sd(rnorm(100)) == 1

to hold exactly; there's sampling variability there, and I think that when ICC = 0 we should behave the same as rnorm.
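A quick way to see the sampling variability in question, plus a sketch of what "the same as rnorm when ICC = 0" could look like without rescaling. The function draw_icc_unscaled is hypothetical and only illustrates the idea; it is not fabricatr's code.

```r
set.seed(42)

# Sample sd of 100 standard normal draws, repeated many times.
sds <- replicate(1000, sd(rnorm(100)))
range(sds)     # scattered around 1
any(sds == 1)  # exact equality is not something to expect

# Hypothetical unscaled draw: sds are population targets, not exact values.
draw_icc_unscaled <- function(clusters, ICC, sd_within = 1) {
  ids <- as.integer(factor(clusters))
  sd_between <- sqrt(ICC / (1 - ICC)) * sd_within
  cluster_effect <- rnorm(max(ids), mean = 0, sd = sd_between)
  cluster_effect[ids] + rnorm(length(ids), mean = 0, sd = sd_within)
}

# With ICC = 0 the between-cluster sd is 0, so every cluster effect is 0
# and the result is just individual-level normal noise.
x0 <- draw_icc_unscaled(rep(1:20, each = 10), ICC = 0)
sd(x0)  # near 1, with ordinary sampling variability
```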