eriqande / rubias

identifying and reducing bias in hierarchical GSI
2 stars 3 forks source link

Distribution around means for method = PB in infer_mixture #17

Open krshedd opened 6 years ago

krshedd commented 6 years ago

Hey Eric, we've been doing some more testing with a Cook Inlet Chinook SNP baseline and found that the parametric bootstrap method was able to substantially improve reporting group performance on known mixtures. However, if we were to use that feature for reporting it would be nice to show CIs or some other measure of variance around our estimates. Do you know of a good way to provide an estimate of variance from infer_mixture when method = "PB"? Would it be appropriate to multiply the corrected estimates of pi by the traces in order to get CIs? Apologies if we are off base. Thanks in advance.

eriqande commented 6 years ago

Hi Kyle!

That's cool to hear that the PB method improves things in other real scenarios.

The correction in the PB method involves adding or subtraction an amount to each of the collection mixing proportions. So, rather than multiplying you would want to add or subtract the amount back into the mixing proportion traces.

For example, let's say the estimated mixing proportion of collection i is p_i (without any parametric bootstrap). And let's say that with parametric bootstrap correction the estimated mixing proportion is p*_i.

Find the difference between those:

d_i = p*_i - p_i

Now, d_i is the amount that was added to p_i to try to make it less biased. So, if you had a trace of uncorrected values visited during the course of the chain, you could add d_i to each of those and use that as an approximate sample from a corrected posterior. Of course, you might get values <0 or greater than 1, in which case you would want to squash them to be 0 or 1.

This is sort of a hack, but given the way we tried to reduce the bias by a simple translation of the estimated values, it seems the most appropriate.

Cheers,

eric

eriqande commented 6 years ago

I should probably add a function at some point that does the process described above.

I will call that an enhancement.

krshedd commented 6 years ago

Hey Eric, that is what we thought, but just wanted to make sure we understood the method correctly. It should be pretty straightforward for us to do what you described on our end (solve for d_i and add to each trace with hard limits at 0 and 1). Thanks for the quick response!

krshedd commented 6 years ago

Actually, upon further thought, the bootstrapped_proportions tibble doesn't have output by collection (i), it is already summed up to reporting units. I suppose we'd need the the output to provide estimates of p*_i for the method you described above.

eriqande commented 6 years ago

Oops! My bad.

That is right Kyle. So, what you will need to do is imagine that p_i, p*_i, and d_i refer to quantities related to the i-th reporting unit, rather than collection, and do the same thing, but with reporting units rather than collections.

krshedd commented 6 years ago

Hey Eric, at the very end of bootstrap_rho.R you end up having to normalize rho_pb after squashing values <0 to 0 in order for stock comps to sum to 1. If we want to extract CIs from the parametric bootstrap corrected values with the method that you described above (i.e. apply d_i to each sweep of the trace to get a "bias corrected trace"), would we compute d_i before or after normalization? In other words, does: 1) d_i = rho_mean - rho_est (pre-normalization, first line below) or 2) d_i = rho_est - rho_pb (post-normalization, an additional line at the end)?

I could be wrong, but it seems like we'd end up with different "bias corrected" results depending on both what level we normalize (pi vs. rho) and when we normalize. Here are the lines in question from bootstrap_rho.R

  rho_pb <- rho_est - (rho_mean - rho_est)
  rho_pb[rho_pb < 0] <- 0
  rho_pb <- rho_pb/sum(rho_pb)
  rho_pb

Hope that makes sense. Thanks in advance!