ebuhle / LCRchumIPM

This is the development site for an Integrated Population Model for chum salmon in the lower Columbia River.

Censoring "bad" estimates #9

Closed: kalebentley closed this issue 1 year ago

kalebentley commented 3 years ago

I noticed that we have been inconsistent in how we censor "bad" estimates of abundance. Specifically, some are removed directly from our data files (entered as NA), while others have seemingly been omitted somewhere in the data summarization code. We should be consistent, and I propose censoring any bad estimates during the data summarization process in R, but I'm open to suggestions.

Over the past week, I've been reviewing past estimates of abundance and have compiled a list of estimates that should be censored. Some have already been censored (again, either in the data files or the R script) while others haven't.

Below is a list of all estimates that should be censored, summarized by brood year, population, estimate type (adult vs. juvenile), whether or not the estimate has already been censored (Y/N) and where (data file or R script), and a short summary of why the estimate should be censored.
[Image: table of estimates to be censored, with the columns described above]

For the estimates that have already been censored from the data files, I could go back and add those values back in (again, for consistency). I realize this isn't that big of a deal, but I'll wait to do it until we decide as a group what to do.

I am open to ideas as to how best to implement the censoring. One idea would be to maintain a "censorship" list in a data file; I've uploaded a file to our "data" folder if this is of interest.
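For example, here's a rough sketch of what the summarization step could look like with an external censor list. The column names and values below are placeholders, not the actual contents of the uploaded file:

```r
library(dplyr)

# Toy summarized data: one row per population x brood year (values made up)
fish_data <- data.frame(pop        = c("Grays", "Grays"),
                        brood_year = c(2015, 2016),
                        S_obs      = c(1200, 850),
                        M_obs      = c(50000, NA))

# External "censorship" list kept under data/: which estimate to drop and why
censor_list <- data.frame(pop        = "Grays",
                          brood_year = 2016,
                          est_type   = "adult",
                          reason     = "incomplete surveys")

# Join the censor list and NA-out the flagged estimates during summarization
fish_data <- fish_data %>%
  left_join(mutate(censor_list, censored = TRUE),
            by = c("pop", "brood_year")) %>%
  mutate(S_obs = if_else(coalesce(censored, FALSE) & est_type == "adult",
                         NA_real_, S_obs),
         M_obs = if_else(coalesce(censored, FALSE) & est_type == "juvenile",
                         NA_real_, M_obs)) %>%
  select(-censored, -est_type, -reason)
```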

@ebuhle Thoughts?

tbuehrens commented 3 years ago

@kalebentley @ebuhle I'd add all estimates to the data file and add a column to flag them as bad, so that all censoring happens in R (which I believe is what you're proposing).

Hillsont commented 3 years ago

Agreed, keep the data files as “rich” as we can, censor as needed via code.

Thanks

TH

ebuhle commented 3 years ago

Thanks for pulling this together, @kalebentley. My gut reaction is that throwing out observations in an already somewhat sparse data set (relative to the complexity of the model we're trying to fit) is a bummer. Granted, it's only 12/200 non-NA spawner observations and 1/61 for smolts, but it's worth at least thinking about whether some of them are salvageable. Also, those 12 new NAs in S_obs would be very much not missing at random in space or time, which raises its own concerns.

This issue, as framed, is really more a matter of missingness (introduced on data-quality grounds) than of censoring. Maybe the former is sometimes a special case of the latter, but typically censoring means you only know the value of a variable to within some interval. I'm not (just) being pedantic; it looks like 12 of these "bad data quality" cases, including 3 that have already been excluded, are in fact right-censored. Depending on how the "Min # (biased low)" values were arrived at -- and please fill me in here -- it should be possible to modify the observation likelihood / informative prior for those observations accordingly.
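To make that distinction concrete, here is a toy R sketch (all numbers made up, and assuming a lognormal observation model) of the likelihood contribution for a single spawner observation, exact vs. right-censored at a known minimum count:

```r
# Toy example, not model code: likelihood contribution for one spawner
# observation under a lognormal observation model (all values made up).
S_true <- 1500    # latent "true" spawner abundance (state)
tau    <- 0.15    # lognormal observation SD

# Exact observation: S_obs ~ lognormal(log(S_true), tau)
S_obs <- 1400
ll_exact <- dlnorm(S_obs, meanlog = log(S_true), sdlog = tau, log = TRUE)

# Right-censored observation: all we know is that abundance was at least the
# minimum count, so the contribution is log Pr(obs >= S_min) = log(1 - F(S_min))
S_min <- 1200
ll_censored <- plnorm(S_min, meanlog = log(S_true), sdlog = tau,
                      lower.tail = FALSE, log.p = TRUE)
```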

As for workflow, I agree with @tbuehrens and @Hillsont. The simplest thing would be to just add a column to the existing raw data files to flag "bad" observations and deal with them on the R side.
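Something along these lines, where the file path and column names are purely placeholders:

```r
library(dplyr)

# Hypothetical raw file that keeps every estimate plus a flag column
# (here "flag_bad" coded "Y"/"N"); the path and names are placeholders.
spawner_raw <- read.csv("data/spawner_estimates.csv")

# Censoring happens on the R side: flagged point estimates become NA
spawner_data <- spawner_raw %>%
  mutate(S_obs = replace(S_obs, !is.na(flag_bad) & flag_bad == "Y", NA))
```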

ebuhle commented 3 years ago

Another thing I think is germane to this issue is one I've discussed with @tbuehrens and @kalebentley recently, which we might call "bad" observation error estimates. I'm referring to the often suspiciously precise spawner and smolt abundance estimates: for spawners, 40% of tau_S_obs values (computed directly from the reported posterior summary statistics) are < 0.1 and 20% are < 0.05; for smolts, 63% of tau_M_obs values are < 0.1 and 37% are < 0.05.
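For reference, here's roughly how tau can be backed out from the reported summaries, shown with toy numbers rather than the real estimates and assuming a lognormal observation model:

```r
# Toy numbers, not the actual estimates: lognormal observation SD (tau)
# implied by each estimate's reported posterior mean and SD.
est_mean <- c(1200, 8500, 430)   # reported posterior means (hypothetical)
est_sd   <- c(40, 900, 15)       # reported posterior SDs (hypothetical)

cv  <- est_sd / est_mean         # CV of each abundance estimate
tau <- sqrt(log(1 + cv^2))       # lognormal sdlog corresponding to that CV

mean(tau < 0.1)                  # proportion of observations with tau < 0.1
mean(tau < 0.05)                 # proportion with tau < 0.05
```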

What could possibly be wrong with highly precise estimates? Well, nothing, if the precision is accurately estimated and the sample is representative of the population. But as @tbuehrens has pointed out, all of the observation models make assumptions that are not always met. And @kalebentley has mentioned smolt estimates where method == "Census" but volumetric measurement was used to extrapolate smolt numbers. Overestimating observation precision will force the states to match the point observations almost perfectly (as they do in nearly all cases), which attributes nearly all the stochasticity to the process dynamics themselves. These huge stage-specific process errors then cause the forecasts (or hindcasts / interpolations) to blow up immediately.
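To illustrate the "blowing up" with a toy calculation (numbers entirely made up): if essentially all the variability is attributed to process error, a lognormal random-walk forecast interval widens explosively with horizon.

```r
# Toy illustration, not model output: forecast intervals when nearly all
# variability is attributed to process error on the log scale.
sigma_proc <- 1.0          # one-step process SD on the log scale (made up)
last_state <- 1000         # last estimated abundance
h <- 1:3                   # forecast horizon in years

lower <- exp(log(last_state) + qnorm(0.025) * sigma_proc * sqrt(h))
upper <- exp(log(last_state) + qnorm(0.975) * sigma_proc * sqrt(h))
cbind(h, lower, upper)     # 95% intervals span ~3 orders of magnitude by year 3
```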

Spurious precision is a familiar problem in stock assessment, especially w.r.t. overdispersed compositional data. This is a slightly different beast; it's more about deciding how much weight to give to the sample-based "informative priors" on smolt or spawner abundance, vs. some form of model-based smoothing in the IPM or a case-by-case judgment call as in this thread.

We've covered some of this ground before, but the model has come a long way since then and the effects of specific components are clearer. At some early stage, I think I tried actually replacing the "known" tau_[x]_obs values, not just the missing ones, with draws from the hyperdistribution -- a rather aggressive form of model-based smoothing. @tbuehrens suggested using analysis type as a sort of fixed covariate for the hyperparameters of tau_[x], so that more precise methods would have lower CVs on average. At the time I thought that was getting ahead of ourselves with bells and whistles, but we're at a point now where it might be worth considering some refinement like that.
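Conceptually (this is just an illustration, not actual model code), that refinement would amount to partially pooling the observation-error SDs on the log scale around method-specific means, e.g.:

```r
# Conceptual sketch, not actual model code: observation-error SDs partially
# pooled around method-specific means, so that more precise analysis types
# have lower tau on average. Method names and numbers are made up.
set.seed(123)
methods    <- c("Census", "Mark-recapture", "Expansion")
mu_method  <- log(c(0.05, 0.15, 0.30))   # method-level means of log(tau)
sigma_tau  <- 0.3                        # within-method spread of log(tau)

obs_method <- sample(methods, 50, replace = TRUE)
tau_obs    <- exp(rnorm(50, mean = mu_method[match(obs_method, methods)],
                        sd = sigma_tau))

tapply(tau_obs, obs_method, median)      # medians reflect the method effect
```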

Anyway, I don't have any silver bullets, but maybe a place to start is by going over the error estimates with a fine-toothed comb, as @kalebentley has done for the point estimates, and flagging any cases (e.g., the fry/L ones) where we might suspect spurious precision.

Thoughts?