ebuhle / LCRchumIPM

This is the development site for an Integrated Population Model for chum salmon in the lower Columbia River.

Model structure: density dependence #6

Closed by ebuhle 1 year ago

ebuhle commented 4 years ago

As noted in #3, @tbuehrens made a good suggestion about putting the density dependence in egg-to-smolt survival instead of spawner egg deposition. I had to work through the algebra to convince myself that it would actually result in lower estimated egg-to-smolt survival on average. In cartoon form, using the Beverton-Holt as an example:

# Current

E = alpha * S / (1 + alpha * S / Emax)
M = E * s_EM
  = alpha * S / (1 + alpha * S / Emax) * s_EM
  = alpha * S * s0 * s_EM

where alpha is the age-weighted average per capita fecundity assuming equal sex ratio. The egg-to-smolt survival s_EM is density-independent but the "spawner-to-gravel egg survival" s0 = 1 / (1 + alpha * S / Emax) is density-dependent, reducing the per capita egg deposition below alpha.

# Proposed

E = alpha * S
M = E * s_EM
  = E * psi / (1 + psi * E / Mmax)
  = alpha * S * psi / (1 + psi * alpha * S / Mmax)

where "potential egg provision" is density-independent and egg survival, now defined as body cavity-to-smolt, is density-dependent with maximum survival psi.

Comparing the two parameterizations, it's clear that the current version partitions the overall egg-to-smolt survival into "spawner-to-gravel" survival s0 and the remaining "gravel-to-smolt" survival s_EM. Back-of-the-envelope calculations indicate s0 is typically in the range of 0.2 - 0.4, so s_EM is inflated by a factor of 2.5 - 5 relative to the second version. Seems about right. I doubt the fits to the data will differ appreciably, although there might be some subtleties depending on the random effects structure, etc.
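To make the comparison concrete, here is a quick numerical check (with made-up parameter values, not fitted ones) that the two Beverton-Holt parameterizations trace the same spawner-smolt curve under the mapping psi = s_EM and Mmax = Emax * s_EM, which is why the fits shouldn't differ appreciably:

```python
import math

# Hypothetical values, purely to illustrate the reparameterization
alpha, Emax, s_EM, S = 2800.0, 5e6, 0.3, 1000.0

# Current: density dependence on egg deposition, then density-independent s_EM
E_cur = alpha * S / (1 + alpha * S / Emax)
M_cur = E_cur * s_EM

# Proposed: density-independent egg deposition, density-dependent survival
psi, Mmax = s_EM, Emax * s_EM   # the mapping between the two parameterizations
E_prop = alpha * S
M_prop = E_prop * psi / (1 + psi * E_prop / Mmax)

assert math.isclose(M_cur, M_prop)  # identical smolt predictions
```

Only the partition of total egg-to-smolt survival differs: the "current" version assigns the density-dependent part to s0 and the "proposed" version folds it into psi.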

This also makes sense to me biologically, even though @kalebentley and I had previously talked ourselves into the first parameterization. There are potential density-dependent factors operating both before and after eggs are deposited in the gravel, so might as well roll them all together.

tbuehrens commented 4 years ago

Hi @ebuhle, @kalebentley @mdscheuerell @Hillsont @BradGarnerWDFW, I think the second parameterization both makes more sense biologically (in terms of when density dependence occurs...not much evidence for pre-spawn mortality in our chum, but lots of evidence of superimposition), and will produce results that better match empirical estimates of egg-to-fry survival elsewhere (in-gravel egg boxes seem to max out around 60-70%, and that is in dam-controlled, low-sediment systems; I'd expect 5-40% for us).

ebuhle commented 3 years ago

Hey folks, sorry I've been quiet lately. Before I get to the other recent Issues, I wanted to give an update on my latest adventures with freshwater survival and density dependence. This has turned out to be a much thornier problem than we anticipated.

First, I've settled (for now) on the formulation marked # Proposed above, but with the process errors being multiplicative lognormal on M, i.e.

M = E * s_EM * exp(eta_year_M + epsilon_M)

rather than acting directly on logit(s_EM) as I had tried initially. The error terms being confined to (0,1) seemed to cause problems, and this approach is also more realistic: we don't know whether any deviation around the expected spawner-smolt relationship was due to egg deposition, egg-to-smolt survival, or (probably) both.
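As a sketch of what that error structure looks like for one population-year (the sigmas here are made up, not the fitted values):

```python
import math
import random

random.seed(42)

# Hypothetical quantities for one population-year
E, s_EM = 1.0e6, 0.3                 # expected eggs and expected egg-to-smolt survival
sigma_year, sigma_unique = 0.3, 0.2  # made-up process-error SDs

eta_year_M = random.gauss(0.0, sigma_year)   # shared year-specific process error
epsilon_M = random.gauss(0.0, sigma_unique)  # unique process error
M = E * s_EM * math.exp(eta_year_M + epsilon_M)

# The realized ratio M / E is lognormal around s_EM: unbounded above, so it
# absorbs deviations in egg deposition as well as in egg-to-smolt survival.
```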

That said, this model still has terrible problems with divergences and low stepsize / high treedepth / slow wall time. After a lot of sleuthing (which has been bogged down by some widespread Stan + R 4.0 issues) it looks like the culprit is our old friend, unreasonably high egg-to-smolt survival. Here are some quick and dirty plots. The posterior of hyper-mean maximum survival mu_psi is slammed up against 1:

The population-specific values mostly follow the hyper-mean except for Duncan_Channel, which appears to be caught between the hyperdistribution and a rather different local mode:

To get a sense of where this is coming from, we can compute an egg-to-smolt survival index using the estimated states and parameters: M0 / E_hat, where M0 refers to smolts by brood year rather than calendar year and E_hat = 0.5 * f * S for age-weighted fecundity f, assuming a 1:1 sex ratio. (The _hat is because, again, we can't know whether process errors come from E or from s_EM, as this index would imply.) Sure enough, there are many cases where smolts exceed the potential egg deposition. This is true even in populations where the estimated states are constrained by local smolt observations (Duncan_Channel, Hamilton_Channel, Hardy_Creek, Grays_CJ, and Grays_MS + Grays_WF combined). Duncan_Channel is the only consistently reasonable-looking one, which explains that lower local mode in psi:

Considering just those six populations with smolt traps, below are the brood years where posterior median smolt output exceeds posterior median expected egg deposition. Because the observation priors are so tight, the posterior medians of S and M0 are close to S_obs and M0_obs (where available), respectively, so this isn't just some pathology in the states. Posterior median age-weighted fecundity f is perfectly reasonable, so we shouldn't be systematically underestimating eggs. Some S_obs are very small, but not consistently so.

               pop year S_obs tau_S_obs   S    f   E_hat  M0_obs tau_M0_obs      M0 M0_div_E_hat
1         Grays_MS 2008   721     0.184 794 2797 1109721 1288376      0.047 1271919          1.1
2 Hamilton_Channel 2007    47     0.082  48 2807   67747      NA         NA   71389          1.0
3 Hamilton_Channel 2009    94     0.029  94 2748  128942      NA         NA  142811          1.0
4 Hamilton_Channel 2012   137     0.170 163 2845  232729  459369         NA  335036          1.4
5      Hardy_Creek 2007    12     0.192  14 2771   19392      NA         NA   22947          1.1
6      Hardy_Creek 2008     3        NA   7 2766    9727      NA         NA   13541          1.1
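The last column can be reproduced directly from the table (small discrepancies in E_hat reflect rounding of the posterior medians shown). For the first row:

```python
# Grays_MS brood year 2008, posterior medians from the table above
S, f, M0 = 794, 2797, 1271919

E_hat = 0.5 * f * S     # potential egg deposition, assuming a 1:1 sex ratio
ratio = M0 / E_hat

print(round(ratio, 1))  # -> 1.1: estimated smolts exceed potential egg deposition
```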

The spawner-recruit pairs that give rise to these cases with M0 >= E_hat (filled circles) don't seem to differ systematically from the rest (1:1 line is shown):

So unless I'm seriously missing something -- which is possible, after squinting at this stuff for so long -- it appears there are quite a few cases (even in populations with high-quality smolt data, like Hamilton_Channel) where outmigrants approach or exceed estimated egg deposition based on spawner abundance and weighted fecundity. Of course, as I said at the outset, this doesn't imply spontaneous generation of fish; it's entirely possible egg deposition exceeded expectations in those years. I looked for patterns related to sex ratio, but there's no discernible relationship between recruitment process errors and proportion female spawners. Still, the fact that smolt output is even close to egg deposition seems wildly implausible, given the range of survivals @tbuehrens mentions above. Seems like I must be doing something wrong...and yet, there are the data.

I'd love to hear any input or suggestions; this is making me feel a bit crazy. :face_with_head_bandage:

ebuhle commented 3 years ago

I should point out that the results above are from the Beverton-Holt. As you'd expect, things are better (the survival intercept psi is lower) with the Ricker, but not enough to change the patterns at issue. If anything, the M0 / E_hat ratios for the brood years shown above are slightly higher.

ebuhle commented 3 years ago

OK, we're finally getting somewhere! I figured out a major, if not the only, source of the trouble: a subtle but fatal error in downstream_trap, the index described here that maps smolts from Grays_CJ and Grays_WF onto the corresponding outmigration years of Grays_MS where they are counted (or double-counted, in the case of Crazy Johnson). The problem was that I had assigned values of downstream_trap before subsetting the data, so the row indices no longer referred to the right cases. That's fixed now.

Things look much better, at least with the Ricker model. Here's mu_psi, the hyper-mean of density-independent maximum egg-to-smolt survival:

There is still some probability mass pushing up against 1, which manifests as an extremely long right tail on the logit scale. That's probably why the model still has tons of divergences and low stepsize / high treedepth. It's a small enough mass that I would be willing to consider a non-flat prior that just regularizes mu_psi away from the boundary, but the M0 / E_hat ratios suggest that some deeper problems remain:

Now there are only two cases where posterior median M0 exceeds posterior median E_hat, but again they both come from populations with smolt data. As the plots above show, there are many other cases where the 90% credible interval of the ratio exceeds 1, and even the central tendency of mu_psi seems too high (right tail notwithstanding).

               pop year S_obs tau_S_obs    S    f   E_hat  M0_obs tau_M0_obs      M0 M0_downstream M0_div_E_hat
1         Grays_MS 2018  3010      0.20 3025 2665 4029416 5278183      0.066 4595405       5055980          1.1
2 Hamilton_Channel 2012   137      0.17  162 2843  230020  459369         NA  296429        296429          1.2

So even after fixing the downstream_trap issue on the modeling side, it seems like there are still some perplexing data-driven patterns. I may go ahead and try a regularizing prior on mu_psi to see if it resolves the numerical problems (which presumably will be even worse with the Beverton-Holt) but any thoughts / input / reality checks would be welcome.
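For context on what a regularizing prior would do here, a minimal sketch (the hyperparameters are hypothetical, not a recommendation): a flat Beta(1,1) prior puts as much mass at the boundary as anywhere else, while even a mild Beta(2,2) sends the density to zero as mu_psi approaches 1.

```python
from math import gamma

def beta_pdf(x: float, a: float, b: float) -> float:
    """Beta density, the natural prior family for a (0,1) parameter like mu_psi."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * x ** (a - 1) * (1 - x) ** (b - 1)

print(beta_pdf(0.99, 1, 1))  # flat prior: density 1.0 right up to the boundary
print(beta_pdf(0.99, 2, 2))  # mild regularization: density ~0.06 near the boundary
```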

tbuehrens commented 3 years ago

Before you go too far down the rathole on this, let's chat...I'm not sure I totally have my head around the problem, and I'll be able to point in more constructive directions after we chat and look at the model structure together.

ebuhle commented 3 years ago

I've been too far down this rathole, but yes, let's chat.

tbuehrens commented 3 years ago

hi @ebuhle I was just talking with @Hillsont and I re-acquainted myself with the equations below (screenshot of version knit on 12/16). It was a bit difficult to track your equations in this issue above. I am assuming that per the earlier discussion in this issue, you switched density dependence to occur during egg-to-smolt survival (s_EM). However, looking at your equation

M = E s_EM exp(eta_year_M + epsilon_M)

I have a couple of questions: 1) I saw you were having issues with the logit transform, but this (new) parameterization allows smolts to be greater than eggs because although s_EM is constrained to 0 to 1, the error terms are not and can therefore more than offset the <1 survival, resulting in M > E. I think you want these error terms to be logit constrained (or constrained by another appropriate zero-to-one link function).

A more interesting question is why the logit transform was causing problems...I don't fully understand what eta_year_M and epsilon_M are, but I'd think you would want at least a year-specific error term and a random error term here. The pop error term is already accounted for in s_EM due to differing productivity and capacity.

Finally, also wonder if data are an issue. It is simple enough when working with one population. I worry in our context that it is messy with wild fish being transferred between populations. I got way more into the weeds than I intended to and I harassed @Hillsont into helping me make a disposition clarification document that I uploaded in the data. I think it may be worthwhile comparing the data after processing in R for the model with hand manipulated data in the csv's for a few years and pops just to make sure that is being done right. @kalebentley please take a look at my disposition doc to make sure we don't have any disagreement.

EDIT: 1) @Hillsont pointed out that the "LCRchumIPM_analysis.html" and pdf versions have not been re-knit in a couple of months. It would be helpful to do so, because it is easier to track the equations and variable names. 2) Todd wondered if the fecundity estimates have been updated to filter out partially spawned fish. In the last knitting of the pdf, the partially spawned fish were clearly not filtered out. I created a quick exploratory analysis R file ("TEMP_explore_fecundity_data.R") to look at this. As you can see (bottom image), fecundity is higher for the subset of fish with Reproductive Effort > 16 (a rule of thumb @Hillsont can tell us more about). He will go back and look for the source on this, but if that filter has not been applied, it could account for part (but not all) of our problem with low egg abundance estimates.

image

image

ebuhle commented 3 years ago

Hey @tbuehrens, thanks for weighing in. I feel like your perspective here could be uniquely valuable in helping to make sure the data and model are consistent.

I re-acquainted myself with the equations below (screenshot of version knit on 12/16). It was a bit difficult to track your equations in this issue above. I am assuming that per the earlier discussion in this issue, you switched density dependence to occur during egg-to-smolt survival (s_EM).

That's right, I modified the spawner-egg-smolt transition(s) as shown in the OP in this thread, but the vignette has not been updated. Sorry about that; I've been bogged down iterating on the code and data and haven't prioritized updating the description until we've got something stable.

I have a couple of questions: 1) I saw you were having issues with the logit transform, but this parameterization allows smolts to be greater than eggs because although s_EM is constrained to 0 to 1, the error terms are not and can therefore more than offset the <1 survival, resulting in M > E.

Correct, as I argued above, I think the multiplicative lognormal error structure on resultant smolt recruitment is more realistic than errors confined to logit survival because it's agnostic about whether the deviations from the expected relationship come from egg deposition or egg-to-smolt survival (or more likely both). I don't see any particular reason to think egg production is deterministic and only survival is stochastic. So yes, as I noted and as you point out, the index M0 / E_hat can exceed 1 in principle because it implicitly shoves all the stochasticity into M0 (since E_hat is deterministic). The perplexing part, though, is that the ratio would be anywhere close to 1 in general -- and, more definitively and more to the point, that the baseline survival parameters themselves (mu_psi and pop-specific psi) would be anywhere near 1.

FWIW, I tried a regularizing prior on mu_psi in the Ricker model, and it actually didn't help at all with the numerical issues. While it effectively squashed mu_psi values > 0.9, it did not prevent some of the pop-level psi values from pushing up against 1 -- another indication that those high values are driven by local patterns in the data such as the Hamilton Channel 2012 case shown above.

A more interesting question is why the logit transform was causing problems...

That is a good question, but it seemed like confining the stochastic terms to (0,1) was too restrictive -- at any rate, the sampler behavior was really dire, even worse than what I'm showing for the version above. It makes sense to me that you'd want to allow process errors > 1, e.g., to account for egg deposition exceeding expectations. But you still wouldn't necessarily get M0 exceeding E_hat, if psi were more reasonable. Quite possibly the real issue with the logit-normal process errors is that they can bump an already-too-high psi even higher, hence further into that long tail on the logit scale that kills the sampler performance.

i don't fully understand what eta_Year_M and epsilon_M are, but I'd think you would want a at least a year-specific error term and a random error term here. The pop error term is already accounted for in s_EM due to differing productivity and capacity.

Yes, that's exactly right: eta_year_M is the shared year-specific process error, epsilon_M is the unique process error, and population-level variation is accounted for by the hyperdistributions of psi and Mmax. The names of the error terms are analogous to the ones in the screenshot, except they act on smolt output (_M) instead of only on survival (_EM).

Finally, also wonder if data are an issue. It is simple enough when working with one population. I worry in our context that it is messy with wild fish being transferred between populations. I got way more into the weeds than I intended to and I harassed @Hillsont into helping me make a disposition clarification document that I uploaded in the data. I think it may be worthwhile comparing the data after processing in R for the model with hand manipulated data in the csv's for a few years and pops just to make sure that is being done right. @kalebentley please take a look at my disposition doc to make sure we don't have any disagreement.

This is what I've been wondering too. I'll take a look at that disposition document, and I agree some spot checking would be useful.

  1. Todd wondered if the fecundity estimates have been updated to filter out partially spawned fish. In the last knitting of the pdf, the partially spawned fish were clearly not filtered out. I created a quick exploratory analysis R file ("TEMP_explore_fecundity_data.R") to look at this. As you can see (bottom image), fecundity is higher for the subset of fish with Reproductive Effort > 16 (a rule of thumb @Hillsont can tell us more about). He will go back and look for the source on this, but if that filter has not been applied, it could account for part (but not all) of our problem with low egg abundance estimates.

Yes, that filter has been applied here:

https://github.com/mdscheuerell/chumIPM/blob/9f5015bb1b727a43f9ffbb7504d8e2315ab5a430/analysis/R/LCRchumIPM_analysis.R#L204-L210

There's a separate issue about "partial spawners" in Duncan Channel that @kalebentley explored here, which would cut in the opposite direction if we decided to incorporate it.
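For readers without the repo handy, the gist of that filter (column names here are hypothetical; the real implementation is the R code at the link above) is to drop partially spawned females before computing mean fecundity at age:

```python
# Hypothetical fecundity records; "RE" stands in for the Reproductive Effort metric
fish = [
    {"age": 3, "fecundity": 2500, "RE": 18.0},
    {"age": 3, "fecundity": 1100, "RE": 9.5},   # partial spawner, excluded
    {"age": 4, "fecundity": 3100, "RE": 20.1},
    {"age": 4, "fecundity": 2900, "RE": 17.2},
]

# Keep only fully spawned fish, per the RE > 16 rule of thumb
kept = [f for f in fish if f["RE"] > 16]

mean_fec_by_age = {
    age: sum(f["fecundity"] for f in kept if f["age"] == age)
    / sum(1 for f in kept if f["age"] == age)
    for age in {f["age"] for f in kept}
}
print(mean_fec_by_age)  # the partial spawner no longer drags down age-3 fecundity
```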

tbuehrens commented 3 years ago

@ebuhle wow! blazing fast reply to a post that took me an hour or 90 minutes on the phone with Todd to dig into and develop! A couple of points:

1) how do i do the cool "copy of your text in gray"?

2) Now i understand your logic re: multiplicative vs logit constrained errors. A couple thoughts on this:

a) the age-specific fecundities already incorporate SOME error (in actuality, related to the sample size of fish of each age). Theoretically, we could develop a hyper-distribution for the mean size at age across years if we had more years of hatchery data to get a measure of year-to-year variability in the means at each age. Practically, we can probably put some pretty tight bounds on this from the literature. Size/fecundity at age simply should NOT be as variable as egg-to-fry survival!!!! In lieu of this, I think we either 1) COULD treat fecundity as deterministic (no year or pop errors), which I think is OK because we already have spawner abundance observation error; this would allow mean fecundity at age (with sampling-related uncertainty) to basically serve as an offset for spawner abundance, or 2) put in a heavily constrained (lognormal) error term on spawner-to-egg survival, which would basically have the function of simulating unknown year-to-year variation in size/fecundity at age. Either option would allow us to move back to logit errors for s_EM (which I think is a good thing, for the reason in (b) immediately below).

b) Even using your logic of log-errors, the problem in my mind is that there is no upper bound to those errors, so even in a case where we know the max possible fecundity, theoretically it is possible for the error to account for spawner-to-egg and egg-to-fry errors but result in a value that is impossibly high (say, 10,000 eggs per spawner). Now the rest of the model (and data) won't allow it to go there, but I don't think it's good practice to leave it unconstrained.
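To put a number on point (b) (the values here are illustrative, not from the model): if a large lognormal process error were attributed entirely to the egg side, the implied fecundity could blow past any biological maximum.

```python
import math

f = 2800.0   # a reasonable mean fecundity (eggs per female)
eps = 1.5    # a large but finite lognormal process error

# If the whole deviation were interpreted as extra egg deposition:
implied_fecundity = f * math.exp(eps)
print(round(implied_fecundity))  # -> 12549 eggs per female: impossibly high
```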

Happy to discuss data issues once you and @kalebentley have looked at the disposition doc

mdscheuerell commented 3 years ago

Hi Thomas,

I'll answer the easy one here:

how do i do the cool "copy of your text in gray"?

You specify quotations in Markdown by preceding the text with >, such that

> This is a quote.

will render to

This is a quote.

ebuhle commented 3 years ago
  1. how do i do the cool "copy of your text in gray"?

You select it and do "quote reply" in the upper right menu. :sunglasses:

a) the age specific fecundities already incorporate SOME error (in actuality, related to the sample size of fish of each age). Theoretically, we could develop a hyper distribution for the mean size at age across years if we had more years of hatchery data to get a measure of year to year variability in the means at each age. Practically, we can probably put some pretty tight bounds on this from the literature.

Yeah, I think we've talked on here about the possibility of incorporating some hierarchical variation (and maybe even "observation error") in the hatchery female fecundity data. The problem, as you point out, is that there aren't many years to work with, plus it's not obvious how or whether to map specific hatcheries onto specific wild populations.

Size/fecundity at age simply should NOT be as variable as egg to fry survival!!!!

Sure, but "eggs in gravel" is another matter, subject to superimposition and gravel quality and flow, etc., no? I guess it's somewhat of a metaphysical question of whether we mean eggs in gravel or eggs in body cavity (which, granted, is what I wrote in the OP, and is more consistent with the nature of the fecundity observations).

In lieu of this, I think we either 1) COULD treat fecundity as deterministic (no year, pop errors), which I think is OK, because we already have spawner abundance observation error, and this would allow us to basically have mean fecundity at age (with sampling related uncertainty) serve as an offset for spawner abundance

Yup, this is the way I initially tried it, and sampling was even more hopeless than the version shown above.

or 2) put in a heavily constrained (lognormal) error term on spawner-to-egg survival, which would basically have the function of simulating unknown year-to-year variation in size/fecundity at age.

I think this would result in nonidentifiability because the errors in E would be perfectly confounded with the corresponding unique errors in s_EM. Or to put it another way, we can't know for sure whether process errors are coming from egg deposition or egg-to-smolt survival.

b) Even using your logic of log-errors, the problem in my mind is that there is no upper bound to those errors, so even in a case where we know the max possible fecundity, theroretically it is possible for the error to account for Spawner to egg and egg to fry errors, but result in a value that is impossibly high (say, 10,000 eggs per spawner). now the rest of the model (and data) won't allow it to go there, but I don't think it's good practice to allow it to be unconstrained.

I can see this argument, but another way to look at it is that the product of density-independent E and density-dependent s_EM is just like any other spawner-recruit function, and lognormal errors around the combined function would be a standard approach.

Regardless, I don't really have strong feelings about this -- I'd be happy to make E deterministic and only s_EM stochastic, but that only worsens the numerical problems that we're trying to resolve. I think this whole issue of error structure is a bit of a red herring (sorry, wrong study system) from what's going on with the mean relationship, in particular why psi wants to be so high. If we could resolve that, then presumably either error structure would work, computationally speaking.

ebuhle commented 3 years ago

One more thing, just to sharpen my point about fecundity: the posterior median age-weighted fecundity f shown in the tables above seems perfectly reasonable. Yet in the case of Hamilton_Channel brood year 2012, the observed smolt outmigration is double the expected (deterministic) egg production! The process model actually pushes S up and pulls M0 down relative to the data, but it still can't get M0 / E_hat below 1.
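Concretely, from the posterior medians in the second table above (E_hat there is 230020, computed from the unrounded posteriors):

```python
S, f = 162, 2843      # posterior median spawners and age-weighted fecundity
M0_obs = 459369       # observed smolt outmigration, Hamilton_Channel BY 2012

E_hat = 0.5 * f * S   # ~230283 here; the table shows 230020 from unrounded values
print(round(M0_obs / E_hat, 1))  # -> 2.0: observed smolts double the expected eggs
```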

The smolt observation for Grays_MS 2018 is harder to interpret, because that trap also counts smolts from upstream pops (which the model accounts for using the downstream_trap construct). But there are plenty of other cases in pops with local smolt traps that are borderline, even if the median M0 / E_hat may be < 1.

Hillsont commented 3 years ago

I may be able to shed some light on Hamilton Springs. Hamilton Springs is a constructed spawning channel (mid 1980s) that didn't have much, if any, heavy maintenance until summer/fall 2011. USFWS staff did try to control the reed canary grass in the early 2000s, when they were doing adult and juvenile monitoring in the channel, by pulling/cutting it. Their monitoring work in the channel stopped in 2008 and so did any control of the reed canary grass. Over the next couple of years the reed canary grass quickly colonized most of the spawning channel gravel and became so thick that essentially any adult that died in the channel was unlikely to drift or "bloat and float" out of the channel and into Hamilton Creek. In 2011, the existing spawning gravel and reed canary grass were completely removed and new spawning gravel was placed, along with toe logs and measures to prevent re-infestation. In addition to that, a new 285' long extension/arm was added to the spawning channel. The result of the "cleaning" and added arm was dramatically increased flow and water velocity within the channel, and more adults that actually spawned and died in the channel floating out into Hamilton Creek. Since we use carcass tagging as the estimation method in the channel, we believe this resulted in an under-estimation of adults that spawned in the channel and an over-estimation of adults in Hamilton Creek due to "wash-outs" from the spring channel.

Realizing what was happening and how it would impact our adult-to-outmigrant estimates (skewed male/female ratio plus unknown carcass washout rate), in 2013 we installed a live/carcass weir (IMG_0515.jpg) to improve our adult abundance estimates by preventing carcasses from washing out. The weir was also used to capture live fish for tagging and broodstock collection, hence in the pic there's no upstream passage possible. The pvc bars at the top/apex of the weir are removable, allowing adult passage into the channel, and are only in place when we want to trap live adults. The lower panels are placed so that any carcasses that passed through the center of the upper v-weir when in open/not-trapping status would be "captured" on the lower panel. It worked, but we saw opportunities to improve. In 2014 we extended the entrance panels in an effort to prevent live fish from exiting the channel (IMG_0952.jpg), and eventually installed a panel of "fingers" at the top of the entrance to try to prevent adults from leaving the channel once they entered and to improve broodstock collection prospects when the upper weir was "closed". Since then we've continued to try to make improvements at the adult weir to get accurate estimates of spawners between Hamilton Channel and Hamilton Creek.

image

image

Long story short, assuming a 50/50 sex ratio in Hamilton Springs Channel may not be a good idea and adult estimates in Hamilton Springs before we corrected the carcass “wash-out” problem after the 2011 maintenance/rebuild may not be reliable.

image

Thanks, TH

ebuhle commented 3 years ago

Thanks @Hillsont, super informative as usual! The photos are helpful. Sounds like this could very well be our culprit:

The result of the “cleaning” and added arm was dramatically increased flow and water velocity within the channel and more adults that actually spawned and died in the channel floating out into Hamilton Creek. Since we use carcass tagging as the estimation method in the channel, we believe this resulted in an under-estimation of adults that spawned in the channel and over-estimated the adults in Hamilton Creek due to “wash-outs” from the spring channel.

Realizing what was happening and how it would impact our adult to outmigrant estimates (skewed male/female plus unknown carc washout rate), in 2013 we installed a live/carcass weir (IMG_0515.jpg) to improve our adult abundance estimates by preventing carcs from washing out.

Underestimating spawners and thus eggs is consistent with the spike in apparent survival in brood years 2011-2012 with M0/E_hat above 1, before settling into much more believable values from 2013 to the end of the series. The ratio is somewhat high before 2011 too, especially for a RCG-infested channel, but it's also not constrained by smolt data from 2006-2009.

Long story short, assuming a 50/50 sex ratio in Hamilton Springs Channel may not be a good idea and adult estimates in Hamilton Springs before we corrected the carcass “wash-out” problem after the 2011 maintenance/rebuild may not be reliable.

Now that you remind me, Hamilton_Channel was one of the ones that stood out when we talked about unequal sex ratios last year. I did look at sex ratio in relation to these process errors in apparent egg-to-smolt survival, but it was a total shotgun (no discernible pattern).

Throwing out 2 years of spawner data corresponding to one of the most complete and "direct" of our 5 smolt time series is painful, but if we must, so be it. I can at least try it and see how the estimates change. Fingers crossed!

tbuehrens commented 3 years ago

Thanks Eric and Mark re how to paste quoted text!

Eric---agree errors are likely red herring relative to means! And your arguments make sense...a bit semantic which error structure we choose. This makes me want to focus on the means...wondering if more data cleanup (remove biased years!) and disposition stuff I mentioned earlier would help???

tbuehrens commented 3 years ago

@Hillsont your post re: Hamilton was very informative. Now would be a great time to let us know if there are any other years/locations with sketchy/likely biased abundance estimates. Uncertainty is fine (we have a way to deal with it). Unknowns are fine (we can impute). But bias is problematic if it is non-negligible. I usually think about this like a stoplight: green (no known problems), yellow (possibly non-negligible bias/unsure), red (likely non-negligible bias).

Hillsont commented 3 years ago

There's a couple other years and locations that I have marked as concerns or oddballs in my population summary sheet.

Grays MS (2009), Hardy Creek (2008, 2015, and 2017), Horsetail (2005 and 2009), and St Cloud (2007, 2008, and 2012). These are all identified in the data set by either NAs or method (Min #). The 2015 Grays MS estimate is included in the data set, though I have a comment noting that I had to pool from seven periods down to only three, and that environmental conditions were poor.

We've known for a long time that movement of carcasses (washouts) in the Grays could be an issue with reach specific estimates. Especially between Grays WF and Grays MS. A lot of the chum spawning activity in the WF Grays occurs in the lower 1/2 mile or so, between the mouth of CJ and the mouth of the WF. During high or moderate flow events there can be carcass movement from the WF into the MS. I don't believe this biases the WF estimate, no reason to think tagged and untagged carcasses would washout at different rates. It can though result in the same issue I mentioned above about Hamilton Springs and Hamilton Creek, fish that spawned and died in the WF get included in the MS spawning population estimate.

Something that may eventually need to be looked at is partitioning the MS Grays estimate into above and below the smolt trap. In most years the vast majority of spawning takes place above the trap site but there have been years (big returns) when it's possible that several hundred adults spawned below the trap.

On the same note, the juvenile trapping the USFWS did in Hamilton Springs was also not below the whole spawning population. It was below probably 95-99% depending on the year. The section below the trap site was more of an access reach to the actual spawning channel. We'd have to go back and look at their survey data to partition spawners if we wanted to. WDFW is juvenile trapping at the same site as the USFWS did, but the habitat work done in Hamilton Creek in 2010 changed the channel just below the trapping site from being part of the channel to being part of the creek, so any spawners in that area have been included in the Creek estimate, not the Channel estimate. It's hard to explain clearly, so pictures again.

First the change in 2010. The red line is Hamilton Channel, blue line is what was also channel before 2010, green line shows the cut they made and LWD jam that resulted in the lower area being considered creek instead of channel. Little black line is juvenile trap location. [image]

Added this one to show the additional "arm" (blue line) that was added to Hamilton Springs Channel in 2011 [image]

ebuhle commented 3 years ago

Well, I re-fit the Ricker model with S_obs and tau_S_obs for Hamilton_Channel brood years 2011-2012 changed to NA. It made a noticeable improvement in the estimates of psi (especially for the population in question) and mu_psi, but some squirrelly estimates and numerical pathologies remain.

Focusing on populations with smolt monitoring data, whether direct or indirect, it looks like the most glaring remaining anomaly is Grays_MS brood year 2018. This could potentially be coming from Grays_MS or Grays_WF, since their smolt counts are commingled, and/or from Grays_CJ, although it is also constrained by its own trap. That makes the numbers below a bit confusing to interpret, but what stands out to me is that

  1. estimated smolts M0 in Grays_CJ are substantially lower than the point observation M0_obs, which could therefore be overestimating the contribution from the other two pops.
  2. the Grays_MS trap count M0_obs is close to the estimated state M0_downstream (which includes smolts from the upstream pops), so this seems to be another data-driven pattern.
  3. I don't understand why M0 != M0_downstream for Grays_WF -- they should be identical except for traps that collect upstream as well as local smolts. This makes me think something is still off with downstream_trap, which I'll investigate.
       pop year S_obs tau_S_obs    S    f   E_hat  M0_obs tau_M0_obs      M0 M0_downstream M0_div_E_hat
1 Grays_CJ 2018   899     0.024  899 2673 1201613  276500      0.365  471998        471998         0.38
2 Grays_MS 2018  3010     0.200 3124 2674 4177990 5278183      0.066 4579083       5051081         1.09
3 Grays_WF 2018  2847     0.064 2821 2687 3793543      NA         NA 1603014         25174           NA

Looking at the psi posteriors, Grays_WF is actually the more problematic one. Any caveats about the data for any of these cases?

It's also possible that SAR may be playing a role here. If the Grays system has higher SAR than the upriver populations because it's closer to saltwater, then the hyperdistribution of s_MS could be pulling the local values down and thus inflating freshwater survival. If this is true, then including distance from the Columbia mouth as a covariate like we talked about in #7 might help, but this is all speculative -- especially since FW survival for these pops is constrained by smolt data.

[EDIT: Sorry, I was writing this while @Hillsont posted above.]

tbuehrens commented 3 years ago

thanks @Hillsont. I'll let @ebuhle weigh in here, but I'd guess that it may be worthwhile adding a field to the disposition table that would denote questionable estimates...that way Eric can easily filter those years/estimates if needed.

tbuehrens commented 3 years ago

@ebuhle Looking better. I thought about the SAR issue you mention but since there is already a year and pop effect on SAR (and a random residual), it seems like it shouldn't be "pulling" too much down on SAR for Grays...agree that covariates will further mitigate whatever pulling there is. Nonetheless, I remain convinced the three most promising pathways are:

  1. identifying and possibly censoring likely biased escapement or smolt estimates
  2. continuing to scour for coding issues (a la the downstream_trap issue you mentioned)
  3. making sure that the way we are summing up total spawners based on the observations matches the disposition cheat sheet I pushed yesterday

ebuhle commented 3 years ago
  1. I don't understand why M0 != M0_downstream for Grays_WF -- they should be identical except for traps that collect upstream as well as local smolts. This makes me think something is still off with downstream_trap, which I'll investigate.

Quick update on this. The problem wasn't downstream_trap, it was just my summary table. M0_downstream for Grays_WF should be NA. But that raises another, potentially very significant issue that had escaped my attention until now. The Grays_CJ and Grays_MS time series go through calendar year 2019 (with smolts only; spawners are missing) but Grays_WF stops in 2018. That means the model is not estimating outmigrants M for Grays_WF in 2019, which means they can't be added to the "downstream trap" total for Grays_MS, leaving the other two Grays populations to try to make up the difference to fit M_obs at the mainstem trap. Which could certainly explain the anomaly!

I hadn't thought of this scenario when designing the downstream_trap construct, but clearly any pops whose smolts are pooled need to have fully matching rows in the data. I'll pad Grays_WF accordingly and see what happens.

I thought about the SAR issue you mention but since there is already a year and pop effect on SAR (and a random residual), it seems like it shouldn't be "pulling" too much down on SAR for grays...agree that covariates will further mitigate whatever pulling there is.

Yeah, I haven't looked closely at SAR for the respective populations, but I tend to agree. If there weren't local smolt data to constrain the FW stage, that'd be a different matter.

3) making sure that the way we are summing up total spawners based on the observations matches the disposition cheat sheet I pushed yesterday

I took a quick look at this yesterday and I think it matches the way we're processing the data. I'll give this and @Hillsont's post above a closer look later on when I have more brain cells available.

tbuehrens commented 3 years ago

Sweet. Thanks.

Re which years the model estimates, I would have thought that it wouldn't have pop specific start or end dates...instead it would have year or pop specific availability of observations. In fact, this must be the case since we are forecasting with it. So why can't we impute whatever is missing? Not sure I'm understanding

Regardless, Todd and Kale should be able to provide more recent estimates for Grays_WF, i'd think.

Hillsont commented 3 years ago

Need 2018 Grays WF adults? 2,902 was the estimate

tbuehrens commented 3 years ago

We should be developing a more automated process for updating the data tables used in the IPM

Hillsont commented 3 years ago

I looked at the 2018 tagging data and didn't see anything that would make me think the Grays MS, WF, or CJ adult estimates would be biased. It does look like there was a pretty severe storm/flow event in late November, right around normal peak die-off time (10 days between survey periods), and we had mostly off-diagonal recoveries in the MS for tags put out before the storm (seven of eight recoveries); the rate of off-diagonal recoveries in the WF was not as high, but there were still several (7 of 19).

Hillsont commented 3 years ago

@tbuehrens, the Grays WF 2018 spawner estimate was in the data set (check the file you and I messed around in yesterday). Not sure what happened.

ebuhle commented 3 years ago

Re which years the model estimates, I would have thought that it wouldn't have pop specific start or end dates...instead it would have year or pop specific availability of observations. In fact, this must be the case since we are forecasting with it. So why can't we impute whatever is missing? Not sure I'm understanding

salmonIPM estimates states for each row in the data, so if we want Grays_WF 2019, it needs to have a corresponding row. Forecasting, as we're doing it currently, works the same way: we pad the data with future rows.
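That padding is mechanical enough to sketch. A minimal illustration in Python/pandas (the real pipeline is in R; the column names and the `pad_pop_years` helper are hypothetical, patterned on this thread):

```python
import pandas as pd

def pad_pop_years(fish_data, pops, years):
    """Ensure every pop x year combination has a row, inserting all-NA
    observations where data are missing so the model still estimates
    states (and hence outmigrants) for those cases."""
    full_index = pd.MultiIndex.from_product([pops, years], names=["pop", "year"])
    return (fish_data.set_index(["pop", "year"])
                     .reindex(full_index)   # missing combos become all-NA rows
                     .reset_index())

# Example: Grays_WF is missing 2019, so its smolts can't be added to the
# downstream Grays_MS trap total until the row exists.
dat = pd.DataFrame({"pop":   ["Grays_MS", "Grays_MS", "Grays_WF"],
                    "year":  [2018, 2019, 2018],
                    "S_obs": [3010.0, None, 2847.0]})
padded = pad_pop_years(dat, ["Grays_MS", "Grays_WF"], [2018, 2019])
# padded now has 4 rows; the Grays_WF 2019 row is all NA
```

The same mechanism covers forecasting: pad the data with future rows and the model produces states for them.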

Need 2018 Grays WF adults? 2,902 was the estimate

Thanks @Hillsont, but 2019 is the missing year. That won't affect the 2019 outmigration or our ability to fix the issue I flagged, though, since we do have spawners in 2018. That number, BTW, is 2847 according to Data_ChumSpawnerAbundance_2019-12-12.csv. Maybe the discrepancy comes from different summary statistics of the observation prior, although the data file reports the mean, so it should be higher than the median.

We should be developing a more automated process for updating the data tables used in the IPM

I agree, this would be a good thing to put on the board at some point. @kalebentley and I have talked about it, but didn't come to any resolution.

Hillsont commented 3 years ago

My bad @ebuhle, I forgot the data set we provided includes only "Adult" estimates, even in years when the selectivity test indicated we needed to run males and females separately. 2,902 is the combined male and female estimate and 2,847 is the Adult estimate.

The reason there's no 2019 adult data had to do with when I pulled all the spawner data together. The 2019 juvenile season had been completed but the 2019 adult return hadn't happened yet and we haven't updated any of the data files since they were sent to you by @kalebentley back at the start.

I also agree a streamlined, automated process would be nice. The problem is that not all of the data is in one place: some of it is scattered across corporate databases (TWS, JMX, and FishBooks) and some only lives in spreadsheets because the corporate databases don't have fields to store the data, e.g. individual bio-data and fecundity estimates for broodstock fish, or the raw adult return data by sex, age, and origin. To head off an expected comment, yes, origin is a field in TWS. However, it's based on clip status (HOR chum are not ad-clipped) and there are no fields for PBT or otolith decoding results. That will change when TWS retires and gets replaced by the new SG database, if Danny has anything to say about it. My expectations for the new version of FishBooks are not very high, though.

tbuehrens commented 3 years ago

@Hillsont understood re automation...setting up processes takes time...but good to push on the BDS team to build databases we need rather than work on some of the unfunded crap they have been.

@ebuhle INTERESTING re: states only for pops with data for that year, and forecasts for padded years. I have done stuff like this too (forecasting and hindcasting in the generated quantities block), but I have never had separate state end years for a multivariate time series model, because you're generating the multivariate process errors for all pops for all years in which at least one pop needs a state estimated...I haven't seen the IPM Stan code in a while, and the last time I dug around in the repo I didn't locate it easily....is there a way to get eyes on it without sending it to BPA consultants to steal? (hehe)

ebuhle commented 3 years ago

Thanks for the clarification on the escapement estimate, @Hillsont.

I have done stuff like this too (forecasting and hindcasting in the generated quantities block), but I have never had separate state end years for a multivariate timeseries model because you're generating the multivariate process errors for all pops for all years that at least one pop needs a state estimated.

Yeah, the way it works in salmonIPM is that any multivariate or shared states (e.g., common annual process errors) are generated for the full range of years and then the appropriate values are picked off for each row in the data. It's sort of like standard R fitting functions in that sense -- you get an estimate / prediction for each case in the input, and not-actually-observed cases are handled through a newdata-like construct.
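A toy version of that "generate shared states for all years, then pick off per row" pattern (Python sketch; the names and shapes are illustrative, not salmonIPM's actual internals):

```python
import numpy as np

rng = np.random.default_rng(1)

# Full range of years spanned by any population
years = np.arange(2002, 2020)

# Shared annual process errors, generated once over the full range
eta_year = rng.normal(0.0, 0.5, size=len(years))

# Each data row carries a year; the row-level error is picked off by index,
# so rows (pops) sharing a year share the same annual error.
row_years = np.array([2018, 2019, 2018])   # e.g. Grays_MS, Grays_MS, Grays_WF
row_eta = eta_year[np.searchsorted(years, row_years)]

assert row_eta[0] == row_eta[2]   # same year, same shared error
```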

There is another forecasting mechanism that I built for a different salmonIPM model, but haven't implemented in IPM_LCRchum_pp yet. That approach involves passing a separate "future data" object which is then handled in generated quantities. Unlike the main fish_data argument, it allows duplicated pop x year cases, which is useful for evaluating alternative scenarios (e.g. different harvest control rules or hatchery policies) with the same set of posterior draws.

I haven't seen the IPM stan code in a while and the last time I dug around in the repo, I didn't locate it easily....is there a way to get eyes on it without sending it to BPA consultants to steal? (hehe)

LOL. I just sent you an invite to the salmonIPM repo, so you can poke around in the Sensitive Compartmented Information Facility to your heart's content.

EDIT: I should point out that the current stuff is in the develop branch. Just to further throw BPA off the scent.

ebuhle commented 3 years ago

There's a couple other years and locations that I have marked as concerns or oddballs in my population summary sheet.

Grays MS (2009), Hardy Creek (2008, 2015, and 2017), Horsetail (2005 and 2009), and St Cloud (2007, 2008, and 2012). These are all identified in the data set by either NAs or method (Min #). The 2015 Grays MS estimate is included in the data set, I have a comment though that I had to pool from seven periods down to only three and poor environmental conditions.

Thanks again for all this detailed info, @Hillsont. I had actually flagged St_Cloud (2002, 2007, 2008) as being anomalously low compared to the surrounding years, and made a note to ask about it. I also noticed those spawner observations don't have associated uncertainty estimates, which I gather is due to the method (Min #).

I'm loath to start designating "outliers" willy-nilly, especially given the modest overall sample size, but it's good to keep these in mind as we interpret results and diagnose unexpected patterns. We'll see what comes of inserting the missing Grays_WF 2019 row, now that I've finally figured out a "clean" way to do it.

Something that may eventually need to be looked at is partitioning the MS Grays estimate into above and below the smolt trap. In most years the vast majority of spawning takes place above the trap site but there have been years (big returns) when it's possible that several hundred adults spawned below the trap.

On the same note, the juvenile trapping the USFWS did in Hamilton Springs was also not below the whole spawning population. It was below probably 95-99% depending on the year.

Hmm, this sounds like it would be tricky to handle on the model side (to say nothing of the data). Another good point to keep in mind, though. I'd imagine there is also considerable "straying" between Hamilton_Creek and Hamilton_Channel, but that's a whole 'nother issue...literally.

ebuhle commented 3 years ago

3) making sure that the way we are summing up total spawners based on the observations matches the disposition cheat sheet I pushed yesterday

Hey @tbuehrens, I think we're on the same page, but I'm not sure I understand what you mean about subtracting rows. My understanding has been that the raw escapement data Data_ChumSpawnerAbundance_2019-12-12.csv are maximally disaggregated, so processing would only involve summing. And that summing follows the general logic "sum all spawners (per year) with a given disposition, regardless of original return location". It's probably easier to be precise if I just dump the code here:

https://github.com/mdscheuerell/chumIPM/blob/e3031db53ff5fa50350c3f424551af88846f64af/analysis/R/LCRchumIPM_analysis.R#L57-L70

You can go to the chunk just above that for the gory details of cleaning up the raw data. As for Duncan, yes, the population is defined by disposition == "Duncan_Channel", although it's more complicated for the bio_data. If you're a glutton for punishment you can peruse this way-too-detailed comment where I laid out the logic and see if it matches yours.

ebuhle commented 3 years ago

I hadn't thought of this scenario when designing the downstream_trap construct, but clearly any pops whose smolts are pooled need to have fully matching rows in the data. I'll pad Grays_WF accordingly and see what happens.

I did this, and also slightly tightened the priors on mu_psi and sigma_psi. It's another incremental improvement, but unfortunately not a silver bullet. With adapt_delta = 0.99 we're down to a handful of divergences, but wall time is up to ~6 hr and it's still frequently maxing out treedepth. This is all for the Ricker; presumably things would be even worse with the B-H.

In the M0 / E_hat vs E_hat plots below, filled circles indicate states that have corresponding (S_obs, M0_obs) pairs (additional cases may have only one or the other). The states and observations are generally very close.

So it's starting to look like the data really do support a maximum, density-independent egg-to-smolt survival around 0.7, except in Duncan_Channel where it's closer to half that. It's counterintuitive that a constructed spawning channel would have the lowest potential survival, but otherwise this doesn't appear to be driven by a small number of outlier observations. If there's something wrong, it would have to be pretty systemic.

If that's the case, what do we do about it? I'm doubtful there's much more to be gained by tweaking priors; the B-H is already going to have a major prior-likelihood conflict. Maybe we could reparameterize, e.g., in terms of "steepness" (expected survival at X% of capacity), which would have a posterior further from 1. That would break the common interpretation of psi between the exponential S-R function (which I haven't fit recently) and the two density-dependent models, but I don't have any other clever ideas at the moment.
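For what it's worth, the steepness idea can be made concrete under the B-H parameterization from the top of this issue, where egg-to-smolt survival is s(E) = psi / (1 + psi * E / Mmax). Placing the prior on the survival s_star at a reference egg deposition E_star implies psi = s_star / (1 - s_star * E_star / Mmax). A sketch with hypothetical numbers (the function names are mine, not salmonIPM's):

```python
def bh_survival(E, psi, Mmax):
    """Density-dependent egg-to-smolt survival under the Beverton-Holt
    form used in this thread: M = E * psi / (1 + psi * E / Mmax)."""
    return psi / (1.0 + psi * E / Mmax)

def psi_from_steepness(s_star, E_star, Mmax):
    """Invert the survival curve: the maximum survival psi implied by
    survival s_star at reference egg deposition E_star (requires
    s_star * E_star < Mmax)."""
    return s_star / (1.0 - s_star * E_star / Mmax)

# Example: put the prior on survival at E_star = Mmax / 2 instead of on psi
Mmax, E_star, s_star = 1e6, 5e5, 0.35
psi = psi_from_steepness(s_star, E_star, Mmax)

# Round trip: the implied curve passes through (E_star, s_star)
assert abs(bh_survival(E_star, psi, Mmax) - s_star) < 1e-12
```

Since s_star sits further from 1 than psi, a prior on it might be easier to reconcile with the likelihood, at the cost of the interpretability issue noted above.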

ebuhle commented 3 years ago

Hey @tbuehrens, I think we're on the same page, but I'm not sure I understand what you mean about subtracting rows. My understanding has been that the raw escapement data Data_ChumSpawnerAbundance_2019-12-12.csv are maximally disaggregated, so processing would only involve summing. And that summing follows the general logic "sum all spawners (per year) with a given disposition, regardless of original return location".

@tbuehrens explained this on the phone today, and it sounds like we may not be on the same page after all. At the end of my conversation with @kalebentley last summer that led to the revised data-munging outlined in the comment I linked earlier, I thought I understood that disposition refers to where the fish ended up (presumably) spawning, and thus was the grouping unit for escapement and bio_data. But it sounds like the meaning of disposition may in fact differ between populations -- in some populations, indicated in @tbuehrens's spreadsheet, broodstock removals take place after the abundance estimate shown in Data_ChumSpawnerAbundance_2019-12-12.csv. (Wouldn't that make it the same as location?) If this is true, then the data processing for those populations needs to be revised. (It's unlikely to affect the freshwater survival issue, though, as none of the pops in question have smolt data.)

@tbuehrens suggested it would be helpful to write out the data frame shown in the plots above, i.e., the relevant columns of fish_data_SMS (the actual input to modeling) augmented with the estimated states and indices, so that @kalebentley could spot-check it against the raw data for (1) any "suspect" spawner or smolt observations associated with anomalously high smolt-to-egg ratios and (2) data-processing logic. That file is here and contains the variables:

pop, year
S_obs - spawners; observation point estimate aggregated from Abund.Mean in raw data
tau_S_obs - spawner observation error CV calculated from Abund.Mean and Abund.SD
S - spawners; posterior median of states
f - age-weighted fecundity; posterior median of states
E_hat - expected egg deposition; posterior median of states
M0_obs - smolts from brood year; observation point estimate from Abund_Median in raw data
tau_M0_obs - smolt observation error CV calculated from Abund_Mean and Abund_SD
M0 - smolts from brood year; posterior mean of states
downstream_trap - row corresponding to downstream smolt trap (only applies to Grays_CJ and Grays_WF)
M0_downstream - smolts at trap; posterior mean of states (only differs from M0 in Grays_MS)
M0_div_E_hat - M0 / E_hat index of freshwater productivity as plotted above; posterior median
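As an aside on the tau columns: one standard way to convert a reported arithmetic mean and SD into a lognormal observation-error scale is moment matching, under which the log-scale SD is approximately the CV. (This sketches that arithmetic only; whether salmonIPM applies exactly this transformation is an assumption on my part.)

```python
import math

def lognormal_params(mean, sd):
    """Moment-match a lognormal to a reported mean and SD, returning
    (mu, sigma) on the log scale. For small CVs, sigma ~= CV, which is
    why an observation-error CV can serve as the lognormal scale."""
    cv = sd / mean
    sigma = math.sqrt(math.log(1.0 + cv**2))
    mu = math.log(mean) - 0.5 * sigma**2
    return mu, sigma

# Example with a CV of 0.2 (cf. tau_S_obs for Grays_MS 2018 above)
mu, sigma = lognormal_params(mean=3010.0, sd=3010.0 * 0.2)

# The lognormal mean exp(mu + sigma^2 / 2) recovers the reported mean
assert abs(math.exp(mu + 0.5 * sigma**2) - 3010.0) < 1e-9
```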

tbuehrens commented 3 years ago

Thanks Eric! I talked through each of these with kale today after we got off the phone...he'll follow up on both. Good talking today!

kalebentley commented 3 years ago

A few quick updates regarding the last couple of posts:

@tbuehrens explained this on the phone today, and it sounds like we may not be on the same page after all. At the end of my conversation with @kalebentley last summer that led to the revised data-munging outlined in the comment I linked earlier, I thought I understood that disposition refers to where the fish ended up (presumably) spawning, and thus was the grouping unit for escapement and bio_data. But it sounds like the meaning of disposition may in fact differ between populations -- in some populations, indicated in @tbuehrens's spreadsheet, broodstock removals take place after the abundance estimate shown in Data_ChumSpawnerAbundance_2019-12-12.csv. (Wouldn't that make it the same as location?) If this is true, then the data processing for those populations needs to be revised. (It's unlikely to affect the freshwater survival issue, though, as none of the pops in question have smolt data.)

I was able to touch base with Todd yesterday to discuss chum data/estimates as it pertains to the way we've specified estimates of spawner abundance using the “Location.Reach” and “Disposition” fields in the "Data_ChumSpawnerAbundance" file. Thomas had highlighted this as a potential issue last week here and created a "disposition document" here to help resolve this issue.

In short, our current data wrangling is correct, and thus the summary in Thomas' spreadsheet is incorrect, at least in the sense of obtaining observed spawners (S_obs) based on how estimates are currently summarized and reported.

So that everyone is on the same page:

I don't want to make this any more complicated than it needs to be, but I want to quickly expand on my statement above where I said "Thomas' spreadsheet is incorrect at least in the sense of obtaining observed spawners (S_obs)...". Although not directly stated, Thomas' spreadsheet is correct in that the methods used to generate estimates of abundance are different at mainstem vs. tributary sites/populations. I am not directly involved with generating the adult chum estimates so I don't know all the nitty-gritty details, but in general, estimates are generated in effectively two general ways depending on the site (@Hillsont please correct my mistakes):

Therefore, again, based on the way estimates are currently generated and reported in our "Data_ChumSpawnerAbundance" file, the interpretation of "Location.Reach" and "Disposition" is the same across all sites/populations/estimates.

As I alluded to above, over the past few days Todd realized that there have been some inconsistencies in how estimates of abundance have been summarized and reported in the "Data_ChumSpawnerAbundance" file. Specifically, it appears as though all estimates from 2002-2014 were reported the way I previously described (i.e., removals are subtracted from the total estimate at mainstem sites prior to being reported in our "Data_ChumSpawnerAbundance" file). However, since 2015, broodstock/translocation removals have NOT been subtracted out of the generated live M-R estimates. Thus, those mainstem spawner estimates are biased high. The totals aren't that different because the total number of removals is relatively small, but nonetheless the estimates are incorrect (i.e., inconsistent with the earlier time series). Todd is in the process of flagging and fixing the "Data_ChumSpawnerAbundance" file so that all reported estimates are consistent and the data wrangling can remain the same.

That being said, this does bring up a subtle point about the reported estimates, and specifically the estimate of uncertainty. Technically, the estimate of uncertainty includes the fish that are removed for brood. If I had to guess, ignoring this as we currently are will have a small effect on the actual uncertainty, but it's worth highlighting and perhaps addressing in the future.

Lastly, Todd is also going to update a few other spawner estimates for a second reason. I don't want to go into much detail here but we can certainly discuss this potential issue later.

In short, prior to running the mark-recapture analysis, recapture rates of males and females are assessed to see if stratification of sampled fish is warranted. In general, stratification is not warranted and thus carcasses are "pooled". However, there are a limited number of cases when the data were stratified. When stratification occurred, separate estimates of abundance were generated for males and females and recombined into a total estimate "post hoc". The recombining process has involved summing the means and SDs (i.e., squaring the SDs, summing the variances, and taking the square root to get the combined SD). This assumes the estimates were normally distributed, which is not appropriate in some years. Clearly this is an issue we can address in many different ways moving forward. However, the existing estimates are what they are. Todd and I talked this out yesterday and decided that he is going to flag any estimate that was generated via stratification/recombination and replace it with the pooled estimate in the "Data_ChumSpawnerAbundance" file. This will ensure that the estimates of uncertainty are accurate (i.e., derived from the posterior MCMC draws). I realize there's a trade-off here depending on how different the pooled vs. stratified estimates are, but in general they are almost never "statistically" different. The pooled estimates have already been run and overall there are not many that will be changed.

Todd is planning on having the "Data_ChumSpawnerAbundance" file updated later today or tomorrow. I will post an updated file version to GitHub soon after.

@tbuehrens suggested it would be helpful to write out the data frame shown in the plots above, i.e., the relevant columns of fish_data_SMS (the actual input to modeling) augmented with the estimated states and indices, so that @kalebentley could spot-check it against the raw data for (1) any "suspect" spawner or smolt observations associated with anomalously high smolt-to-egg ratios and (2) data-processing logic. That file is here and contains the variables:

@ebuhle Thanks for generating this summary file. I started working on this yesterday and my plan is to essentially focus on specific estimates that are resulting in extraordinarily high or low estimates of FW survival (i.e., M0_div_E_hat). I haven't made it too far yet, but I did spend a good chunk of time yesterday looking back at the Duncan estimates and trying to understand why the FW survival estimates are relatively low and "wonky". Given the length of my comment, I won't go into much detail about what I discovered, but it will likely require us to revisit previous discussions on the Duncan data (see here, not to mention a broader issue regarding translocated fish here).

I am hoping to spend some more time early next week and will hopefully have some more insight as to whether the unexpectedly high estimates of FW survival are real/believable or a result of shoddy data.

tbuehrens commented 3 years ago

@kalebentley and @Hillsont thanks for following up and doing the deep dig re: the abundance estimates. i'll go ahead and delete my excel file since we now know it is wrong. I'll leave it to you two to fix the small number of mainstem estimates since 2015 that need to be fixed.

will stay tuned re: what you find looking at the summary file

ebuhle commented 3 years ago

Yes, thanks so much for this deep dive.

In short, our current data wrangling is correct

I'm relieved to hear that this is true, or at least will be true after some revisions. Does any of this apply to Data_ChumSpawnerBioData_2019-12-12.csv as well?

The points about uncertainty, however, are exactly what I've been thinking / wondering about since this issue was raised.

That being said, this does bring up a subtle point about the reported estimates and specifically the estimate of uncertainty. Technically, the estimate of uncertainty includes the fish that are removed for brood. If I had to guess, ignoring this as we currently are will have a small effect on the actual uncertainty but worth highlighting and perhaps addressing in the future.

When stratification occurred, separate estimates of abundance were generated for males and females and recombined into a total estimate "post hoc". The recombining process has involved summing the means and SDs (i.e., squaring SD, summing variance, and sqrt to get combined SD). This assumes the estimates were normally distributed which is not appropriate in some years.

Right, both of these procedures violate our assumption that the observation error priors (posteriors from the M-R models) are lognormal. The first one -- subtracting broodstock and translocations from the estimate -- might actually be the more problematic, as a left-translated lognormal would have a negative lower bound. If the numbers removed are small and the observation CV is tight then it's probably not a big deal, but the formal "IPM way" to handle this would be to subtract the removals from the state S in the model rather than subtracting them from the data. (Not that I'm eager to deal with that, mind you.)

An alternative and probably better way to handle the sex-stratification would be to add the posterior draws for M and F and then compute summary statistics, as opposed to the other way around.
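In sketch form, the "add the draws, then summarize" approach looks like this (Python, with hypothetical posterior draws standing in for the sex-stratified M-R output):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical posterior draws from sex-stratified mark-recapture models
male_draws   = rng.lognormal(mean=7.0, sigma=0.15, size=10_000)
female_draws = rng.lognormal(mean=7.2, sigma=0.25, size=10_000)

# Combine at the draw level first, then compute summary statistics
total_draws  = male_draws + female_draws
total_mean   = total_draws.mean()
total_median = np.median(total_draws)
total_sd     = total_draws.std()

# The sum of lognormals is right-skewed, so its median sits below its mean;
# summing the component summaries under a normality assumption misses that.
assert total_median < total_mean
```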

Looking forward to hearing what you've learned about the Duncan data, @kalebentley.

tbuehrens commented 3 years ago

@ebuhle and @kalebentley I think you guys might be overthinking the significance of the variance structures and lognormal shape when it comes to implementing math operators on our abundance estimates. A couple thoughts:

1) if we are adding male and female stratified estimates, assuming there is no covariance (which there shouldn't be if they were stratified and thus the data partitioned), then the variances ARE additive. Additionally, if a lognormal is a good approximation of the posterior pdf, then calculating the CV or lognormal SD of the sum should give us a good shape parameter. What is less straightforward is calculating the parameter for central tendency: the means are summable but the medians are not (the median of a sum of random variables does not equal the sum of their medians). Fortunately, there are moment matching equations we can use to accomplish this part.

2) for subtracting broodstock removals, eric is right that if you subtract the broodstock from all posterior draws, you end up (potentially) with negative values. I would NOT recommend this. Instead, i'd just assume that the posterior variance/SD or CV is unchanged after the broodstock removals. If you assume CV is unchanged then basically what you are assuming is that with the broodstock removals, you also removed some variance. If you assume variance or SD is unchanged, you are assuming that broodstock removals had no variance, which I think is the better assumption given how abundance was estimated. So basically you would subtract broodstock removals from the central tendency (posterior median or mean) but NOT from the posterior SD.

These two approaches should allow us to preserve quite accurate representations of the variance in abundance estimates as well as the posterior geometry, presuming it was lognormal to begin with (which is a damn good assumption in most cases).
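The broodstock bookkeeping in point 2) amounts to a one-line choice about which scale parameter to hold fixed. A toy sketch (numbers invented, using median * CV as a rough natural-scale SD):

```python
# Hypothetical pre-removal estimate and broodstock take.
median_S, cv_S = 800.0, 0.10   # posterior median and CV of spawners
broodstock = 120.0             # known removals, treated as exact

median_adj = median_S - broodstock

# Option A (recommended here): removals have no variance, so keep the
# posterior SD fixed and let the CV grow after subtraction.
sd_S = cv_S * median_S          # rough SD on the natural scale
cv_keep_sd = sd_S / median_adj

# Option B: keep the CV fixed, which implicitly assumes the removals
# carried some of the variance away with them.
sd_keep_cv = cv_S * median_adj
```

Under Option A the adjusted CV exceeds the original; under Option B the adjusted SD shrinks below the original.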

kalebentley commented 3 years ago

@ebuhle and @tbuehrens - a couple of quick responses

I'm relieved to hear that this is true, or at least will be true after some revisions. Does any of this apply to Data_ChumSpawnerBioData_2019-12-12.csv as well?

I don't believe so. Looking at our BioData file, the data are set up in the same manner as the SpawnerAbundance file -- there's a "Location.Reach" and "Disposition" field by Year and Strata. Therefore, I'm assuming that if "Location.Reach" and "Disposition" are the same, those data should be representative of fish that presumably spawned at the "Disposition" location. Even if this wasn't the case, I would be surprised if the BioData were substantively different for fish with the same "Location.Reach" but different "Dispositions", given that (I believe) broodstock/translocated fish are supposed to be random samples.

One aside regarding our BioData summaries - these summaries constitute aggregations across an entire season. Therefore, if there are actual compositional differences in origin, sex, or age that do not match our sampling rates (e.g., age-5 fish return later but recovery rates are lower), then our compositional estimates (run reconstruction) will obviously be biased. I don't have a sense for how big of an issue this is for chum in the Lower Columbia, but it certainly is important when I generate estimates of abundance for fall Chinook in the Lewis River (a large tributary of the lower Columbia). Is this issue typically ignored in other salmon IPMs?
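The bias from pooling across the season is easy to illustrate with toy numbers (all invented): if age-5 fish return later, when recovery rates are lower, the pooled recoveries under-represent them, whereas weighting each period's composition by its run size does not.

```python
# Two periods of the season; within-period sample composition is
# assumed unbiased.  All numbers are made up for illustration.
run_early, p5_early, rec_rate_early = 1000, 0.10, 0.30
run_late, p5_late, rec_rate_late = 1000, 0.40, 0.10

# True season-wide age-5 proportion.
true_p5 = (run_early * p5_early + run_late * p5_late) / (run_early + run_late)

# Pooled estimate: recoveries from both periods lumped together,
# so each period is weighted by its recoveries, not its run size.
n_early = run_early * rec_rate_early   # 300 recoveries
n_late = run_late * rec_rate_late      # 100 recoveries
pooled_p5 = (n_early * p5_early + n_late * p5_late) / (n_early + n_late)

# Time-stratified estimate: weight each period's sample composition
# by that period's estimated run size.
strat_p5 = (run_early * p5_early + run_late * p5_late) / (run_early + run_late)
```

With these numbers the true age-5 fraction is 0.25 but the pooled estimate is 0.175; the stratified estimate recovers the truth.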

Regarding uncertainty for stratified estimates -- @tbuehrens Yes, I agree re: "the variances ARE additive". As Todd was explaining to me how he was generating summary statistics on combined estimates, and knowing that moment matching was NOT occurring, I lost sight of the fact that moment matching is not needed for SDs (so long as they are converted to variances first, which I believe was being done). Although @Hillsont will probably be annoyed at me for initially steering him in the wrong direction, we should keep the "post hoc" summarized estimates, since they should in theory be more accurate and their uncertainty is in fact correct. But just for perspective, Todd did say there were only a small number of stratified estimates, and in most cases the estimates were very similar. @Hillsont I can update the "Data_ChumSpawnerAbundance" file again if you've already switched these estimates from stratified to pooled.

Regarding uncertainty for broodstock removals -- @tbuehrens your explanation makes sense to me. Thanks for the feedback.

ebuhle commented 3 years ago

Is this issue typically ignored in other salmon IPMs?

The application of IPMs to salmon is nascent enough that no one's reckoned with it yet, but nonrandom compositional sampling is a huge issue in marine stock assessment (with its own micro-literature, natch). In general, there are several ad hoc approaches for dealing with heterogeneous sampling (i.e., overdispersion), but there's not much you can do about bias if you're just given the compositional summary. That said, it might be possible to account for it using a time- and age-stratified model to weight / partially pool the raw stream sampling observations. Without knowing the details of how it's done now, it sounds like this might fit into one of the projects @tbuehrens and I talked about the other day.

My point about subtracting broodstock was just that if S is lognormal and B is a positive constant, S - B (presumably zero-truncated) cannot be lognormal. If B is big enough relative to both the mean and SD of S, it could substantially change the shape of the distribution. I'm guessing it's NBD in practice because the number subtracted is generally small and the escapement estimates are generally so precise that your lognormals look almost normal. But offhand, I'm actually not sure of the properties of a lognormal approximation to a zero-truncated left-shifted lognormal that uses the SD vs. the CV of the unshifted original.

(Of course, it's a bit academic since we're just declaring the posterior of the sampling model lognormal anyway. And there's still the whole issue of putting a prior on a transformed parameter where computing a Jacobian is infeasible, which isn't wrong per se but doesn't really mean what we'd like it to.)
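A quick Monte Carlo sketch of the S - B point (all parameters invented): with a tight CV and a removal far below the bulk of the posterior, the truncation never actually bites, subtracting the constant leaves the SD exactly unchanged, and the CV grows accordingly.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical escapement posterior and broodstock removal.
mu, sigma = np.log(800.0), 0.10   # lognormal S: median 800, ~10% CV
B = 120.0                          # fixed removal

S = rng.lognormal(mu, sigma, size=200_000)
S_adj = np.maximum(S - B, 0.0)     # zero-truncated left shift

# With a CV this tight, essentially no draws fall below B,
# so the "truncation" is vacuous in practice.
frac_truncated = np.mean(S - B <= 0.0)

# Shifting by a constant leaves the SD unchanged but inflates the CV.
sd_orig = np.std(S)
sd_adj = np.std(S_adj)
cv_orig = sd_orig / np.mean(S)
cv_adj = sd_adj / np.mean(S_adj)
```

When B is large relative to the spread of S, the truncation does bite and the shifted distribution stops looking lognormal, which is where the SD-vs-CV question Eric raises would start to matter.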

Hillsont commented 3 years ago

First off, I want to apologize for the switch in how I summarized the adult data and the confusion it caused. I must have paused when pulling it together, and when I re-started I reversed the approach.

Between going through all of the adult spawner data and flagging suspect estimates, and all of the conversations re: Duncan channel psi values, I thought it might be good for me to do a similar exercise with the Duncan channel adult and juvenile data. I was wondering if pre-spawn mortalities, retained eggs, partially spawned females, badly skewed sex ratios, or biased juvenile estimates might be reasons for the issue. I completed the review last Friday but didn't get to posting anything until today. There are definitely a couple of years that should be dropped/censored if the decision is to keep the Duncan North and South channels combined. We made an assumption back at the start of this to combine the channels, and the thought was that the “partial” females released in one of the two channels wouldn’t be an issue. I’m seeing evidence that makes me think it might be an issue, especially when the majority of females in a channel come from the adult trap (BYs 2010, 2011, 2015, 2016, and maybe 2018). I'll send the review file to @kalebentley today and follow up directly with him. Sorry Kale, more footnotes for Duncan.

ebuhle commented 3 years ago

Thanks @Hillsont. As I said in #5, @kalebentley has some good ideas to deal with partial females and changing habitat area. I've looked for an effect of sex ratio on apparent productivity and haven't seen a hint of a relationship (which is sort of too bad, because part of me has been itching to add sex structure to the model since we have the data). I'll be interested to hear any other patterns you found, especially w.r.t. the juvenile estimates.

kalebentley commented 3 years ago

@ebuhle - Just to clarify, sex data i.e., "Data_ChumSpawnerBioData_2019-12-12" --> Sex (male, female), are not currently being used in the IPM to calculate "E_hat"? If not, the assumption is 50:50 sex ratio? I haven't looked at the sex ratio data by population and year in a while so perhaps the assumption of 50:50 is fine outside of maybe Hamilton Springs Channel.

As a quick heads up, the adult estimates from 2019 have been finalized. Therefore, I'll add these data/estimates to the "Data_ChumSpawnerAbundance_XXXX-XX-XX" and "Data_ChumSpawnerBioData_XXXX-XX-XX" files here soon as I work to incorporate the updates/fixes to the existing spawner dataset.

Related to my question above regarding sex data, the spawner abundance estimate in Crazy Johnson (CJ) for 2019 is 74 spawners (%CV ~ 110%; only 2 recaps). The low abundance was due to low water in the fall and adults not being able to access CJ. Anyhow, there were only something like 14 carcass recoveries that were sampled for bio-data, of which 10 were males and 4 were females. Assuming a 50:50 sex ratio here versus something closer to 25% female (4/14) clearly changes "E_hat". The estimated outmigration in spring 2020 was ~8K. All this to say that a 50:50 sex ratio may be a safe assumption in most years and locations except in years with extremely low abundance (and, again, Hamilton Springs Channel).

Hillsont commented 3 years ago

@ebuhle & @kalebentley There were also a couple of years at Duncan Channels when trap catch was high and we couldn't balance out the males with translocated females: BY15 66F & 115M, BY16 105F & 154M.

ebuhle commented 3 years ago

Just to clarify, sex data i.e., "Data_ChumSpawnerBioData_2019-12-12" --> Sex (male, female), are not currently being used in the IPM to calculate "E_hat"? If not, the assumption is 50:50 sex ratio?

Yes, that's correct, the model is not sex-structured. Sex ratio is quite variable (note that those plots were made before we switched from Duncan_Creek to Duncan_Channel), but there's surprisingly no indication that it explains the process errors in smolt recruitment.

As a quick heads up, the adult estimates from 2019 have been finalized. Therefore, I'll add these data/estimates to the "Data_ChumSpawnerAbundance_XXXX-XX-XX" and "Data_ChumSpawnerBioData_XXXX-XX-XX" files here soon as I work to incorporate the updates/fixes to the existing spawner dataset.

Sweet! Thanks.

kalebentley commented 3 years ago

Ok - thanks for providing the summary plot (and reminding me that you already had summarized this before). Broadly speaking, I guess the results above make sense given that the sex ratio does hover around 50:50 in most years/pops except Hamilton Springs (already noted). The other populations with seemingly more variable sex ratios -- St. Cloud, Horsetail, Multnomah, Hardy -- all have relatively low spawner abundances (and likely very low absolute carcass recoveries), and thus the sex ratio estimates appear to be highly uncertain.

Curious - Do you foresee any other benefit(s) to adding "sex-structure" to the model? Is the only downside to adding it the time it would take to implement?

ebuhle commented 3 years ago

Curious - Do you foresee any other benefit(s) to adding "sex-structure" to the model? Is the only downside to adding it the time it would take to implement?

The other interesting scenarios would be if sex "interacts" (in the 2-way contingency table sense) with age and/or rearing type. In that case, you would need to model sex structure differently than if males and females from the same cohort share all parameters but adult sex ratio simply affects per capita fecundity. I've looked into this and haven't seen much evidence for interactions; I won't paste the plots because this thread is already a beast, but you can play around with it here. I'd be happy to open another issue for sex structure if there's interest.
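For concreteness, "interacting in the 2-way contingency table sense" just means the observed sex-by-age (or sex-by-rearing-type) counts depart from the independence expectation. A minimal sketch with invented carcass counts:

```python
import numpy as np

# Invented carcass counts: rows = sex (F, M), cols = age (3, 4, 5).
observed = np.array([[40.0, 120.0, 20.0],
                     [35.0, 110.0, 25.0]])

# Expected counts under independence (no sex-by-age interaction):
# outer product of the margins divided by the grand total.
row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row @ col / observed.sum()

# Pearson chi-square statistic; under independence it is approximately
# chi-square with (2 - 1) * (3 - 1) = 2 degrees of freedom.
chi2 = ((observed - expected) ** 2 / expected).sum()
```

If there were no evidence of interaction (as the thread suggests), a single shared age distribution plus a sex ratio entering through per capita fecundity would suffice.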