Modeling the hatchery life cycle: broodstock to smolt

I wanted to document my thoughts on the broodstock-to-smolt stage of the hatchery life cycle while they're still fresh, so that we (including Future Me) can easily pick up where we left off later this year or early next year. I'm going to open a companion Issue at ebuhle/salmonIPM with annotated pseudocode, because (as usual) this is relatively easy to describe but pretty gnarly to implement. I'd encourage you, especially @tbuehrens and @kalebentley, to read that Issue in tandem with this one. I'll link it here, but since the repo is private you'll need to be a collaborator to view it. @tbuehrens already is, but @kalebentley let me know if you'd like me to invite you. [Edit: it looks like the link to that Issue below this post only shows up if you're logged in and are a contributor to the salmonIPM repo.]

Let's start with a few key insights and proposed shifts in our modeling practices that allowed me to formulate this approach. First, a general disclaimer: the notation used below is subject to change without notice. There's a broader salmonIPM notational shakeup on the horizon, but for now I've tried to indicate where I intend to rename existing parameters.

We need to disambiguate the set of hatchery populations (defined by the S-R function that applies to them) from the set of known-origin populations. Currently these are the same, which is why Duncan Channel wasn't part of the initial dispersal model. However, the known-origin set is the same as the set that receive adult transfers / translocations, so they get the same index / notation: which_O_pop. (Maybe this will come back to bite us later? I can't foresee why it would, but let me know if you can.)
It's time to take the broodstock mining rate (currently B_rate, to be renamed b) more seriously as a parameter and the broodstock penalty more seriously as a likelihood component. Actually, it turns out the proposed model doesn't use the B_take_obs "likelihood" anymore because the predicted state now becomes hatchery spawners, but that just shifts the stage at which an arbitrary(ish) likelihood needs to be applied. More on this below.
Amazingly, there are no additional parameters in this formulation! That's because it uses the observed annual disposition frequencies from each location to calculate the conditional disposition distribution, given that a fish was taken as broodstock. This becomes unconditional when multiplied by b (cf. the logic of SAR and conditional age-at-return). We assume this conditional broodstock distribution is known w/o error, issues such as this one notwithstanding for now. The idea is that if we are forced to model the total number of broodstock as uncertain for computational reasons, we can at least use their known disposition distribution (vs. treating it as another simplex-array parameter to estimate, analogous to the dispersal matrix P_D). This works because the redistribution step involves multiplication and addition (i.e., matrix multiplication) whereas in principle the total broodstock removal should be subtraction, but that breaks the MCMC so we turn it into multiplication (by b) too.
One cool aspect of this approach is that, as described below and in the pseudocode, it predicts spawner age and sex composition -- in both natural and hatchery populations -- as a mixture across all source populations that contributed to those spawners through dispersal and/or transfers, weighted by their relative abundances. Previously this model, and all salmonIPM models, excluded HOR from the age-composition likelihood, but now we can use all the compositional data. This resolves something that had been bugging me.
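To make the broodstock bookkeeping concrete, here is a minimal Python sketch with made-up numbers. It is not salmonIPM code: the population names, counts, and the helpers `conditional_disposition` and `redistribute` are all hypothetical. The only things taken from the description above are that the conditional disposition distribution is computed from observed disposition frequencies, and that multiplying it by `b` gives the unconditional transfer probabilities.

```python
def conditional_disposition(counts):
    """Normalize observed broodstock counts to P(disposition | taken as broodstock)."""
    total = sum(counts.values())
    return {dest: n / total for dest, n in counts.items()}


def redistribute(returns, b, disp_counts):
    """Split returning adults into fish left in place and fish moved to each disposition.

    returns:     adults returning to each location (hypothetical values)
    b:           broodstock removal rate for each location
    disp_counts: observed broodstock disposition counts for each location
    """
    stay, received = {}, {}
    for loc, n in returns.items():
        removed = n * b[loc]  # removal expressed as multiplication by b
        stay[loc] = n - removed
        cond = conditional_disposition(disp_counts[loc])
        for dest, p in cond.items():
            # unconditional transfer = b * P(disposition | broodstock)
            received[dest] = received.get(dest, 0.0) + removed * p
    return stay, received


# Hypothetical inputs: two donor populations and two recipient hatcheries.
returns = {"wild_A": 1000.0, "wild_B": 500.0}
b = {"wild_A": 0.10, "wild_B": 0.20}
disp_counts = {
    "wild_A": {"hatchery_X": 90, "hatchery_Y": 10},
    "wild_B": {"hatchery_Y": 50},
}

stay, received = redistribute(returns, b, disp_counts)
print(stay)
print(received)
```

Because every removed fish is assigned to some modeled disposition, the totals before and after redistribution match, which is the closure assumption in numerical form.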

Now let's turn to the critical new (or mostly new) assumptions underlying my proposed broodstock-to-smolt model.

1. The set of natural and hatchery / channel populations in the model is closed, i.e. all broodstock removed go to one of the modeled hatcheries or spawning channels and all spawners in the hatcheries or channels were taken from the modeled populations. @Hillsont and @kalebentley have mentioned at least some cases where this isn't true, e.g. broodstock being taken to Big Creek Hatchery. What to do about it? If Big Creek Hatchery is the only one, we could just include it in the model; I've been reconsidering the decision to exclude it anyway. But that's not possible with the existing data, even under our current model. Big Creek Hatchery doesn't appear in spawner_data or juv_data, and only appears in bio_data as an origin. There are no records of broodstock transfers to Big Creek Hatchery or of spawner or smolt abundance there. As far as our data are concerned, adults from Big Creek Hatchery fall from the sky. We would need the same data that we have for the other three hatcheries ... but then the same assumptions would apply, i.e. we would have to expand the model to any OR pops that send broodstock to or receive adults from Big Creek Hatchery. (FWIW, this issue may have existed all along. I compute B_take_obs by summing spawners returning to each location that had a different disposition, so if Big Creek Hatchery is not listed as a disposition then it's not clear how those removals / transfers are accounted for in the data. If spawners are first taken to one of the modeled hatcheries, recorded as their disposition, and subsequently they or their eggs are transferred elsewhere, then that's fine for now but not fine for the broodstock-to-smolt model.)
2. Broodstock collection is random with respect to age, sex and origin, therefore adult recruits from a given pop (i.e. origin, whether identifiable or not) carry their demographics along with them when they are transferred / translocated from their return location to another pop (i.e., disposition). A very defensible assumption, but we'll need to validate it with data. (In particular, I wonder about sex selectivity -- wouldn't you be trying to meet some quota of males and females?) For that matter, we've also been assuming dispersal is random with respect to age and sex, and should check that too. I have an old EDA script that looked at some of these crosstabs; I'll repurpose it and post the results in this Issue when I have time. A related assumption, and one we've made all along, is that age, sex and origin are conditionally independent, so we only need to keep track of the margins of that potential horrible 3-way contingency table.
3. We have to make some decisions and assumptions about which states within the broodstock $\rightarrow$ hatchery / channel spawner $\rightarrow$ hatchery smolt process (i.e., abundance and age, sex and origin composition) are "observed" and, in the case of hatchery / channel spawner and hatchery smolt abundance, what the SD of those observation errors should be. This was one of the hardest parts to wrap my head around. I'm so used to thinking of B_take_obs as the "observation", at least in the sense that it gets a likelihood penalty, that my first inclination was to treat the broodstock -- its abundance and age, sex and origin composition -- as the relevant data, and then hatchery spawner abundance and composition would somehow be a deterministic function of those ... and that's where I got stuck. Eventually I decided it makes more sense, and is more coherent with the way the rest of the IPM works (in particular the dispersal component we added this year), to look at things from the POV of the spawning population vs. the "cohort", i.e. the population just before it undergoes some survival (or broodstock removal) and temporal or spatial redistribution process. In my proposed formulation, then, the new observations are: spawner abundance in the hatcheries, the age, sex and origin frequencies of those spawners, and hatchery smolt abundance. This is elegant because those are already the likelihood components for natural populations (and Duncan Channel).
   - For hatchery spawner abundance, we just need to determine what tau_S_obs for each observation (possibly invariant) should be. I've always felt that hatchery / channel spawner abundance, like B_take_obs, was essentially known without error and that modeling it as uncertain was a compromise with reality. But there must be some uncertainty, right? Fish get lost or double-counted, eggs get spilled, etc. Do we have any basis for estimating / guesstimating its magnitude? One starting point would be the fact that we currently apply the aforementioned lognormal penalty with SD = 0.05 to B_take_obs. Now we're shifting the observed state from broodstock to hatchery spawners, so a similar SD could apply. The awkward part is, as I've noted before, that penalty SD is actually on par with our "real" sample-based estimates of tau_S_obs. Another reference point is that we currently treat tau_S_obs for Duncan Channel as unknown (since reported values are 0) and impute it. This is undesirable behavior (see similar issue with tau_M_obs), so perhaps whatever logic we use for hatchery spawners should apply to Duncan Channel too?
   - For hatchery spawner composition, the only obvious route is to use the multinomial / binomial likelihoods just like for the wild pops. It's a little weird because, as mentioned above, my proposed broodstock-to-smolt process model already uses some of the same information in its "cohort" form -- namely the disposition-frequencies of broodstock taken from each location -- to define the conditional transfer probabilities. That wrinkle aside, we again have a situation where hatchery spawner age, sex and origin composition are "known", yet we model them as stochastic observations. If the sample sizes are big enough that the multinomial likelihood is super tight, it doesn't practically matter. This is generally true in Grays Hatchery and often in the others, but not always.
   - I am perfectly willing to believe that hatchery smolt releases really are measured with error! The question is just how to quantify it. There must be some sort of methodological basis we could use, but I defer to those more familiar with these programs for ideas.
4. Hatchery egg-to-smolt survival is density-independent. Seems reasonable, especially given the short FW residence of chum, though TBH I'm not sure what the literature says on this.
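To give a feel for what a lognormal observation error with SD = 0.05 implies for hatchery spawner abundance, here is a small Python sketch. The function name and the predicted abundance of 100 are hypothetical, and it assumes the SD is on the log scale with the predicted state as the median, which may not match the salmonIPM implementation exactly.

```python
import math


def lognormal_logpdf(obs, pred, sigma):
    """Log density of obs under a lognormal with median pred and log-scale SD sigma."""
    z = (math.log(obs) - math.log(pred)) / sigma
    return -math.log(obs * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z


sigma = 0.05   # penalty SD mentioned above for B_take_obs
pred = 100.0   # hypothetical predicted hatchery spawner abundance

# Approximate central 95% range of observations implied by this SD.
lo = pred * math.exp(-1.96 * sigma)
hi = pred * math.exp(1.96 * sigma)
print(lo, hi)

# The log density drops quickly as the observation moves away from the prediction.
print(lognormal_logpdf(100.0, pred, sigma), lognormal_logpdf(120.0, pred, sigma))
```

Under these assumptions an SD of 0.05 corresponds to observations within roughly 10% of the predicted state, which is one way to judge whether that magnitude is plausible for lost or double-counted fish.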

I believe those are the biggies. Hopefully others can point out if I've missed something obvious.

OK, discuss!

@ebuhle. Thanks for the extremely detailed post! I gave it a thumbs-up simply to acknowledge its presence and that I quickly scanned it. That said, I'm gonna need to re-read and digest it at a later point in time. Maybe (just maybe) we can aim to work on the IPM this fall (as opposed to waiting until next spring) and can start back on this topic.

Hey @ebuhle,

In a long overdue reply, I have provided responses to three of the four comments/questions you posted that were related to “the critical new (or mostly new) assumptions underlying [your] proposed broodstock-to-smolt model”. (NOTE: I haven’t looked at the companion Issue #54 you posted to the salmonIPM thread but also plan on reading through that sometime soon.)

As I was reading through this post, I realized that there was some general information on the “hatchery/channel populations” in our current chum IPM that would be helpful to have summarized in one location. Also, it would be helpful to provide a brief overview on how/why we plan to use the chum IPM to evaluate the performance of these hatchery/channel populations to contextualize our efforts to model the "hatchery" life-cycle. Therefore, I'm going to write a follow-up response. There’s probably a better location for this information but for now, it'll be useful (at least for me) to have it here (and we can move it later if desired).

Ok - getting back to the assumptions...

1. The set of natural and hatchery/channel populations in the model is closed

This assumption is probably/mostly satisfied for Duncan Hatchery given that all broodstock source “populations” that are used for this program are in our datasets/model and subsequently monitored via spawning ground surveys (thus recruits from this program can be detected/estimated). There are probably some locations where hatchery recruits could/do return that aren’t monitored (well) but I would expect this to be pretty low within the designated population (area) of the Lower Gorge.

This assumption is not true for the Grays Hatchery program based on the way the data are compiled now but could probably be mostly true with updates to our data files. The biggest thing here is partitioning the broodstock that are collected and used to produce chum fry that are planted back into the Grays Basin vs. the broodstock that are collected and used to produce eggs/fry that are planted/shipped to other locations (i.e., Big Creek Hatchery, Peterson RSI, and Skamokawa) that are not currently in our model. Based on the way Grays hatchery operations data have been collected, we probably cannot partition the adults and subsequent eggs/fry perfectly but can probably come up with something logical and consistent. Since the non-Grays locations are not in our IPM model (and thus do not meet the closure assumption), we should remove/ignore these outplants and come up with a way to denote the broodstock take as “loss” (perhaps something similar to imposing a fishery impact). The last thing we’ll need to decide is what to do with the Big Creek Hatchery origin recruits that can (and do) show up in our data set. Specifically, there are 32 observations of Big Creek Hatchery origin adults in our BioData file across all years (compared to the 374 Grays hatchery adults). One option, for now, could be to simply ignore them (i.e., pretend they are the same as natural-origin) and acknowledge that our pHOS estimates for the Grays may be slightly underestimated. I’m sure we could come up with some sort of ad hoc analysis/summary to approximate how much pHOS is underestimated by ignoring these Big Creek Hatchery strays. I can’t think of another option but am wondering if others can. Similar to what I highlighted above for Duncan Hatchery, there are certainly going to be locations/populations that Grays Hatchery recruits could return/stray to that are not in our current IPM model (e.g., Chinook, Elochoman). I’m comfortable highlighting this limitation and living with it for now.
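As a very rough version of that ad hoc summary, the two counts quoted above already bound the problem. The Python sketch below uses only those counts; it assumes (hypothetically) that both sets of fish come from the same pool of sampled spawners, and it ignores the fact that the counts are pooled across years and locations and are raw samples rather than abundance-expanded estimates.

```python
grays_hatchery_n = 374  # Grays Hatchery origin adults in BioData (from the text above)
big_creek_n = 32        # Big Creek Hatchery origin adults in BioData (from the text above)

# Share of identified hatchery-origin adults still counted as hatchery
# if Big Creek fish are treated as natural-origin.
retained = grays_hatchery_n / (grays_hatchery_n + big_creek_n)

# Relative (not absolute) underestimate of the pooled hatchery-origin count.
relative_underestimate = 1.0 - retained

print(round(retained, 3), round(relative_underestimate, 3))
```

Under those assumptions, pooled pHOS would be understated by roughly 8% in relative terms. A year- and location-specific version would need the full BioData table.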

This assumption is satisfied for the Lewis Hatchery program as it pertains to broodstock collection (which comes from I-205), but it is not satisfied with regard to detecting/estimating Lewis Hatchery origin adults/recruits. Our model currently does not include a Lewis population because estimates of abundance have not been generated. Even if estimates were generated, I’m not certain what, if any, data have been collected to identify Lewis Hatchery origin adults in the Lewis Basin. Across all years in our BioData file, we have a total of 10 Lewis Hatchery origin adults (juvenile plants began in 2011). These are technically strays based on the objective of this program. As an aside, we should revisit how we want to treat these hatchery strays in our hatchery evaluation. We included these recruits in our preliminary hatchery evaluation by lumping them with the I-205/Washougal population. While not technically wrong, our evaluation of Lower Gorge, I-205/Washougal, and Grays/Chinook is a bit inconsistent.

Similar to Duncan Hatchery, this assumption is probably/mostly satisfied for Duncan Channel. I know you know this but it's worth highlighting that this “population” is not a hatchery program but rather a population location that receives translocated adults and whose recruits we can detect using genetic stock identification. Duncan Channel has some nuances that make it unique and potentially not that “transferable” (e.g., all adults have to be translocated into Duncan Channel even if they are recruits from Duncan Channel; we don’t differentiate translocated adults that return back to the Duncan Creek trap and hypothetically would have spawned in Duncan voluntarily vs. adults that returned to mainstem spawning locations and would not have returned to Duncan had we not manually moved them in boat and truck). Nonetheless, it may be worth changing how the data are organized to make it more universal for other hypothetical translocation situations.

2. Broodstock collection is random

I don’t totally understand this assumption or rather why it is necessary. While broodstock collection should be random outside of the sex selectivity that you highlighted (and years early in the Grays dataset where the broodstock collection location increased the odds of collecting hatchery-origin adults), I don't see why we have to assume the demographics of broodstock are the same as those of the naturally spawning adults, because the broodstock and naturally spawned adults have their own sets of bio-data.

Also, I need some help interpreting the 2nd half of your last sentence, “…so we only need to keep track of the margins of that potential horrible 3-way contingency table.” I think I understand the “conditional” part of the sentence (i.e., we’ve been summarizing our bio-data using samples/fish that have all three components – origin, sex, and age – which accounts for potential interactions), but need some help with “keeping track of the margins”. What does this mean?

3. Observed states within the in-hatchery life-cycle and their corresponding observation errors

I’m punting this one for now. I need to figure out what information/data is collected during hatchery rearing to see what options we have here. I’ll circle back on this one shortly.

4. Hatchery egg-to-smolt survival is density-independent

Hmm…I would think that this is a safe assumption but we should have some data to evaluate it.

Thanks for this @kalebentley, very helpful.

Re the closure assumption, it sounds like Lewis Hatchery may end up being the most problematic case, albeit in the direction (i.e., unobserved returns) that already affects the hatchery model in its current release-to-return form, as opposed to the direction (i.e., broodstock sent to unmodeled dispositions) that will specifically affect the broodstock-to-release component. Unobserved returns, whether from a hatchery or natural population, are indistinguishable from mortality and so will manifest as lower estimated SAR. IIRC, SAR for Lewis Hatchery in the existing model is on the lower side, so that tracks.

> I know you know this but it's worth highlighting that this “population” is not a hatchery program but rather a population location that receives translocated adults and whose recruits we can detect using genetic stock identification. Duncan Channel has some nuances that make it unique and potentially not that “transferable” (e.g., all adults have to be translocated into Duncan Channel even if they are recruits from Duncan Channel; we don’t differentiate translocated adults that return back to the Duncan Creek trap and hypothetically would have spawned in Duncan voluntarily vs. adults that returned to mainstem spawning locations and would not have returned to Duncan had we not manually moved them in boat and truck).

Right, of course, but the relevant distinction here is between natural populations where spawners return and do their thing, and hatchery / channel populations where all spawners are deliberately collected and transferred in. The latter set of populations are also the ones whose origins are identifiable. Some of these (i.e., Duncan Channel) may then undergo natural reproduction while others (i.e., hatcheries) have artificial propagation. That's what I meant by disambiguating the categories of "transfer-recipient / known-origin vs. natural-return / unknown-origin" from "hatchery vs. natural reproduction", where the latter will be defined by the S-R function. These categories have been conflated in the model thus far because there's been no need to distinguish them, but the broodstock-to-smolt component changes that.

The process model (and the bio_data) does account for local self-recruitment to Duncan Channel as opposed to translocated adults from other locations, even though they intermingle on the spawning grounds.

> 2. Broodstock collection is random
>
> I don’t totally understand this assumption or rather why it is necessary. While broodstock collection should be random outside of the sex selectivity that you highlighted (and years early in the Grays dataset where the broodstock collection location increased the odds of collecting hatchery-origin adults), I don't see why we have to assume the demographics of broodstock are the same as those of the naturally spawning adults, because the broodstock and naturally spawned adults have their own sets of bio-data.

The bio_data are the observations, whereas what's at issue here is the process model. The approach I'm proposing would predict the age-, sex-, and origin-composition of spawners in each hatchery / channel as a mixture of the respective source populations, weighted by the relative numbers of broodstock they contribute. In order for this to be valid, broodstock collection must be random w.r.t. those three demographic characteristics. If that assumption were seriously violated, then we would have to additionally estimate transition matrices (possibly time-varying) representing the "selectivity" of broodstock collection from each wild pop to each hatchery / channel w.r.t. each of the three demographics. As if that's not bad enough, it would get even uglier if there were multi-way statistical interactions among age, sex, origin, and disposition (local vs. broodstock). On that point...
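A toy version of that mixture calculation, in Python, may help. The source names, contribution counts, and age frequencies are invented, and `mixture_composition` is a hypothetical helper rather than anything in salmonIPM; the sketch only shows the abundance-weighted averaging that the random-collection assumption licenses.

```python
def mixture_composition(contrib, comp):
    """Abundance-weighted mixture of source-population composition vectors.

    contrib: number of broodstock each source population contributes
    comp:    composition vector (frequencies summing to 1) for each source
    """
    total = sum(contrib.values())
    n_classes = len(next(iter(comp.values())))
    mix = [0.0] * n_classes
    for src, n in contrib.items():
        weight = n / total  # relative contribution of this source
        for i in range(n_classes):
            mix[i] += weight * comp[src][i]
    return mix


# Hypothetical example: two donor populations, three age classes.
contrib = {"wild_A": 90.0, "wild_B": 30.0}
age_comp = {
    "wild_A": [0.2, 0.7, 0.1],
    "wild_B": [0.4, 0.5, 0.1],
}

mix = mixture_composition(contrib, age_comp)
print(mix)
```

The same function applies unchanged to sex or origin frequencies. If collection were selective, each source's composition vector would first have to be passed through its own selectivity matrix, which is the extra machinery described above.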

> Also, I need some help interpreting the 2nd half of your last sentence, “…so we only need to keep track of the margins of that potential horrible 3-way contingency table.” I think I understand the “conditional” part of the sentence (i.e., we’ve been summarizing our bio-data using samples/fish that have all three components – origin, sex, and age – which accounts for potential interactions), but need some help with “keeping track of the margins”. What does this mean?

Consider the 3-way contingency table of age, sex and origin, which is a way of summarizing the bio_data within a given population and year. In principle, there could be interactions up to order 3 among the margins of this table -- meaning, e.g., age is not statistically independent of sex, or the interaction between age and sex depends on origin, etc. In practice, we have always assumed independence (or close enough), which has allowed us to model age, sex and origin as unrelated processes and to construct the observation likelihood from the three marginal frequency distributions as opposed to the joint 3-way cross-tabulation. If this weren't the case, the model as it exists would be quite a bit uglier and more unwieldy. Way back when, I checked these assumptions against the bio_data and they looked reasonable. Now we need (OK, strongly prefer) to make an analogous assumption regarding the broodstock collection process, i.e. that the margins of a 4-way table with age, sex, origin, and disposition are mutually independent. I'm working on checking this one now; stay tuned...
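Here is a small Python illustration of "keeping track of the margins", using a made-up 2 x 2 x 2 table of counts keyed by (age, sex, origin). The table, level names, and helper functions are hypothetical. The sketch computes the three marginal frequency distributions and the cell counts that independence would imply, and it counts how many free frequencies each representation needs.

```python
from itertools import product


def margins(table):
    """Return marginal frequency dicts for each dimension of a count table.

    table: dict mapping (age, sex, origin) -> count
    """
    n = sum(table.values())
    out = [{}, {}, {}]
    for key, count in table.items():
        for dim, level in enumerate(key):
            out[dim][level] = out[dim].get(level, 0.0) + count / n
    return out


def expected_under_independence(table):
    """Cell counts implied by multiplying the three marginal frequencies."""
    n = sum(table.values())
    age, sex, origin = margins(table)
    return {
        (a, s, o): n * age[a] * sex[s] * origin[o]
        for a, s, o in product(age, sex, origin)
    }


# Hypothetical counts constructed so that age, sex and origin are independent.
table = {
    (a, s, o): (10.0 if o == "known" else 40.0)
    for a, s, o in product([3, 4], ["F", "M"], ["known", "unknown"])
}

age, sex, origin = margins(table)
expected = expected_under_independence(table)

# Free frequencies needed: full joint table vs. the three margins.
joint_free = len(age) * len(sex) * len(origin) - 1
marginal_free = (len(age) - 1) + (len(sex) - 1) + (len(origin) - 1)
print(joint_free, marginal_free)
```

With real bio_data, large gaps between the observed counts and `expected` would signal interactions that the marginal likelihoods cannot capture. Adding disposition as a fourth dimension works the same way.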

> Broodstock collection is random with respect to age, sex and origin, therefore adult recruits from a given pop (i.e. origin, whether identifiable or not) carry their demographics along with them when they are transferred / translocated from their return location to another pop (i.e., disposition).

I made some quick and dirty plots to check this random-sampling assumption in the subset of populations that were broodstock donors. Age and sex look fine; there are a few statistically significant discrepancies between the age distributions of adults taken as broodstock and those left to reproduce naturally, but the differences are small and overall there's no systematic bias.

Origin, coded here as known or unknown where the former is a proxy for hatchery / channel and there is typically only one or at most two such origins present in a given population, is more problematic. Broodstock taken from Grays River (recorded as Grays MS, although as we've discussed, this is not always accurate) are disproportionately Grays Hatchery origin. Some other populations show a trend in the same direction, albeit much weaker.

I'm not sure what to do about these patterns at this stage. We would certainly prefer to start with the simplifying assumption of random broodstock sampling in any case, so I guess this is just a heads-up to pay attention to origin-frequencies when doing posterior predictive checking on the broodstock-to-smolt model once we get it built and working.

While making these plots, I realized that bio_data doesn't include any hatchery locations. We'll need to get those into the data set before we can proceed with fitting a broodstock-to-smolt version of the IPM. Maybe this is what @kalebentley meant by

> As I was reading through this post, I realized that there was some general information on the “hatchery/channel populations” in our current chum IPM that would be helpful to have summarized in one location.

I also looked into this assumption:

> Hatchery egg-to-smolt survival is density-independent.

With the caveat that these are observations not states, and that ignoring observation error will tend to overestimate the strength of density dependence (but presumably both $S^\text{obs}$ and $M^\text{obs}$ are more precise in hatcheries), it does indeed look like production in Duncan Hatchery and Grays Hatchery is density-independent. By contrast, the natural populations generally show Ricker-type log-linear density dependence. Lewis Hatchery, however, seems to be the exception.
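As a sketch of the contrast being drawn here, using generic symbols that are assumptions for illustration and not necessarily the salmonIPM parameterization ($\alpha$ for maximum smolts per spawner, $\beta$ for the strength of density dependence):

$$
\text{density-independent: } M = \alpha S \;\Rightarrow\; \log(M/S) = \log\alpha
$$

$$
\text{Ricker: } M = \alpha S e^{-\beta S} \;\Rightarrow\; \log(M/S) = \log\alpha - \beta S
$$

So on a plot of $\log(M^\text{obs}/S^\text{obs})$ against $S^\text{obs}$, density independence appears as a flat relationship, while Ricker-type density dependence appears as a straight line with negative slope.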

Is there any obvious reason why Lewis Hatchery would show density-dependent fry production, whereas the other hatcheries do not?

Hey @ebuhle,

Thanks for pulling these summaries together. I wanted to quickly respond to the pattern in origin composition you highlighted in your last post and specifically for Grays River....

> Origin, coded here as known or unknown where the former is a proxy for hatchery / channel and there is typically only one or at most two such origins present in a given population, is more problematic. Broodstock taken from Grays River (recorded as Grays MS, although as we've [discussed](https://github.com/ebuhle/LCRchumIPM/issues/18#issuecomment-1603007854), this is not always accurate) are disproportionately Grays Hatchery origin. Some other populations show a trend in the same direction, albeit much weaker.

Below is a plot of pHOS (percentage hatchery origin spawners) and pHOB (percentage hatchery origin broodstock) for the Grays Basin (MS, WF, and CJ combined) and the Grays River Hatchery, respectively. I grabbed these estimates from an HGMP (Hatchery and Genetic Management Plan) document that Brad Garner compiled last year. The estimates of pHOS shown here probably deviate ever so slightly from the IPM estimates but should be very close...

I want to show that the pattern you highlighted (i.e., "Broodstock taken from Grays River...are disproportionately Grays Hatchery origin."), which pooled results across all years, is likely a result of exceptionally high pHOB levels in the first few years of the hatchery program (2004-2006) and one recent year (2017). I vaguely remember @Hillsont explaining that the high pHOB levels observed in the early years were attributed to the broodstock collection location, which was modified and appears to be "working" given that since 2007, pHOB and pHOS have averaged 5.5 and 5.8, respectively. Overall, I think the assumption of randomized broodstock collection (i.e., representative of the donor stocks) is met concerning Grays River and specifically Origin. At the moment, I cannot speak to observed patterns at Hamilton Channel or Horsetail but those are pretty small proportions and likely equate to pretty small numbers of actual fish collected for broodstock.

As for your last question...

> Is there any obvious reason why Lewis Hatchery would show density-dependent fry production, whereas the other hatcheries do not?

...I can't say without more "digging" but it's worth noting that the "Lewis Hatchery" and "Duncan Hatchery" are essentially one and the same. That is, broodstock collected for the two "programs" are reared at the same facilities though kept in separate rearing troughs. The point being - I don't know why one "program" would exhibit density dependence and not the other. It may be worth discussing with @Hillsont and Brad as to what might be going on here but my first thought is that this pattern is spurious.

> I vaguely remember @Hillsont explaining that the high pHOB levels observed in the early years were attributed to the broodstock collection location

Yeah, I remember that too; @Hillsont actually mentions it in the post I linked. Nice to see it illustrated with data. I agree this temporal perspective is reassuring, insofar as the random-sampling assumption appears valid at the Grays Basin level after the first three years. Unfortunately, when it comes to retrospective fitting we can anticipate a significant lack of fit to those 2005-2006 observations. They will have high leverage due to the tight contours of the multinomial likelihood, which in turn may induce hard-to-predict biases in other components of the model. We'll just have to keep an eye on it.
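To illustrate why a few badly fit composition observations can carry so much leverage, here is a Python sketch using a binomial likelihood for origin (hatchery vs. not). All numbers are invented, including the predicted fraction of 0.06 and the observed fraction of 0.5; the point is only that the log-likelihood lost to a given mismatch grows in proportion to sample size.

```python
import math


def binom_loglik(k, n, p):
    """Binomial log-likelihood of k successes in n trials with probability p."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1.0 - p)


def loglik_deficit(k, n, p_pred):
    """Log-likelihood lost by predicting p_pred instead of the observed fraction k / n."""
    return binom_loglik(k, n, k / n) - binom_loglik(k, n, p_pred)


p_pred = 0.06  # hypothetical model-predicted hatchery-origin fraction

# Same observed fraction (0.5), two hypothetical sample sizes.
small = loglik_deficit(25, 50, p_pred)
large = loglik_deficit(250, 500, p_pred)
print(small, large)
```

A tenfold larger sample produces a tenfold larger deficit, so years with big bio-data samples and a poor fit will pull hard on whatever parameters can reduce the mismatch.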

Also worth bearing in mind that given the available data, we can only model Grays Basin broodstock as if they were all taken from Grays MS. That may ironically work in our favor, because Grays MS experienced a much more pronounced spike in $p_\text{HOS}$ in 2004-2006 than was seen at the basin level in that HGMP figure.

> I can't say without more "digging" but it's worth noting that the "Lewis Hatchery" and "Duncan Hatchery" are essentially one and the same. That is, broodstock collected for the two "programs" are reared at the same facilities though kept in separate rearing troughs.

Oh, I somehow didn't know that! Well, I like your suggestion that the pattern is spurious. :+1:

@tbuehrens, I'm wondering if you have any thoughts about these issues raised in the OP regarding the observation errors to use for modeling hatchery spawner and fry / smolt abundance:

> For hatchery spawner abundance, we just need to determine what tau_S_obs for each observation (possibly invariant) should be. I've always felt that hatchery / channel spawner abundance, like B_take_obs, was essentially known without error and that modeling it as uncertain was a compromise with reality. But there must be some uncertainty, right? Fish get lost or double-counted, eggs get spilled, etc. Do we have any basis for estimating / guesstimating its magnitude? One starting point would be the fact that we currently apply the aforementioned lognormal penalty with SD = 0.05 to B_take_obs. Now we're shifting the observed state from broodstock to hatchery spawners, so a similar SD could apply. The awkward part is, as I've noted before, that penalty SD is actually on par with our "real" sample-based estimates of tau_S_obs. Another reference point is that we currently treat tau_S_obs for Duncan Channel as unknown (since reported values are 0) and impute it. This is undesirable behavior (see similar issue with tau_M_obs), so perhaps whatever logic we use for hatchery spawners should apply to Duncan Channel too?
>
> I am perfectly willing to believe that hatchery smolt releases really are measured with error! The question is just how to quantify it. There must be some sort of methodological basis we could use, but I defer to those more familiar with these programs for ideas.

@tbuehrens - moving this topic back to the top of your email.
