Modeling hatcheries and straying

ebuhle / LCRchumIPM

This is the development site for an Integrated Population Model for chum salmon in the lower Columbia River.

MIT License

4 stars 1 forks source link

Modeling hatcheries and straying #16

Closed ebuhle closed 1 year ago

ebuhle commented 2 years ago

I wanted to give an update on one of our high-priority tasks for this year's SOW, namely expanding the IPM to explicitly model releases of hatchery (or known-origin) juveniles and their adult returns and distribution among "natural" populations. After having worked out a schematic or pseudocode for a couple of alternative ways to implement this in salmonIPM, this week I've turned my attention to taking a closer look at the relevant data. I should have done this sooner, because it's very revealing of what we will and won't likely be able to do with the available information.

The first thing that jumps out, which I already knew, is that the numbers of known-origin adults recovered in any given year are very sparse -- mostly zeros or single digits, with the exception of Grays Hatchery in some years. Looking at this more carefully, I'm doubtful that we will be able to support a year-varying model of the dispersal / straying matrix. So instead I looked at patterns in the aggregated adult returns across all years. Here is a summary of the total count of recoveries from each known origin in each recipient pop. S_pop is the estimated number of adults represented by each of those counts, calculated by expanding the annual proportional origin-composition vectors (including Natural spawner) in each pop by the corresponding adult abundance point estimates S_obs, and then summing over years. These estimates are also expressed as proportions (prop) within each origin. Since this is just a quick-and-dirty exercise for exploratory purposes, I haven't attempted to propagate the uncertainty in these derived quantities arising from both the spawner observation errors tau_S_obs and the multinomial sampling variance in the origin-composition data. Naturally the latter would add considerable uncertainty in cases where the annual counts are small.

               origin              pop count  S_pop prop
1      Duncan Channel Hamilton Channel    43  271.9 0.21
2      Duncan Channel   Hamilton Creek    21   49.1 0.04
3      Duncan Channel             Ives    58  469.9 0.36
4      Duncan Channel      Hardy Creek    42  107.2 0.08
5      Duncan Channel     Duncan Creek   131    0.9 0.00
6      Duncan Channel        Horsetail     8   49.6 0.04
7      Duncan Channel         St Cloud     6   52.8 0.04
8      Duncan Channel        Multnomah    12   77.7 0.06
9      Duncan Channel            I-205    20  238.0 0.18
10    Duncan Hatchery Hamilton Channel    15   43.8 0.15
11    Duncan Hatchery   Hamilton Creek    14   25.2 0.09
12    Duncan Hatchery             Ives    29   84.6 0.29
13    Duncan Hatchery      Hardy Creek     1    3.4 0.01
14    Duncan Hatchery     Duncan Creek     0    0.0 0.00
15    Duncan Hatchery        Horsetail     0    0.0 0.00
16    Duncan Hatchery         St Cloud     0    0.0 0.00
17    Duncan Hatchery        Multnomah     4    8.3 0.03
18    Duncan Hatchery            I-205     8  130.4 0.44
19     Lewis Hatchery Hamilton Channel     0    0.0 0.00
20     Lewis Hatchery   Hamilton Creek     1    3.6 0.06
21     Lewis Hatchery             Ives     3   13.5 0.23
22     Lewis Hatchery      Hardy Creek     0    0.0 0.00
23     Lewis Hatchery     Duncan Creek     0    0.0 0.00
24     Lewis Hatchery        Horsetail     0    0.0 0.00
25     Lewis Hatchery         St Cloud     0    0.0 0.00
26     Lewis Hatchery        Multnomah     0    0.0 0.00
27     Lewis Hatchery            I-205     5   41.6 0.71
28     Grays Hatchery         Grays CJ   200 1804.0 0.23
29     Grays Hatchery         Grays MS  1032 3092.3 0.40
30     Grays Hatchery         Grays WF   299 2829.3 0.37
31 Big Creek Hatchery         Grays CJ     8   64.5 0.27
32 Big Creek Hatchery         Grays MS    14   60.2 0.25
33 Big Creek Hatchery         Grays WF     8  116.8 0.48

Here's a similarly quick-and-dirty exploratory plot of the prop estimates:

Next I wanted to look at the dispersal kernel, i.e. the proportion of adults from each origin returning to a given pop as a function of pairwise distance. This bears directly on the question of whether we'll be able to parameterize the straying matrix by a lower-dimensional function, and thus whether it might be possible to extrapolate hatchery (or known-origin) stray rates to the natural populations whose adults are not specifically identifiable. The results are not encouraging; in fact, there's very little indication of decay-by-distance at all (note that Big Creek Hatchery is missing because it's not included in the pairwise distance data):

In light of this, I'm inclined to take the following approach:

Model the straying matrix as full-rank, i.e. a vector of discrete probabilities for each return location that sum to 1 within each known origin, rather than trying to parameterize it by a dispersal kernel.
Model the straying matrix as time-invariant.
Don't try to extrapolate dispersal from the known origins to the other populations (this would be a longer-term goal anyway, but just thinking ahead).
Possibly omit Big Creek Hatchery since it's not in the ESU and only has 30 recoveries in total? Then again, you could make a similar argument about Lewis Hatchery insofar as I just don't know how much information we can hope to squeeze out of a total of 9 recoveries. I'm open to suggestion on either one, though.

Sorry to be somewhat of a downer, but I found this exercise informative. Would love to hear any thoughts / suggestions before I start mucking around with the Stan code in earnest.

tbuehrens commented 2 years ago

Maybe I'm confused but the plots seem backwards...ie the proportion of origin for the hatchery programs is not the function of dispersal but rather a human decision (which stock we capture and transport to the hatchery)...where we'd expect [natural] dispersal is from wild pop to wild pop and from hatchery to wild...am I missing something?

ebuhle commented 2 years ago

[natural] dispersal is from wild pop to wild pop and from hatchery to wild

Correct, everything I'm showing here represents dispersal from a known origin (i.e., hatchery or spawning channel) to a given pop (i.e., ostensibly wild population). So, for example, the bars for Duncan Hatchery sum to 1 over all the locations where adults from that hatchery naturally strayed / returned to. Does that help, or is it still unclear?

tbuehrens commented 2 years ago

ah, ok I see that now...where are the zero's on your distance decay plot?...i actually don't think it looks as bad as you think...that said, I agree it's not as clean as you might hope....i suspect some of the transferring hatchery fish around and/or rearing them on well water could be part of the problem...the other problem that seems present in the data is one of pied-piper effect or destination population size...strays more apt to end up in large pop areas, straying distance being equal. can you add the zeros to the dispersal graph and see if it looks better (e.g., only 3 data points for Grays).

ebuhle commented 2 years ago

Yes, you're right, because of the way summarize() works the data frame shown above only includes combinations of origin and pop that actually appear in the data. After complete()-ing it to make all the zeros explicit, the dispersal plots look like this:

Even if all the origins are pooled, I still think you'd be awfully hard-pressed to fit a parametric kernel to these data that would pass the red-face test (and again, I'm not even showing the observation error in the estimates). You make good points about some of the mechanisms that might contribute to the noise in these relationships. I would only add that salmonid dispersal in general is a noisy process that's proven difficult to generalize, and with such a small number of observations based on sparse underlying recoveries, I'm not really surprised that we don't see a clearer signal emerge from the noise.

Hillsont commented 2 years ago

I'm having issues logging into GitHub so replying here. Eric, I think I'm following your logic train on how to expand raw origin counts to expanded origin counts per site. My run-reconstruction files stop around 2019 so I can't double check. I do want to comment on your possibility of using the Big Creek program in the analysis. Right now, the data set your being given to work with only includes recoveries of Big Creek program in the Grays Basin (Grays MS and WF along with CJ recoveries), we actually see more Big Creek Hatchery recoveries in the Elochoman & Skamokawa basins, a data set you're not currently including in the model (there are also a few Grays hatchery program recoveries in these basins not currently included in the model). I wonder if including these ELO/SKA recoveries (of both Grays and Big Creek programs) will change your results/mind. Including Grays recoveries in ELO/SKW is a small bump to Grays HORs return but a larger bump to Big Creek HORs in the ELO/SKA populations. Lastly, should we get Big Creek returns to Oregon pops included?

Thanks

kalebentley commented 2 years ago

Hey @ebuhle, Just now seeing this thread - I didn't receive an email notification and would have completely missed had Todd not brought it to my attention.

One quick question regarding the summary table above

For Duncan Channel [origin] fish ending up in Duncan Creek [pop], the table indicates a count 131 samples (which is twice as many as any other pop but the S_pop is 0.9 and thus the corresponding prop is 0. How is this possible?

One observation:

I have never realized that 0 Duncan Hatchery fish have been captured at Duncan Creek! That's crazy.

One question/thought:

I don't believe our "BioData" dataset includes all of the "origin" information we have to inform straying. For instance, as Todd highlighted above, Big Creek fish have been recovered in other WA populations that are not currently included in our data set (e.g., Elochoman R.). Also, I imagine a decent chunk of Big Creek fish return to Big Creek and other Oregon tribs. Since we are using proportions here, how valid is it to examine dispersal if we are omitting populations where a non-negligible number of fish could be returning? For example, are Grays Hatchery fish returning to other WA/OR populations that we aren't including here? Similarly, @Hillsont, do our current methods allow us to detect all origins? For instance, are we analyzing otoliths outside the Grays Basin and are we analyzing any PBT data outside of the Lower Gorge?

Kale

Hillsont commented 2 years ago

Good points Kale,

I do want to add that the purpose of the Duncan Creek project is to add to the persistence/ viability of the overall Lower Gorge chum population, not just the specifically into the Duncan Creek portion, so strays into other LG pops count. Yes, BPAs intent was to increase off-mainstem CR spawners to reduce their obligation to providing flows for the mainstem pops of the Ives, HT, Mult, and St Could areas by adding spawning habitat in Duncan Creek.

To Kales last question. Yes, we are looking at otolith decodes for origin determinations in LG Costal Strata populations (PBT for Bonneville to Lewis populations and otoliths for IMW streams to Columbia mouth (IMW, ELO/SKA, Grays/Chinook populations). We are geno-typing the Grays hatchery brood so if they end up in near Bonneville LCR pops (I-205 to Bonneville where we are using PBT) we should see that. We do not use PBT for Coastal strata origin IDs.

Thanks

ebuhle commented 2 years ago

Let's tackle the (relatively) easy one first.

One quick question regarding the summary table above

For Duncan Channel [origin] fish ending up in Duncan Creek [pop], the table indicates a count 131 samples (which is twice as many as any other pop but the S_pop is 0.9 and thus the corresponding prop is 0. How is this possible?

Good catch! The prop was not identically zero, but it rounded to zero at two decimal places. However, S_pop being less than the raw count is clearly an error. Long story short, this happened because I pulled the spawner abundance data used to make these plots from an intermediate object in our main analysis pipeline, where pop is defined based on disposition. You may be having an unwelcome flashback to the first year of the project when we went around and around on this issue before finally figuring it out. For the IPM, we treat Duncan Channel as the population and ignore Duncan Creek because only a handful of adults have been recorded spawning there. But for thinking about (natural) straying / dispersal, the returns to location == "Duncan Creek" are what matters, regardless of disposition.

I've now fixed this, and the distributions of known-origin returns by location and distance from source have shifted somewhat, but the main story remains the same.

To be clear, none of this will directly affect the IPM, which will be fitting the raw S_obs and origin-composition data and not these derived estimates. There is a related hairball, though, which I didn't recognize until I compiled these data. If we want to use our model of straying to predict p_HOS, replacing the current "hatchery spawners fall from the sky" approach, then w.r.t. Duncan Channel it will be necessary to somehow represent the number of adults from each origin translocated in. That is, S_add_obs will have to go from a scalar to some sort of vector of numbers by origin, and maybe the removal data B_take_obs will have to change to reflect the number from each origin removed from the donor pops? Ugh! And now this error has made me realize that we will have to represent Duncan Creek after all, to account for all the known-origin recoveries! I haven't fully wrapped my head around this yet, so let's just put a sticky note on it for now.

One observation:

I have never realized that 0 Duncan Hatchery fish have been captured at Duncan Creek! That's crazy.

That surprised me too! And it's directly from bio_data, not affected by this error.

One question/thought:

I don't believe our "BioData" dataset includes all of the "origin" information we have to inform straying. For instance, as Todd highlighted above, Big Creek fish have been recovered in other WA populations that are not currently included in our data set (e.g., Elochoman R.). Also, I imagine a decent chunk of Big Creek fish return to Big Creek and other Oregon tribs. Since we are using proportions here, how valid is it to examine dispersal if we are omitting populations where a non-negligible number of fish could be returning? For example, are Grays Hatchery fish returning to other WA/OR populations that we aren't including here? Similarly, @Hillsont, do our current methods allow us to detect all origins? For instance, are we analyzing otoliths outside the Grays Basin and are we analyzing any PBT data outside of the Lower Gorge?

This is a good point, and the answer depends on whether we're most interested in predicting p_HOS in the "wild" populations currently represented in the model, or in estimating SAR and the straying matrix for the hatcheries / spawning channels. For the former, it doesn't matter so much if there's leakage of hatchery fish outside the model domain. It's valid to consider the straying matrix as conditional probabilities, given that a hatchery adult returns to our set of populations. True, leakage introduces process error that makes it harder to predict total returns from a hatchery, but that will likely be soaked up by various process error terms. For the latter, though, leakage will obviously bias hatchery SAR downward. Since we'll probably want to account for the potential H/W difference with a dummy covariate anyway, this shouldn't affect the estimates for the wild pops. And conversely, there is very likely leakage from the modeled wild pops to unmodeled ones as well (and vice versa), which is a known unknown that we don't have much hope of estimating.

That said, I'm totally game to incorporate the Elochoman, Skamokawa and the other populations mentioned. Without them in the model, there's not really any coherent way to include known-origin recovery data from them. Last we talked about this, my impression was that the data from those populations are questionable and/or would need some work, but if it's doable I think it would be very worthwhile. Likewise, it would be great to include the OR-side populations. I remember we raised that possibility in our August 2020 project call, but I don't know what challenges might exist to data sharing, or even what data are available.

Saving the easiest for last... @kalebentley, sounds like you aren't getting notifications from this repo. You can change your notification settings here in the upper right:

@Hillsont, what kind of issues are you having logging in to GitHub?

tbuehrens commented 2 years ago

Dispersal kernel honestly doesn't look half bad pooling all pops...That said, lots of issues here: missing pops, uncertainty in kernel, not checking recoveries in all locations in a way that allows assignment to all pops...might it be worth a quick call to discuss these issues and hammer out a direction forward?

ebuhle commented 2 years ago

Just for laughs, I fit a wholly inappropriate beta regression in rstanarm to the pooled dispersal-by-distance data (zeros were changed to 0.001). Logit link for the mean, log link for the dispersion; both show clear relationships with distance. The fitted relationship (line and dark inner 95% credible interval) is actually surprisingly tight, because it's a simple model that trades low variance for high bias. But the lack of fit comes out in the posterior predictive distribution (light outer credible interval), which crashes up against zero so hard it actually throws a bunch of errors in rbeta() and ends up looking weird. The killer is that classic dust bunny distribution in the lower left; it's really more of an upper-quantile relationship with distance than a mean relationship. And of course this doesn't even account for observation error or the nonindependence of the proportions within each origin, which would make the effective sample size even smaller than nominal.

Anyway, I'm game for a call to discuss further. The next week is a bit hectic, but next Mon or Tues could work, depending on the number of scare quotes around "quick". After Thurs 4/7 I'm wide open.

tbuehrens commented 2 years ago

Is it going to slow you down too much to wait until after 4/7? If so maybe we do quick next week and we commit to a timeline (30 min?).

Other thoughts: 1) This really doesn't look bad for a spatial correlation...In the spatial lit they'd call that mess even at dist=0 the "nugget" variance...so in other words any spatial process usually has residual non spatial variance...this process is no different...I'm sure you already know this so maybe I'm just wearing rosy tinted lenses these days 2) slightly different way to do this would be as a multinomial regression which would get away from the need to fudge the 0.01's and also the inflated effective sample size...would need to include a fixed factor for site (in addition to distance) since we expect proportions to vary with site in addition to distance.

ebuhle commented 2 years ago

I'd be happy to have a brief chat early next week if that works for everyone else.

slightly different way to do this would be as a multinomial regression which would get away from the need to fudge the 0.01's and also the inflated effective sample size...would need to include a fixed factor for site (in addition to distance) since we expect proportions to vary with site in addition to distance.

Sure, but that's not as readily available off-the-shelf (although I suppose you could do it in brms), and since this is just an exploratory, illustrative exercise that doesn't relate directly to the IPM, I didn't want to go too far down the rabbit hole. Also a multinomial model would require fudging the continuous expanded abundances to integers. You raise a good point about variation among origins, though; that would add even more uncertainty. The issue with the nugget in this case is the strong negative bias in the residuals (near the origin, and in general). Posterior predictive checks, e.g. the plot above or marginal density, clearly show that excess of small observed values.

I guess my take is that there's not much advantage to trying to shoehorn these sparse data into a parametric kernel model, unless it would enable us to extrapolate that model to the wild populations. IMO that would be an even bigger stretch, for all the statistical reasons plus the structural issues you identified, as well as the general question of whether hatchery straying behavior is representative of wild.

kalebentley commented 2 years ago

I doubt I'll be able to contribute much to the details of this conversation but will make time to attend the proposed meeting. It looks like Thomas, Todd, and I all have time from 9-9:30 AM next Tuesday, April 5th. Do you want me to send out a meeting invite or should we wait until the following week? Monday, April 11th at 2 PM is open for us three as well.

ebuhle commented 2 years ago

Your contributions are always helpful, @kalebentley! Either one of those times would work for me; it's really up to you guys. If we do wait until the week after next, I have plenty of work to keep me busy on moving salmonIPM toward a public release.

tbuehrens commented 2 years ago

Let's wait...I know I'm pretty damn busy and if you aren't at a pivotal juncture (and could use the time for salmonIPM release), then I'd say wait a week.

kalebentley commented 2 years ago

Hey Everyone,

I had intended to write this message immediately after our last IPM check-in on April 11th. Well, that didn't happen and unfortunately, my meeting notes are a bit sparse but figured better late than never and perhaps you all can help fill in details where needed.

Overall, the focus of our meeting was to hash out some details we'd been going through on this thread as to how we would accomplish our primary objective for 2022 of "expanding the IPM to explicitly model releases of hatchery (or known-origin) juveniles and their adult returns and distribution among 'natural' populations."

Prior to work this year, the IPM was able to model the smolt-to-adult portion of a hatchery "grouping's" life cycle but did not include the adult-to-smolt portion. With this stage added (and all the complexity of how to deal with straying), we could start to evaluate some very tangible questions, including:

What is the lifetime "productivity" (i.e., adult-to-adult survival) of natural-origin vs. hatchery-origin recruits? Put another way, are these hatchery programs having a net positive, neutral, or negative effect on populations (i.e., are the adults being removed for broodstock being replaced by returning HORs)?
What is the "trajectory" of populations with and without hatchery supplementation. Currently, this would include fed fry releases in the Grays and Duncan/Gorge populations as well as adult translocations in Duncan.

@ebuhle refresh my memory - will we be able to answer either of these questions by the end of this year's work (June 30th, 2022)? If not, can you provide a brief description of what reporting metrics we'll be able to add to our analysis based on this year's model updates?

Speaking of reporting, quick side note -- we should all discuss what, if any, additional output summaries we want added to the Markdown. For instance, I'd love to see plots of fry vs. spawner with observed data for each population. Also, we may be able to add data/estimates for 2021 adults and 2022 juveniles before our final model run for reporting.

Moving forward, as we outlined in our SOW for next year, we will work to add abundance and compositional data for less intensively monitored Washington populations of Columbia River chum (e.g., Elochoman/Skamokawa). With these data, we can start to run scenarios evaluating the hypothetical effects of hatchery supplementation on other populations and hopefully be able to explore various types of intervention (eggs. unfed fry, fed fry, and translocated adults).

In summary, I want to make sure I have characterized our conversation correctly and make sure we are all on the same page for this year's "deliverables". Also, are there any additional data I need to pull together ASAP to accomplish this year's work? For instance, the current datasets do not adequately detail brood collections, transfers, and subsequent releases to properly characterize in-hatchery (adult to fry) survival. @ebuhle let's chat ASAP if this is something you will need soon.

Thanks, everyone! I'm really excited about where we are heading with our chum IPM!!

Kale

ebuhle commented 2 years ago

Hi @kalebentley, thanks for putting this all in writing. I would slightly amend this bit:

Prior to work this year, the IPM was able to model the smolt-to-adult portion of a hatchery "grouping's" life cycle but did not include the adult-to-smolt portion.

Prior to this year, we haven't explicitly modeled the hatchery life cycle at all; instead we've treated p_HOS in each population and year as an independent parameter, as the other models in salmonIPM do. What we proposed in our conversation on 4/11, and what I'm working on now, is to incorporate that first part of the hatchery life cycle, i.e. smolt release to adult return and spatial dispersal. We talked about doing the second part, i.e. adult broodstock collection to smolt release, in next year's SOW. That would then enable us to answer the two management-focused questions you listed (although it would be possible to address #⁠2 even using the current "non-mechanistic" formulation, by simulating trajectories with p_HOS set to zero).

I've been meaning to give you guys an update on how that smolt-to-adult piece is going. Unfortunately I've gotten a bit bogged down. I was doing some work on an introductory salmonIPM vignette to serve as an on-ramp for Jan, Neala, and others who are new to the package. In the course of doing that, I discovered some issues with the stan_init() function that eventually led to the realization that the initialization of the IPM_LCRchum_pp model is much more fragile than I thought. In the past, there have been occasional cases where the sampler would choke on numerical under/overflow, either at the initial values or early in warmup, and be unable to get off the ground, even though the actual post-warmup sampling and resulting posterior samples all looked fine. It seems that, for whatever reason, the recently updated (2021) data have made this issue much more prevalent. I got lucky the first time I re-fit the model with the new data and so didn't discover this right away. I spent way too long trying to diagnose what part of the model was causing the problem, without any clear resolution. But through trial and error I tweaked the inits such that they're now fairly robust -- when running interactively, I can just restart after the occasional warmup failure.

All that said, it's only been in the last week or two that I've refocused on the hatchery smolt-to-adult submodel (keeping in mind that adding complexity before resolving this inits problem with the current model would be asking for trouble). I'm still hopeful that I can get some version of that up and running by the end of the month. But given the above, I'm thinking that if we do want to update the data with 2021 adults and 2022 juveniles for this year's reporting, as @kalebentley suggests, we'd better do it sooner than later so we have time to iron out any bugs.

If not, can you provide a brief description of what reporting metrics we'll be able to add to our analysis based on this year's model updates?

Assuming I can get this first hatchery stage done, we'll be able to estimate the contribution of each hatchery program to overall p_HOS:

* Adult/spawner abundance by composition (totals and percentages)
  - Origin (i.e., natural/program -- see list below);
Done, with limitations (see, e.g., here). Currently we estimate natural vs. hatchery, except hatchery includes Duncan Channel. Program-specific estimates will come from modeling hatcheries and straying, which is nontrivial but on the agenda.

This would also give us hatchery vs. wild (population / program-specific) SAR:

* SARs by origin  ** (ignoring strays for now) **

  * Hatchery = Released hatchery fry and corresponding HORs (adults) 3-6 years later

Nontrivial. Will need model of hatcheries / straying.

On a related note...

Speaking of reporting, quick side note -- we should all discuss what, if any, additional output summaries we want added to the Markdown. For instance, I'd love to see plots of fry vs. spawner with observed data for each population.

Yes, this would be good to discuss in #10. I do have a working version of these pop-specific spawner-to-fry S-R plots. They're currently not very pretty, on account of high uncertainty in the state estimates corresponding to missing observations, but I can work on making a presentation-ready figure if that's of interest.

Thanks, everyone! I'm really excited about where we are heading with our chum IPM!!

Me too, computational headaches notwithstanding! I'm especially stoked to pull in data from the less intensively monitored populations and to continue building out the hatchery life cycle.