Edderic / mics-whales

Estimating Birth Rates of Humpback Whales in the Gulf of St. Lawrence

Plausible models #8

Open Edderic opened 5 years ago

Edderic commented 5 years ago

Hey @rsullivan-lord,

I was able to get PyABC set up and working. It looks like it's able to get posteriors that make sense, which is good! Instead of predicting for the whole data set (i.e. 1980 to 2016), I've limited it to the period of study (i.e. 2005 to 2016), which will make the inference faster / more accurate.

I've only tried it for the first individual so far. Now I'm going to do inference for all individuals in the data set who were unobserved for at least one of the years in the study. Here's the model I'm thinking of, which is similar to what you've seen so far.

[Image: model diagram (Screen Shot 2019-04-05 at 9 35 14 PM)]

The U term is some unobserved cause of birth. I have two models so far, one with this U term, and one without it. For the U term we'll use the year as a proxy (e.g. 0 for 2005, 1 for 2006, etc.). Do these make sense? Do you have other models you're thinking of?

Note: I'm not using Years Since Previous Birth (YSPB) as a predictor; I'm now only looking at whether or not the whale gave birth the previous year. The reason is partly computational: as I pointed out in #7, PyABC doesn't handle discrete variables well yet, and YSPB is discrete. It could be treated as a continuous variable, but I don't know how hard that would be to write. Programming its replacement (whale gave birth the year before), on the other hand, is easy. Plus, given the many unknown variables, I feel like removing it shouldn't make a big difference: if we fix one of those variables to some value, the other unknown variables can shift to fit the new constraint.
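To make the difference between the two predictors concrete, here is a minimal sketch. The dictionary layout, the function names, and the example record are all hypothetical and are not taken from the repository; `None` stands for a year in which the whale was not observed.

```python
# Hypothetical record for one whale: True = birth, False = no birth,
# None = whale not observed that year. Not real data.
history = {2005: False, 2006: True, 2007: False, 2008: None, 2009: False}

STUDY_START = 2005  # year index 0, as in "0 for 2005, 1 for 2006, etc."


def year_index(year):
    """Proxy for the U term: 0 for 2005, 1 for 2006, and so on."""
    return year - STUDY_START


def gave_birth_previous_year(births_by_year, year):
    """Return 1 or 0 if the previous year is known, None if it is unknown."""
    previous = births_by_year.get(year - 1)
    if previous is None:
        return None
    return int(previous)


def years_since_previous_birth(births_by_year, year):
    """Scan backwards for the most recent birth; None if a gap blocks the scan."""
    y = year - 1
    while True:
        value = births_by_year.get(y)
        if value is None:  # unobserved or outside the record: YSPB is unknown
            return None
        if value:
            return year - y
        y -= 1


print(gave_birth_previous_year(history, 2007))    # looks at 2006 only
print(years_since_previous_birth(history, 2009))  # blocked by the 2008 gap
```

The binary indicator only needs one earlier year to be known, while YSPB becomes unknown as soon as any unobserved year sits between the current year and the last recorded birth.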

Here's what I'm thinking that's left for the computation side of things:

For each individual that has unobserved years in 2005-2016, set priors for the 3 or so models that we're thinking of, then find posterior distributions. We then generate samples from the posterior distributions and keep the ones that align well enough with the observed data. Once we do this enough times, we can compute credible intervals for the birth rate for each year in 2005-2016. I think we'll be able to get this done this weekend, if not next.
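Roughly, the per-individual step could look like the sketch below. It uses plain rejection ABC written with the standard library, which is simpler than what PyABC does, and the two-parameter model (`p_rest`, `p_after_birth`), the uniform priors, the mismatch-count distance, and the observed record are all assumptions made for illustration, not the model in the diagram above.

```python
import random


def simulate_history(p_rest, p_after_birth, n_years, rng):
    """Simulate yearly births; the birth probability depends on last year's outcome."""
    births = []
    previous = 0  # assumption: no birth in the year before the study window
    for _ in range(n_years):
        p = p_after_birth if previous else p_rest
        current = 1 if rng.random() < p else 0
        births.append(current)
        previous = current
    return births


def distance(simulated, observed):
    """Count mismatches, ignoring years in which the whale was not observed."""
    return sum(1 for s, o in zip(simulated, observed) if o is not None and s != o)


def abc_rejection(observed, n_draws, epsilon, seed=0):
    """Draw parameters from uniform priors and keep simulations close to the data."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        p_rest = rng.random()         # assumed prior: Uniform(0, 1)
        p_after_birth = rng.random()  # assumed prior: Uniform(0, 1)
        simulated = simulate_history(p_rest, p_after_birth, len(observed), rng)
        if distance(simulated, observed) <= epsilon:
            accepted.append(
                {"p_rest": p_rest, "p_after_birth": p_after_birth, "births": simulated}
            )
    return accepted


def birth_probability_by_year(accepted):
    """Fraction of accepted simulations with a birth in each year."""
    n_years = len(accepted[0]["births"])
    return [sum(a["births"][t] for a in accepted) / len(accepted) for t in range(n_years)]


# Hypothetical record for 2005-2016: 1 = birth, 0 = no birth, None = unobserved.
observed = [0, 1, 0, None, None, 1, 0, 0, None, 1, 0, 0]
accepted = abc_rejection(observed, n_draws=20000, epsilon=0)
probs = birth_probability_by_year(accepted)
print(len(accepted), [round(p, 2) for p in probs])
```

With `epsilon=0`, every accepted simulation reproduces the observed years exactly, so the only variation left is in the unobserved years, which is where the extra uncertainty comes from.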

rsullivan-lord commented 5 years ago

Ok great, just doing between 2005 and 2016 for now is fine. The model mostly makes sense, although maybe verbally walking through it briefly on the phone would help me.

Also fine to not use YSPB. Maybe we could discuss the difficulty of discrete variables? I'll re-read what you wrote in #7.

Once we have credible intervals for the birth rate for each year, is it just a simple calculation to determine the average? Do we not care about the average? Is that the whole point of using Bayesian? How do we analyze change over time?

Edderic commented 5 years ago

> Ok great, just doing between 2005 and 2016 for now is fine. The model mostly makes sense, although maybe verbally walking through it briefly on the phone would help me.
>
> Also fine to not use YSPB. Maybe we could discuss the difficulty of discrete variables? I'll re-read what you wrote in #7.

Yeah we can totally do that. I'll message you to discuss.

> Once we have credible intervals for the birth rate for each year, is it just a simple calculation to determine the average? Do we not care about the average? Is that the whole point of using Bayesian? How do we analyze change over time?

At the end, once we have the posterior distributions, it is trivial to compute summary statistics (e.g. 95% credible intervals, mean, etc.). That's one benefit. A bigger benefit, IMO, of the Bayesian approach is that we're able to incorporate lots of sources of uncertainty that others weren't able to.

  1. In a given year, there were experienced mothers that were unobserved. Arso Civil made the assumption that if a whale was not observed in a given year but gave birth the year after or the year before, then the whale did not give birth that year. We also make that assumption. But what about situations where there are larger gaps?

  2. Similarly, in a given year, there were RAFs that hadn't given birth before. Can we say something about birth rate that takes into account these previously excluded RAFs? Arso Civil's estimation of birth rate through inter-birth interval (IBI) is biased towards experienced mothers, which means birth rate estimates under that analysis are probably overestimates.
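The adjacent-year assumption in point 1 can be written down as a small rule. This is only a sketch of my reading of that assumption; the list encoding (1 = birth, 0 = no birth, `None` = unobserved) and the function name are hypothetical.

```python
def fill_adjacent_gaps(history):
    """Set an unobserved year to 0 when a birth is recorded the year before or after.

    history is a list in year order: 1 = birth, 0 = no birth, None = unobserved.
    Neighbours are read from the original record, so filled values do not cascade.
    """
    filled = list(history)
    for i, value in enumerate(history):
        if value is not None:
            continue
        before = history[i - 1] if i > 0 else None
        after = history[i + 1] if i + 1 < len(history) else None
        if before == 1 or after == 1:
            filled[i] = 0
    return filled


# Hypothetical record with a one-year gap and a three-year gap.
record = [1, None, 0, None, None, None, 1]
print(fill_adjacent_gaps(record))
```

Only the gap years that touch a recorded birth get resolved; the years in the middle of the longer gap stay unknown, and those are the ones the simulation has to fill in.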

Bayesian analysis lets us incorporate as much domain knowledge as possible into our estimates, which hopefully would help us model reality more accurately. Once sensible priors and data-generating processes are specified, we let simulation do the heavy lifting to make educated guesses about the parameters of interest. Regarding the first point, unlike in Arso Civil's paper, we actually make estimates of what happened during gaps. Those unobserved moments increase uncertainty as to what the real birth rate is. Our Bayesian approach lets us incorporate that uncertainty explicitly. Once we have posterior distributions that fit our assumptions and the given data, we can simulate events for that individual whale for unobserved times. In some simulations, the whale would have given birth; in others, the whale would not have given birth.

Once we have the simulations, we are one step closer to getting plausible birth rates for each year. We look at the ratio of births to the number of RAFs in the simulated data sets. Then we imagine different potential birth rates r (e.g. 0.01, 0.02,...,0.99, 1.00). I'm assuming a uniform prior for these potential birth rates. Then, for a given r, we find how likely we are to observe the number of births in the simulated data set. That gives us a distribution over r values, and we sample from it. Once that's done, we have tons of potential values for birth rates. The most plausible ones will be more represented in that collection. Then you can take the 95% most plausible ones and that becomes your 95% credible interval for the birth rate of that year. Take the mean and that becomes the mean estimate for that year. Repeat this for each year of interest, so we'll have estimates of birth rate for each year.
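Here is a minimal sketch of that last step for a single year. The binomial likelihood is my assumption for "how likely we are to observe the number of births", the interval is an equal-tailed one (a highest-density interval would be another reading of "95% most plausible"), and the simulated counts are made up for illustration.

```python
import random


def birth_rate_posterior(n_births, n_females, grid_size=100):
    """Posterior over r on the grid 0.01, 0.02, ..., 1.00 with a uniform prior."""
    grid = [(i + 1) / grid_size for i in range(grid_size)]
    # Binomial likelihood up to a constant; the binomial coefficient does not
    # depend on r, so it cancels when the weights are normalised.
    likelihood = [r ** n_births * (1 - r) ** (n_females - n_births) for r in grid]
    total = sum(likelihood)
    weights = [value / total for value in likelihood]
    return grid, weights


def sample_birth_rates(simulated_counts, draws_per_dataset, seed=0):
    """Pool draws of r across simulated data sets for one year."""
    rng = random.Random(seed)
    samples = []
    for n_births, n_females in simulated_counts:
        grid, weights = birth_rate_posterior(n_births, n_females)
        samples.extend(rng.choices(grid, weights=weights, k=draws_per_dataset))
    return samples


def summarize(samples, mass=0.95):
    """Return the mean and an equal-tailed credible interval."""
    ordered = sorted(samples)
    tail = (1 - mass) / 2
    lower = ordered[int(tail * len(ordered))]
    upper = ordered[int((1 - tail) * len(ordered)) - 1]
    mean = sum(ordered) / len(ordered)
    return mean, lower, upper


# Hypothetical (births, RAFs) counts from three simulated data sets for one year.
simulated_counts = [(12, 40), (14, 40), (11, 40)]
samples = sample_birth_rates(simulated_counts, draws_per_dataset=1000)
mean, lower, upper = summarize(samples)
print(round(mean, 3), lower, upper)
```

Running this once per year in 2005-2016 would give one mean and one interval per year, which is what we'd then compare to look at change over time.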