Blubber thickness likelihood component

lentinj commented 1 month ago

A likelihood component / modification that can do:

__len -> LeanWeight
__wgt (i.e. total weight) -> BlubberThickness (TotalWeight = LeanWeight + FatWeight, FatWeight ~ BlubberThickness)

lentinj commented 1 month ago

What will the actual likelihood mechanism look like?

Given current (total) weight we can derive blubberthickness: d_lt = ((W_lt/Wmax_l - 0.5) * 100 - 4.44) / (5693 * (L_t/W_lt)^0.5) (https://github.com/vbartolino/g3_GoB/issues/8#issuecomment-2117058067)

Meeting notes:

Likelihood component, converting weight to blubber thickness
- surveyindicies-style regression after the fact, comparing model aggregations to individuals
- Timeseries of individuals' blubber thickness mapped to model aggregation
- Sum(time) Sum(individuals in data) (data_i - model_agg_closest_to_individual)^2

So the aggregation is quite different. Instead of reading the length/age aggregates from the data, we need to read from the model, and assign observations to the nearest slot.

Ideally we'd then have a table of year/step/length/blubberthickness, and fill in the gaps with mean blubberthickness (assuming the blubber thickness doesn't have age only to help root out edge cases). Use tuple mechanisms?

We then feed both tables into ~g3l_distribution_surveyindices_linear().

lentinj commented 1 month ago

Also related, https://github.com/gadget-framework/gadget3/issues/55 (as we'll potentially have gaps in the timeseries as well as the same problems with retro_years)

lentinj commented 1 month ago

@bthe @vbartolino

I've made a first stab at this locally, it'd be good to see what you make of it before I go much further.

I've made g3l_individual that takes a data frame with (some of) year / step / area / age / length / obs columns. Unlike g3l_distribution we take this data.frame ~directly into the model (we take each column as a vector, and cross our eyes and pretend it's a data.frame). Each step we look for any intersection in the stocks provided, and use transform_f (i.e @vbartolino 's formula above) to turn them into the predicted value.

Finally, we shovel the obs and model vectors into surveyindices, and do linear regression on them much as before.

Here's an example usage:

st <- g3_stock("stst", 1:10) |> g3s_age(3,5)

obs_df <- expand.grid(
    year = 1990:1994,
    age = 3:5,
    bt = 0)
obs_df$bt = seq_len(nrow(obs_df))

g3_to_r(list(
    g3a_time(1990, 1994),
    g3a_otherfood(st),
    gadget3:::g3l_individual(
        "bt",
        obs_df,
        list(st),
        transform_f = g3_formula(
            ((stock_ss(stock__wgt, vec = single)/wmax - 0.5) * 100 - 4.44)
              /
            (5693 * (stock__midlen[[stock__length_idx]]/stock_ss(stock__wgt, vec = single))^0.5),
            wmax = g3_parameterized("wmax", by_stock = TRUE), # TODO: by_length
            end = NULL ),
        function_f = gadget3:::g3l_individual_surveyindices(fit = "linear") ),
    NULL))

...and here's the code that comes out, which looks roughly right:

        wmax <- param[["stst.wmax"]]
        comment("Gather model predictions matching individuals in nll_ind_bt__obs")
        for (nll_ind_bt__row_idx in seq_along(nll_ind_bt__year)) {
            if (nll_ind_bt__year[[nll_ind_bt__row_idx]] != cur_year) next
            age <- nll_ind_bt__age[[nll_ind_bt__row_idx]]
            if (age >= stst__minage && age <= stst__maxage) { stst__age_idx <- age - stst__minage + 1L
                for (stst__length_idx in seq_along(stst__minlen)) {
                  length <- stst__midlen[[stst__length_idx]]

                  nll_ind_bt__model[nll_ind_bt__row_idx] <- nll_ind_bt__model[nll_ind_bt__row_idx] +
                    (((stst__wgt[stst__length_idx, stst__age_idx]/wmax -
                      0.5) * 100 - 4.44)/(5693 * (stst__midlen[[stst__length_idx]]/stst__wgt[stst__length_idx,
                      stst__age_idx])^0.5))
                  nll_ind_bt__model_n[nll_ind_bt__row_idx] <- nll_ind_bt__model_n[nll_ind_bt__row_idx] + 1
                }
            }
        }
        nll_ind_bt__nll <- nll_ind_bt__nll + (if (cur_time != total_steps) 0 else surveyindices_linreg(
                nll_ind_bt__model/nll_ind_bt__model_n,
                nll_ind_bt__obs, NaN, 1)
        comment("Add nll for nll_ind_bt__obs vs nll_ind_bt__model")
        nll_ind_bt__weight <- param[["bt_weight"]]
        nll <- nll + nll_ind_bt__weight * nll_ind_bt__nll[[1]]

Where:

nll_ind_bt__obs is the observed value column from the data frame
nll_ind_bt__model is the predicted value column. Note that this may be the total of several values, e.g. if you don't specify age, only length, or you have both immature/mature stocks. We count the number of values in nll_ind_bt__model_n and take the mean later. Something along these lines could also be a good solution to #55, as in the case of retro years, __model_n will also be 0.

So, am I on the right track? Does this sound generically useful, and any better suggestions for names of g3l_individual & g3l_individual_surveyindices (which is clearly not right, but I'm assuming you'd get what I'd mean at least for now)? Is taking the mean of multiple values the right thing to do?

Plenty of other TODOs:

[x] For Wmax we need to say g3_parameterized(...by_length = TRUE) (EDIT: This is more complex than you'd hope, as whilst for this case it's easy, in general we'd form a lengthgroup vector, which we can't do atm)
[x] Specifying the transform is a bit of a complicated mouthful. At very least some shortcuts to make the formula more obvious would be nice
[x] Actually support NaNs in linreg, skipping values. Use a single definition across this and regular surveyindices.

bthe commented 1 month ago

I think it would be generally useful to have means to include individual measurements, as it opens up the possibility to do direct regression analysis based on model inputs. Regarding the transform function, would it make sense to implement standard regression formulas? @vbartolino what do you think?

vbartolino commented 1 month ago

I wonder, shouldn't a likelihood for the blubber thickness be something like an extension of the CatchStatistics in g2?

Why should I think at them differently? Otherwise, I generally agree with the example above for what I can understand

Is taking the mean of multiple values the right thing to do?

I think so. In the future it might be useful to allow also for some measure of dispersion, i.e. sd or cv? I'm mainly thinking about other type of biological data, ie fish weight...

... any better suggestions for names of g3l_individual

depending on answer to my questions above, it could be g3l_catchstatisticsif we're looking for generality and consistency with g2

I've made g3l_individual that takes a data frame with (some of) year / step / area / age / length / obs columns.

It seems OK to me, especially considering generalisation. In the case of blubber, I think data will be aggregated over all ages

vbartolino commented 1 month ago

I think it would be generally useful to have means to include individual measurements, as it opens up the possibility to do direct regression analysis based on model inputs.

I see the attractiveness of feeding the model with individual data, but isn't it in contrast to the way we approach all the other data components (i.e., growth data, stomachs)? With most fish data the number of data points would be pretty large and I wonder if it could slow calculations. Which are the costs and benefits?

Regarding the transform function, would it make sense to implement standard regression formulas? @vbartolino what do you think?

not sure I fully understand, but it's possibly link to the previous point

lentinj commented 1 month ago

I see the attractiveness of feeding the model with individual data, but isn't it in contrast to the way we approach all the other data components (i.e., growth data, stomachs)?

Yeah, it's very different.

With g3l_distribution (i.e. everything else):

We consider our data dense; we're forming an array of counts for year/step/age/.../length columns. We can't have gaps in the array, any missing points are assumed to be 0
We aggregate the predicted model data based on the observations; if the input data has lengths 1, .. 10, .. 95, then we aggregate the model length data into buckets of [1,10), [10, 95).

With g3l_individual:

The incoming data is sparse; we have a data.frame of values, it's perfectly reasonable to have gaps and irregularly-spaced data.
We round the incoming data to the nearest corresponding data point(s) in the model. If the observation data has lengths 1, 10, 95, then we compare these with the model lengths 0, 10, 90 (assuming the length bins are seq(0, 100, 10)).

As you say, sparse vs. dense has normal storage concerns, but the way the aggregation works is probably the bigger difference.

it could be g3l_catchstatistics if we're looking for generality and consistency with g2

I'm not sure how much the inner workings of g3l_catchstatistics line up with the above (will dig after lunch). But very least, g3l_individual isn't catch statistics, it's overall stock statistics. It wouldn't be impossible to change, but would "mean blubber thickness of an individual length 80" and "mean blubber thickness of an individual length 80 that this fleet caught" be that different?

But including number / mean columns, or offering to aggregate them for you, does make sense though. EDIT: And if we did, the formluae that catchstatistics offers would also make sense.

vbartolino commented 1 month ago

We consider our data dense; we're forming an array of counts for year/step/age/.../length columns. We can't have gaps in the array, any missing points are assumed to be 0

but in practice we often have gaps in the obs_data for instance age-length, isn't it? Does it mean that g3 assumes obs=0 for those combinations? I'm sure there's a good reason why this is needed (though not sure I understand why) but this was not the case with g2, right?

But very least, g3l_individual isn't catch statistics, it's overall stock statistics. It wouldn't be impossible to change, but would "mean blubber thickness of an individual length 80" and "mean blubber thickness of an individual length 80 that this fleet caught" be that different?

Usually we use scientific survey data to infer on the stock biology because supposed to be less biased (ie not influenced by targeting behaviour), but in practice data collected are seldom independent from the sampling mean (gear, fleet, ...).

It's perfectly OK to have it as stock statistics, and in most cases we are interested in biological characteristics of the stock. In the case of blubber thickness, if possible we should use only seals from hunting and filter out animal entangled in nets as (I speculate) they could be biased towards individuals more associated to the gear (e.g., individuals in poor conditions for hunting wild preys or opportunist lazy individuals making an easy life from stealing).

My only thought is that in a general situation gear selectivity can have a influence for instance on a metric like weight at age

lentinj commented 1 month ago

but in practice we often have gaps in the obs_data for instance age-length, isn't it?

I oversimplified a bit, we do allow gaps in age, but generally not in other dimensions:

> gadget3::g3_distribution_preview(data.frame(year = c(1999, 2004), age = c(2, 4), number = c(1, 2)))
, , time = 1999

       age
length  age2 age3 age4
  0:Inf    1   NA   NA

, , time = 2004

       age
length  age2 age3 age4
  0:Inf   NA   NA    2

I think the rationale was in the case of years it's more likely a gap in the data, for ages (EDIT: or lengths) it was more likely an aggregation returning no samples, but this was a long time ago now. @bthe any comments?

Gadget2 is certainly similar. It definitely aggregates model / observations into a fixed structure array (based on the aggregation files), and then pours the data into a zero-initialised array: https://github.com/gadget-framework/gadget2/blob/master/src/catchdistribution.cc#L279-L310 I can't see any conditional that would handle gaps, but I may have missed something.

lentinj commented 1 month ago

I'm not sure how much the inner workings of g3l_catchstatistics line up with the above

Catchstatistics seems to be more g3l_distribution()-esque, pouring obs & model means & counts into arrays based on the aggregations provided. g3l_distribution() can't currently emulate catchstatistics, but it could:

Let you define a measurement formula, which for catchstatistics would be length, us blubber thickness
As well as gather number of individuals (as we already do), gather the sum & squared-sum of the measurement
Use these for catchstatistics-esque comparison functions

All this is very do-able, in fact I'd already made a start on it. But g3l_individual definitely isn't close enough to Catchstatistcs to share a name.

Which approach makes more sense depends on what the data looks like I think.

vbartolino commented 3 weeks ago

Which approach makes more sense depends on what the data looks like I think.

It is not uncommon that modellers have access to already compiled time series of blubber thickness, which is an indicator often used to evaluate the nutritional status of seals and other marine mammals.

Also in the GoB study, we've access to individual data on blubber thickness which at first exploration look good to inform on seasonal variation blubber_thickness_ringedseal_byQ_bySex

but for the long time variation I'll rely on a time-series already available from Kauhala et al 2018 (fig.7a) to be extended with few more points from the individual data if meaningful blubber_ringedseal_matF_Kauhala_etal_2018

So my understanding is that a catchstatistics likelihood more than individual would be relevant in this case. Do you agree?

lentinj commented 3 weeks ago

For the former, do we have length available for individuals too, or just sex/age/step/bt? For how many years does the data exist?

I'm tempted to say we stick with g3l_individual, and add some catchstatistics-style likelihood functions for it.

For both data sets, g3l_individual/g3l_distribution won't make much practical difference. The end result of both will be to aggregate by year,step,age or year. The data sets are small enough that the extra weight of g3l_individual's data.frame won't be a major bother.

But g3l_individual will stop us having to worry about the quarter-3 gaps in data. For g3l_distribution we'd have to paper over the gaps somehow---there was a very old plan of using random effects for this.

It's also less work than modifying g3l_distribution :)

vbartolino commented 3 weeks ago

For the former, do we have length available for individuals too, or just sex/age/step/bt? For how many years does the data exist?

from the individual data length is available. Individual data at hand are from 2007 but they are pretty thin, see number of all individuals per year-quarter

          quarter
    year    1  2  3  4
      2007  0  0  1  1
      2008  0 22  0  3
      2009  0  0  0  4
      2011  0  0  0  3
      2012  0  1  1  4
      2013  0  0  0  5
      2014  0  0  0  3
      2015  0 15  5 16
      2016  0 16  5  3
      2017  0  2  4 10
      2018  1  8  0 17
      2019  0 12 18  3
      2020  0  3  1  6

and here only females which is what we would need for the model

          quarter
year       1 2 3 4
      2007 0 0 1 0
      2008 0 9 0 1
      2009 0 0 0 1
      2011 0 0 0 2
      2012 0 1 0 2
      2013 0 0 0 3
      2014 0 0 0 1
      2015 0 8 4 0
      2016 0 5 1 1
      2017 0 1 1 7
      2018 0 4 0 7
      2019 0 7 6 2
      2020 0 3 1 0

lentinj commented 3 weeks ago

Yeah, g3a_individual would be a better choice. I might rename it to g3a_sparse, which better reflects this.

vbartolino commented 3 weeks ago

Yeah, g3a_individual would be a better choice. I might rename it to g3a_sparse, which better reflects this.

would it allow to use also the already available longer time series in addition to the individual data?

lentinj commented 3 weeks ago

would it allow to use also the already available longer time series in addition to the individual data?

They'll be a separate likelihood component but yes, I think this should work.

lentinj commented 3 weeks ago

I'm definitely going to rename this g3l_sparse or similar. As I've said before, there's 2 aspects of this that makes it different to g3l_distribution; a sparse not dense data structure, and selecting nearest length not aggregating by length.

However, the latter part is a happy accident more than anything. We aggregate in other dimensions, and if we supported notation like "[10, 40)" in the length column we would here too. Length is only weird since it's the only continuous dimension we have.

Anyway, I've got a more catchstatistics-style likelihood function going. Will tidy up and have something by the end of the week.

lentinj commented 3 weeks ago

So, for g2 catchstatistics, sum-of-squares are weighted by 1/variance. But if variance == 0, that datapoint is ignored.

This is a bit problematic for g3. The obvious answer, 1 / avoid_zero(variance), will give data points with no variance a very high weighting, not 0. The other option would be to filter data points where the variance is 0, but I'm not sure we can guarantee that this isn't due to parameter setting, and thus making a function that won't AD properly.

It seems a bit of an odd choice generally, but I guess is less of a practical issue with the amount of data catchstatistics tossed around in g2.

Any thoughts?

bthe commented 3 weeks ago

One way to do the weighting when the variance is set to 0 is simply replace this 0 with a parameter and let the user deal with what this value should be.

lentinj commented 3 weeks ago

We could do something like 1 / ( pmin(user_value, pmax(0.01 -x, 0)) + variance ), although what value you'd get at 0 would be somewhat approximate due to logspace_add.

My current plan is to offer a choice of weighting: 1 / avoid_zero(model_variance), 1 / avoid_zero(obs_variance), or a hard-coded value (read: parameter or 1).

vbartolino commented 3 weeks ago

My current plan is to offer a choice of weighting: 1 / avoid_zero(model_variance), 1 / avoid_zero(obs_variance), or a hard-coded value (read: parameter or 1).

this should cover most possibilities, isn't it?

lentinj commented 3 weeks ago

My current plan is to offer a choice of weighting: 1 / avoid_zero(model_variance), 1 / avoid_zero(obs_variance), or a hard-coded value (read: parameter or 1).

this should cover most possibilities, isn't it?

Although diverging from what g2 does, it seems a reasonable bunch of default choices. It may even be worth skipping the avoid_zero(), and if some of the variances are 0 then it makes it more obvious that you should be providing a weighting yourself.

bthe commented 3 weeks ago

and if some of the variances are 0 then it makes it more obvious that you should be providing a weighting yourself.

I would think so as well, I think the way it was done in gadget2 i.e. unceremonously ignore all entries with 0 variance is certainly not the way to go.

lentinj commented 3 weeks ago

Right, https://github.com/gadget-framework/gadget3/commit/7cd1cb53e62bb223b1b66d8bf57121ec6a67af06 is a reworked version, now called g3l_sparsesample(), with both _linreg() & catchstatistics-style _sumsquares().

To support sumsquares the observation data frame has now columns mean (of the measured value), number (of measurements, defaults to 1 if the column is missing), stddev (of measurements, only used if you use it for weighting).

Remaining things to think about:

Documentation: Hopefully the test is at least a reasonable example, but obviously more is required.
How do to do $W_{max}$: I've still not thought of any good solutions for offering a g3_parameterized(by_length = TRUE), but it could be the case that we don't need it, if for instance $W{max} = p_0 l + p_1$ then we just need to single parameters and it's a non-problem. We still technically have the ability to use parameter vectors, but I suspect they're so unused loads of bugs will come out of the woodwork.
Catch vs. abundance: We currently only support measurements on abundance, not catch data. It's not an impossibility to add, just it'd be nice to get what's there right first.
Vector optimisation: In the case of e.g. per-year measurements, we should vectorize the operation over the whole year rather than working a value at a time. Again, a complication I'm avoiding until we've got the basics right.
Reporting: What needs reporting? Obviously model predictions, but should we offer mean / stddev (we store _sum, _sqsum & _n, as we're doing running totals)? Do we need to reassemble the data.frame to make the vector you're looking at more obvious?

vbartolino commented 1 week ago

Right, https://github.com/gadget-framework/gadget3/commit/7cd1cb53e62bb223b1b66d8bf57121ec6a67af06 is a reworked version, now called g3l_sparsesample(), with both _linreg() & catchstatistics-style _sumsquares().

I'm giving a try to incorporate both the long time series and the individual blubber data. I have a warning (see at the bottom) when I try to use age or length aggregations. Is it relevant?

> g3l_sparsesample(
+       "bt.rin",
+       blubthick.rin,
+       stocks = list(mat_stock),
+       measurement_f = g3_formula(
+             ((stock_ss(stock__wgt, vec = single)/wmax - 0.5) * 100 - 4.44) /
+             (5693 * (stock__midlen[[stock__length_idx]]/stock_ss(stock__wgt, vec = single))^0.5),
+             wmax = g3_parameterized("wmax", by_stock = TRUE), # TODO: by_length
+             end = NULL ),
+      function_f = g3l_sparsesample_sumsquares())
+ $`010:g3l_sparsesample    :bt.rin              :001`
~g3_with(`:=`(wmax, g3_param("rin_mat.wmax", source = "g3l_sparsesample")), 
    {
        debug_label("Gather model predictions matching individuals in nll_spabund_bt.rin__obs")
        for (nll_spabund_bt.rin__row_idx in seq_along(nll_spabund_bt.rin__year)) {
            if (nll_spabund_bt.rin__year[[nll_spabund_bt.rin__row_idx]] < 
                cur_year) 
                next
            if (nll_spabund_bt.rin__year[[nll_spabund_bt.rin__row_idx]] > 
                cur_year) 
                break
            if (nll_spabund_bt.rin__step[[nll_spabund_bt.rin__row_idx]] != 
                cur_step) 
                next
            g3_with(`:=`(length, nll_spabund_bt.rin__length[[nll_spabund_bt.rin__row_idx]]), 
                `:=`(age, nll_spabund_bt.rin__age[[nll_spabund_bt.rin__row_idx]]), 
                if (age >= rin_mat__minage && age <= rin_mat__maxage) 
                  g3_with(`:=`(rin_mat__age_idx, g3_idx(age - 
                    rin_mat__minage + 1L)), `:=`(area, rin_mat__area), 
                    `:=`(rin_mat__area_idx, g3_idx(1L)), for (rin_mat__length_idx in seq_along(rin_mat__minlen)) if (rin_mat__minlen[[rin_mat__length_idx]] <= 
                      length && length < rin_mat__maxlen[[rin_mat__length_idx]]) {
                      g3_with(`:=`(measurement, (rin_mat__num[rin_mat__length_idx, 
                        rin_mat__area_idx, rin_mat__age_idx] * 
                        (((rin_mat__wgt[rin_mat__length_idx, 
                          rin_mat__area_idx, rin_mat__age_idx]/wmax - 
                          0.5) * 100 - 4.44)/(5693 * (rin_mat__midlen[[rin_mat__length_idx]]/rin_mat__wgt[rin_mat__length_idx, 
                          rin_mat__area_idx, rin_mat__age_idx])^0.5)))), 
                        {
                          nll_spabund_bt.rin__model_sum[nll_spabund_bt.rin__row_idx] <- nll_spabund_bt.rin__model_sum[nll_spabund_bt.rin__row_idx] + 
                            measurement
                          nll_spabund_bt.rin__model_sqsum[nll_spabund_bt.rin__row_idx] <- nll_spabund_bt.rin__model_sqsum[nll_spabund_bt.rin__row_idx] + 
                            measurement^2
                          nll_spabund_bt.rin__model_n[nll_spabund_bt.rin__row_idx] <- nll_spabund_bt.rin__model_n[nll_spabund_bt.rin__row_idx] + 
                            1
                        })
                      break
                    }))
        }
    })
<environment: 0x5638266c3c08>

$`010:g3l_sparsesample    :bt.rin              :002`
~{
    nll_spabund_bt.rin__nll <- if (cur_time != total_steps) 
        nll_spabund_bt.rin__nll
    else g3l_sparsesample_sumsquares_stddev(nll_spabund_bt.rin__model_sum/nll_spabund_bt.rin__model_n, 
        nll_spabund_bt.rin__obs_mean, (1/(nll_spabund_bt.rin__model_sqsum/nll_spabund_bt.rin__model_n - 
            (nll_spabund_bt.rin__model_sum/nll_spabund_bt.rin__model_n)^2)) * 
            nll_spabund_bt.rin__obs_n)
    g3_with(`:=`(local_nll, sum(nll_spabund_bt.rin__nll)), {
        debug_label("Add nll for nll_spabund_bt.rin__obs vs nll_spabund_bt.rin__model")
        nll_spabund_bt.rin__weight <- g3_param("bt.rin_weight", 
            optimise = FALSE, value = 1, source = "g3l_sparsesample")
        nll <- nll + nll_spabund_bt.rin__weight * local_nll
    })
}
<environment: 0x563820dec310>

Warning message:
In g3s_sparsedata(c("nll", type = "spabund", name = nll_name), obs_df[,  :
  NAs introduced by coercion

lentinj commented 1 week ago

I wouldn't ever expect a warning. Your example is based on older code though, possibly you need to update gadget3 with remotes::install_github("gadget-framework/gadget3", "likelihood-individual") or so.

The following example seems to at least compile without a warning (note I've used wgt & length in measurement_f as the convenient shorthands it offers):

mat_stock <- g3_stock(c("thing", "mat"), 1:10) |>g3s_age(1,5) |>g3s_livesonareas(1)

blubthick.rin <- data.frame(
    year = 1998,
    step = 1,
    mean = 100 )

x <- g3l_sparsesample(
      "bt.rin",
      blubthick.rin,
      stocks = list(mat_stock),
      measurement_f = g3_formula(
            ((wgt/wmax - 0.5) * 100 - 4.44) /
            (5693 * (length/wgt)^0.5),
            wmax = g3_parameterized("wmax", by_stock = TRUE), # TODO: by_length
            end = NULL ),
     function_f = g3l_sparsesample_sumsquares())

If this doesn't help, what's the structure of blubthick.rin?

vbartolino commented 1 week ago

I mean "when I try to use age or length aggregations" like

blubthick.rin <- a <- mfdb_sample_totalweight(mdb, c('age','length'), list(
    area            = mfdb_group('1'="SD3031"),
    timestep        = mfdb_timestep_quarterly,
    year            = year_range,
    sampling_type   = 'LND',
    species         = 'MAM',
    ## length          = ll,
    age = mfdb_interval('age',c(10,27), open_ended = c('upper')),
    data_source     = 'blubber_timeseries'))[[1]] %>%
    rename(mean = total_weight) %>%
    select(year,step,age,length,mean)

> blubthick.rin[1:5,]
  year step   age length mean
1 1991    2 age10    all 44.3
2 1992    2 age10    all 38.8
3 1993    2 age10    all 36.0
4 1994    2 age10    all 40.2
5 1995    2 age10    all 36.7

lentinj commented 1 week ago

Hrm, I'd not thought of that, apologies. The warning is from trying to convert the ages:

> as.numeric(blubthick.rin$age)
[1] NA NA NA NA NA
Warning message:
NAs introduced by coercion

You could bodge for now:

blubthick.rin$length <- NULL
blubthick.rin$age <- as.numeric(gsub("^age", "", blubthick.rin$age))

I'll have a think about how to get this working properly though. Actual aggregation (i.e. ages 3:5, length [10,20)) will be tricky, but at very least we could handle the easy cases as above.

vbartolino commented 1 week ago

as anticipated, I've two likelihood components for the blubber thickness:

time series of Q2 bt from Kahuala et al (inform on long term changes)
individual data on bt (inform on seasonal pattern)

I have some problem when compiling the model with the individual data. It complains when there are individuals in the same year,step,age with different lengths (ie, year 2019, step 2, age 9)

    year step age length mean
5  2019    2  12  130.0   30
6  2019    2  14  140.0   30
7  2019    3   5  112.0   50
8  2019    3   5  123.0   59
9  2019    3   6  133.0   50
10 2019    3   9  134.5   60
11 2019    3   9  128.0   38
12 2019    3   9  123.0   48
13 2019    4   6  134.5   52
14 2019    4  10  123.0   51

while it compiles when the data are aggregated with averaged lengths

   year step age length     mean
2  2019    2   3  120.0 28.00000
3  2019    2   7  130.0 30.00000
4  2019    2   8  120.0 20.00000
5  2019    2  12  130.0 30.00000
6  2019    2  14  140.0 30.00000
7  2019    3   5  117.5 54.50000
8  2019    3   6  133.0 50.00000
9  2019    3   9  128.5 48.66667
10 2019    4   6  134.5 52.00000
11 2019    4  10  123.0 51.00000

This is the type of error

> obj.fun <- gadget3::g3_tmb_adfun(tmb_model, tmb_param)
g++ -std=gnu++14 -I"/usr/local/lib/R412/lib/R/include" -DNDEBUG -I"/usr/local/lib/R412/lib/R/library/TMB/include" -I"/usr/local/lib/R412/lib/R/library/RcppEigen/include"  -DTMB_SAFEBOUNDS -DTMB_EIGEN_DISABLE_WARNINGS -DLIB_UNLOAD=R_unload_g30_12_1_999_tmb1_9_14_65bb539f4b0bd1a003e890dd3949535125d9b84e  -DTMB_LIB_INIT=R_init_g30_12_1_999_tmb1_9_14_65bb539f4b0bd1a003e890dd3949535125d9b84e  -DTMBAD_FRAMEWORK  -I/usr/local/include   -fpic  -std=gnu++11 -Wno-ignored-attributes -DEIGEN_PERMANENTLY_DISABLE_STUPID_WARNINGS -O3 -flto=auto -march=native -c /tmp/RtmpXUmzQa/g30_12_1_999_tmb1_9_14_65bb539f4b0bd1a003e890dd3949535125d9b84e.cpp -o /tmp/RtmpXUmzQa/g30_12_1_999_tmb1_9_14_65bb539f4b0bd1a003e890dd3949535125d9b84e.o
In file included from /usr/local/lib/R412/lib/R/library/RcppEigen/include/Eigen/Core:164:0,
                 from /usr/local/lib/R412/lib/R/library/RcppEigen/include/Eigen/Dense:1,
                 from /usr/local/lib/R412/lib/R/library/TMB/include/TMB.hpp:92,
                 from /tmp/RtmpXUmzQa/g30_12_1_999_tmb1_9_14_65bb539f4b0bd1a003e890dd3949535125d9b84e.cpp:1:
/usr/local/lib/R412/lib/R/library/RcppEigen/include/Eigen/src/Core/AssignEvaluator.h: In instantiation of ‘void Eigen::internal::call_assignment_no_alias(Dst&, const Src&, const Func&) [with Dst = Eigen::Array<TMBad::global::ad_aug, -1, 1, 0, -1, 1>; Src = Eigen::Array<int, -1, 1>; Func = Eigen::internal::assign_op<TMBad::global::ad_aug, int>]’:
/usr/local/lib/R412/lib/R/library/RcppEigen/include/Eigen/src/Core/PlainObjectBase.h:797:41:   required from ‘Derived& Eigen::PlainObjectBase<Derived>::_set_noalias(const Eigen::DenseBase<OtherDerived>&) [with OtherDerived = Eigen::Array<int, -1, 1>; Derived = Eigen::Array<TMBad::global::ad_aug, -1, 1, 0, -1, 1>]’
/usr/local/lib/R412/lib/R/library/RcppEigen/include/Eigen/src/Core/PlainObjectBase.h:883:7:   required from ‘void Eigen::PlainObjectBase<Derived>::_init1(const Eigen::DenseBase<ElseDerived>&) [with T = tmbutils::vector<int>; OtherDerived = Eigen::Array<int, -1, 1>; Derived = Eigen::Array<TMBad::global::ad_aug, -1, 1, 0, -1, 1>]’
/usr/local/lib/R412/lib/R/library/RcppEigen/include/Eigen/src/Core/Array.h:210:31:   required from ‘Eigen::Array<_Scalar, _Rows, _Cols, _Options, _MaxRows, _MaxCols>::Array(const T&) [with T = tmbutils::vector<int>; _Scalar = TMBad::global::ad_aug; int _Rows = -1; int _Cols = 1; int _Options = 0; int _MaxRows = -1; int _MaxCols = 1]’
/usr/local/lib/R412/lib/R/library/TMB/include/tmbutils/vector.hpp:24:22:   required from ‘tmbutils::vector<Type>::vector(T1) [with T1 = tmbutils::vector<int>; Type = TMBad::global::ad_aug]’
/tmp/RtmpXUmzQa/g30_12_1_999_tmb1_9_14_65bb539f4b0bd1a003e890dd3949535125d9b84e.cpp:2839:113:   required from ‘Type objective_function<Type>::operator()() [with Type = TMBad::global::ad_aug]’
/usr/local/lib/R412/lib/R/library/TMB/include/tmb_core.hpp:1277:7:   required from here
/usr/local/lib/R412/lib/R/library/RcppEigen/include/Eigen/src/Core/util/StaticAssert.h:33:40: error: static assertion failed: YOU_MIXED_DIFFERENT_NUMERIC_TYPES__YOU_NEED_TO_USE_THE_CAST_METHOD_OF_MATRIXBASE_TO_CAST_NUMERIC_TYPES_EXPLICITLY
     #define EIGEN_STATIC_ASSERT(X,MSG) static_assert(X,#MSG);
                                        ^
/usr/local/lib/R412/lib/R/library/RcppEigen/include/Eigen/src/Core/util/XprHelper.h:851:3: note: in expansion of macro ‘EIGEN_STATIC_ASSERT’
   EIGEN_STATIC_ASSERT((Eigen::internal::has_ReturnType<ScalarBinaryOpTraits<LHS, RHS,BINOP> >::value), \
   ^~~~~~~~~~~~~~~~~~~
/usr/local/lib/R412/lib/R/library/RcppEigen/include/Eigen/src/Core/AssignEvaluator.h:888:3: note: in expansion of macro ‘EIGEN_CHECK_BINARY_COMPATIBILIY’
   EIGEN_CHECK_BINARY_COMPATIBILIY(Func,typename ActualDstTypeCleaned::Scalar,typename Src::Scalar);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/lib/R412/lib/R/library/RcppEigen/include/Eigen/src/Core/AssignEvaluator.h: In instantiation of ‘void Eigen::internal::call_assignment_no_alias(Dst&, const Src&, const Func&) [with Dst = Eigen::Array<double, -1, 1>; Src = Eigen::Array<int, -1, 1>; Func = Eigen::internal::assign_op<double, int>]’:
/usr/local/lib/R412/lib/R/library/RcppEigen/include/Eigen/src/Core/PlainObjectBase.h:797:41:   required from ‘Derived& Eigen::PlainObjectBase<Derived>::_set_noalias(const Eigen::DenseBase<OtherDerived>&) [with OtherDerived = Eigen::Array<int, -1, 1>; Derived = Eigen::Array<double, -1, 1>]’
/usr/local/lib/R412/lib/R/library/RcppEigen/include/Eigen/src/Core/PlainObjectBase.h:883:7:   required from ‘void Eigen::PlainObjectBase<Derived>::_init1(const Eigen::DenseBase<ElseDerived>&) [with T = tmbutils::vector<int>; OtherDerived = Eigen::Array<int, -1, 1>; Derived = Eigen::Array<double, -1, 1>]’
/usr/local/lib/R412/lib/R/library/RcppEigen/include/Eigen/src/Core/Array.h:210:31:   required from ‘Eigen::Array<_Scalar, _Rows, _Cols, _Options, _MaxRows, _MaxCols>::Array(const T&) [with T = tmbutils::vector<int>; _Scalar = double; int _Rows = -1; int _Cols = 1; int _Options = 0; int _MaxRows = -1; int _MaxCols = 1]’
/usr/local/lib/R412/lib/R/library/TMB/include/tmbutils/vector.hpp:24:22:   required from ‘tmbutils::vector<Type>::vector(T1) [with T1 = tmbutils::vector<int>; Type = double]’
/tmp/RtmpXUmzQa/g30_12_1_999_tmb1_9_14_65bb539f4b0bd1a003e890dd3949535125d9b84e.cpp:2839:113:   required from ‘Type objective_function<Type>::operator()() [with Type = double]’
/usr/local/lib/R412/lib/R/library/TMB/include/tmb_core.hpp:2031:7:   required from here
/usr/local/lib/R412/lib/R/library/RcppEigen/include/Eigen/src/Core/util/StaticAssert.h:33:40: error: static assertion failed: YOU_MIXED_DIFFERENT_NUMERIC_TYPES__YOU_NEED_TO_USE_THE_CAST_METHOD_OF_MATRIXBASE_TO_CAST_NUMERIC_TYPES_EXPLICITLY
     #define EIGEN_STATIC_ASSERT(X,MSG) static_assert(X,#MSG);
                                        ^
/usr/local/lib/R412/lib/R/library/RcppEigen/include/Eigen/src/Core/util/XprHelper.h:851:3: note: in expansion of macro ‘EIGEN_STATIC_ASSERT’
   EIGEN_STATIC_ASSERT((Eigen::internal::has_ReturnType<ScalarBinaryOpTraits<LHS, RHS,BINOP> >::value), \
   ^~~~~~~~~~~~~~~~~~~
/usr/local/lib/R412/lib/R/library/RcppEigen/include/Eigen/src/Core/AssignEvaluator.h:888:3: note: in expansion of macro ‘EIGEN_CHECK_BINARY_COMPATIBILIY’
   EIGEN_CHECK_BINARY_COMPATIBILIY(Func,typename ActualDstTypeCleaned::Scalar,typename Src::Scalar);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/lib/R412/lib/R/etc/Makeconf:177: recipe for target '/tmp/RtmpXUmzQa/g30_12_1_999_tmb1_9_14_65bb539f4b0bd1a003e890dd3949535125d9b84e.o' failed
make: *** [/tmp/RtmpXUmzQa/g30_12_1_999_tmb1_9_14_65bb539f4b0bd1a003e890dd3949535125d9b84e.o] Error 1
Error in (function (file, flags = "", safebounds = TRUE, safeunload = TRUE,  : 
  Compilation failed

vbartolino commented 1 week ago

wmax can easily be replaced by \$W_{max} = aL^b$ and the two parameters a and b provided but I've no idea how to express the formula by length. Any suggestion?

      measurement_f = g3_formula(
            ((wgt/wmax - 0.5) * 100 - 4.44) /
            (5693 * (length/wgt)^0.5),
            wmax = g3_parameterized("wmax", by_stock = TRUE), # TODO: by_length
            end = NULL ),

lentinj commented 1 week ago

This should do the trick, or am I missing the point?

      measurement_f = g3_formula(
            ((wgt/(wmax.a * length^wmax.b) - 0.5) * 100 - 4.44) / (5693 * (length/wgt)^0.5),
            wmax.a = g3_parameterized("wmax.a", by_stock = TRUE),
            wmax.b = g3_parameterized("wmax.b", by_stock = TRUE),
            end = NULL ),

lentinj commented 1 week ago

It complains when there are individuals in the same year,step,age with different lengths (ie, year 2019, step 2, age 9)

I think the complaint is due to the mean column containing integer values, which isn't what the rest of the code expects. https://github.com/gadget-framework/gadget3/commit/9fd12ee2cf8d8c79b57680d7c66ab8673ac0c2f9 converts on the way in, so this error should go away, with crossed fingers.

Repeated rows should be fine.

lentinj commented 4 days ago

@vbartolino I've just pushed 63196a1, which adds support for catch distribution, basically add predstocks = list(hunt) and it will use the catch data instead of abundance to weight the results. I've reworked the documentation to be a bit more useful too.

lentinj commented 2 days ago

@vbartolino I've pushed 0388651, which should be largely invisible. It vectorizes model prediction calculations where possible. The only real difference should be a speed gain, but if you have problems let me know!

I think the only thing left to think about here is reporting, and what's most useful by default, but that might be best left for Iceland.

vbartolino commented 2 days ago

thanks Jamie for the update. Last week I've been playing with the seal model including the blubber thickness likelihood, weight loss and only otherfood for weight gain. The g3 model compiles and optimises. Despite we need to decide on the type of reporting, it requires some thinking what should be estimated in a model with food-dependent growth (I've no experience as this is not done in stock assessment and it hasn't been tried very much with g2). I'm convincing myself that differently from previous multispecies models I've seen/parametrised, when growth depends on consumption and we've a dataset like blubber thickness to fit we may need to be ready to give some flexibility to the consumption-related parameters and even estimate some in the model. At the moment I'm just exploring and I'll try to have a clearer example asap

gadget-framework / gadget3

Blubber thickness likelihood component #175