abundance distribution nll's equal to zero

willbutler42 commented 2 years ago

If the year range of the obs_data argument to g3l_abundancedistribution does not match the model's year range (specified in g3a_time), the nll for the abundancedistribution component is zero.

For example, in demo-ling, fitting a model with year_range <- 1982:2020 to data from 1982:2021 (demo_ling_read_data <- setdefault("demo_ling_read_data", FALSE)) returns nll's of zero for abundance distributions. But if the same model is fit to data from 1982:2020 (demo_ling_read_data <- setdefault("demo_ling_read_data", TRUE)), reasonable nll's are returned.

lentinj commented 2 years ago

Does this produce the same if you e.g. set year_range <- 1982:2010 and demo_ling_read_data <- FALSE? Do the model and observation arrays look as you'd expect in both cases?

The obvious answer would be that the nll array is bigger than the time period for the model, and the final parts never get initialised. But it's not the case. The nll array should match the number of model steps, not the input data. And the extra year should get unused, but I'm missing something somewhere.

bthe commented 1 year ago

This issue seems to pop up as well when retro_years > 0, and seems to stem from the time of the evaluation being hard coded by the data (see https://github.com/gadget-framework/gadget3/blob/master/R/likelihood_distribution.R#L106 ). Any ideas on how circumvent this?

lentinj commented 1 year ago

https://github.com/gadget-framework/gadget3/blob/f7585beda4e12d0e854c6abe4c7e315d62b20e61/R/likelihood_distribution.R#L106

I'm guessing that needs to be cur_step_final || modelstock__time_idx != modelstock__max_time_idx, but there could be a sublety to this I've forgotten right now.

bthe commented 1 year ago

Doing

cur_year != (end_year - retro_years) && modelstock__time_idx != modelstock__max_time_idx

seems to do the job.

bthe commented 1 year ago

Well, this doesn't quite do the job. When model output is passed to the linear regression the full observation matrix is compared to the reduced model output. Any ideas on how fix this?

lentinj commented 1 year ago

Hrm, yes. It's obvious now you point it out :)

Ideally we need to subset model/obs by the the time periods that actually happened, rather than just "all time". Unfortunately TMB doesn't let you do the equivalent of __obs[time = 1:10]. Implementing .col(from, to) wouldn't be very hard though, but depends on how receptive they are to the idea.

Assuming we can do this, forming the from/to indexes doesn't have an obvious solution, but could be bodged relatively quickly at least.

lentinj commented 1 year ago

The other way you might think about this is resizing the arrays to be the intersection of the data and the model run time. Unfortunately that's a bigger TMB no-no.

Side note, TMB has just gained TMB_UPDATE(), allowing data to vary without having to re-tape (~rebuild) the objective function. However, it carries a big warning not to re-dimension the data, which makes it not very useful for our purposes.

lentinj commented 1 year ago

Reverted https://github.com/gadget-framework/gadget3/commit/a593a5fe445cc40a044101b7ce44bcc7f7e49f70 since it doesn't really help and breaks tests.

Once the above is done, then there's probably a better way of doing similar.

lentinj commented 1 year ago

Entirely different way of thinking about this: The problem is actually that the model array has 0 for retro'ed out years, and it should be NaN. Even ignoring surveyindices this seems like it'd be a sensible change, useful for gadgetplots etc. to tell the difference between no fish and nothing happened.

Then the problem comes a lot easier, we do something like the above, then filter NaN when calculating the mean in surveyindices_linreg().

lentinj commented 6 months ago

Another problem with surveyindices that cropped up with #142 is that the parameter reporting isn't broken down:

https://github.com/gadget-framework/gadget3/blob/7d433b75c029af8a787c23e54832b2ac5d580c33/tests/test-likelihood_distribution-surveyindices.R#L52-L55

Whilst the end result is fine, the report is missing an age dimension (in this case).

We need to:

[ ] Init modelstock__x to NA, extra step that does modelstock__x[time = cur_step] <- 0 (this wouldn't handle e.g. areas we don't compare, but that's probably okay). Will also need to worry about cur_step == 1 if aggregating by year.
[ ] Give surveyindices a special comparison step at the end of the model run, along with reporting
[ ] Filter NAs in surveyindices_linreg(). Doing this in a CppAD-compatible fashion will be interesting, possibly NaN propogation, isNAN & select
[ ] Generate 2 arrays for alpha & beta, broken down by modelstock - time.

EDIT: Whilst in general an if on the aggregated model output is a big no-no, in this case the positions of the NaNs will be static within a tape run. This makes it less scary.

gadget-framework / gadget3

abundance distribution nll's equal to zero #55