Are variables used in g3l_distribution_sumofsquares(over) visible in the output somewhere?

MikkoVihtakari commented 1 year ago

I do

g3l_catchdistribution(
    'TrawlNor_sexdist',
    TrawlNor_sexratio %>% rename(stock_re = sex),
    fleets = list(TrawlNor),
    stocks = stocks,
    area_group = c(all = 1),
    g3l_distribution_sumofsquares(over = c('area', 'length')))

Where

TrawlNor_sexratio %>% head()
  year step area    sex length number
1 1981    1  all female  len51      5
2 1981    1  all female  len56      4
3 1981    1  all female  len61     19
4 1981    1  all female  len66     12
5 1981    1  all female  len71     18
6 1981    1  all female  len76     12

I understand that gadget compares the ratio of unique(TrawlNor_sexratio$sex) for each length bin & area combination. When I get my model object, how can I extract the variables used in the g3l_distribution_sumofsquares(over) argument?

I need these for plotting as currently we compare variables across all length groups, not for each length group separately. A bit of a poor example perhaps, but think that there is age in the comparison too. It's these fragmented groups that are compared if I have understood correct?

lentinj commented 1 year ago

over describes which dimensions are included when forming the total that we're taking the ratio of. So, assuming stock has age, area and length. g3l_distribution_sumofsquares(over = c('area')) will do approximately stock__num[age, area, length] / sum(stock_num[, area, ]). g3l_distribution_sumofsquares(over = c('area', 'length')) will do approximately stock__num[age, area, length] / sum(stock_num[, area, length]) (the reality is a bit different since we generally work on whole lengthgroups in one go).

I don't quite follow your second paragraph. There isn't a way of reconstituting over from the model output though. However, I'm guessing what you really want is the value of the sum denominator used reported as we do the obs/model numbers, yes? That's not impossible, but would require a reasonable amount of re-jigging.

MikkoVihtakari commented 1 year ago

I would need it for this function: https://github.com/gadget-framework/gadgetplots/blob/master/R/plot_catchdist.R

It currently compares wrong things if over is used.

lentinj commented 1 year ago

So, if I'm understanding this correctly the nub of the issue is here:-

https://github.com/gadget-framework/gadgetutils/blob/dfa61d7451835575515ea5120c3a516c9c4a73a7/R/g3_fit.R#L117-L118:

                    observed = .data$observed / sum(.data$observed, na.rm = TRUE),
                    predicted = .data$predicted / sum(.data$predicted, na.rm = TRUE),

gadgetutils::g3_fit() is assuming anything called "ldist" is using gadget3::g3l_distribution_sumofsquares(), and semi-replicating it to get the constituent parts, which aren't reported separately. However, it doesn't replicate the over option (and couldn't at present without you feeding this information to g3_fit() yourself).

Broad ways to fix this:-

g3_fit gains it's own over option
An extra report out of g3l_distribution for the total values used within sumofsquares
Moving the over option out to g3l_distribution(), and the existing cdist_sumofsquares_ldist_lln_model__num report becomes the proportion, if used, and absolute values are no longer reported.
Report names that also include over, say cdist_sumofsquaresarealength_ldist_lln_model__num

(1) initially seems most sensible, but I'm loath to add yet more reports (and associated memory costs).

(2) is interesting, and would be the most flexible long-term option. However, it isn't easily backwards-compatible, and would mean that g3_fit() couldn't report both "observed" (the proportion) and "obs" (the absolute value). No idea if both are useful, but I can't find any instances of the latter being used.

(3) feels like cheating, and results in yet longer variable names, but involves a lot less refactoring and is still useful information even if we then do (1) or (2).

Any thoughts @willbutler42 ?

willbutler42 commented 1 year ago

@lentinj My gut feeling would be to go for option 3 and then calculate the 'over' proportion in g3_fit. This seems like the simplest fix and, as you say, will not get in the way of (1) or (2) if we decide to implement those in the future

lentinj commented 1 year ago

@willbutler42 What do you think (1) vs. (2)? It does seem like proportion wants to be a separate setting to whether you use sumofsquares, etc. but I'm a bit worried that only reporting absolute values or proportions will cause problems at some point in the future.

That said, if you're desperate for both you could gather the absolute values separately.

willbutler42 commented 1 year ago

@lentinj Had a chat with Bjarki and we reckon option (2) reporting the modelled and observed proportions is the way to go. Can´t really think of any particular need for the absolute values, and , as you say, they would still be available to gather if required

lentinj commented 1 year ago

Okay, so the (fairly fiddly but not impossible) changes needed would be:-

Apply proportion to obs in likelihood_data() model setup (saving some work during model runs, and so the model input observations and report output are directly comparable)
Extra step within if (done_aggregating_f) { to apply proportion (if required), i.e. the work dist_prop() is currently doing within sumsofsquares
Remove dist_prop() from g3l_distribution_sumofsquares(), although it still needs to set a default for over so I think that option stays and just gets fed back to g3l_likelihood().
Rename variables to avoid mishap, e.g. cdist_sumofsquares_proparealength_ldist_lln_model__numprop, duplicate existing code paths for __num & __wgt.

Obviously once this is done then g3_fit will break as it can't find the reports it needs any more, although with the changed names it could potentially look for both to ease the transition.

MikkoVihtakari commented 1 year ago

I didn't have time trying to understand this issue until now. Stockdist already summarises by length (and age) group: https://github.com/gadget-framework/gadgetutils/blob/dfa61d7451835575515ea5120c3a516c9c4a73a7/R/g3_fit.R#L93

As long as likelihood components do not contain unnecessary information (i.e. age data when age grouping is not desired such as sex and maturity proportions), the g3_fit() aggregation should work correctly, given that I understand this correct.

What I find puzzling, though, is that the model results (black lines) fit so poorly to the data (grey lines) even though the fleet-suitabilities do not indicate anything out of the ordinary (figure here: https://pasteboard.co/dsQMP9Vd4BIq.png).

Are you sure that predicted values from this script are correct model values when using the over argument?

MikkoVihtakari commented 1 year ago

I might have partly fixed this issue with this: https://github.com/gadget-framework/gadgetplots/commit/d42ea7ac28667f72dd38d91d85d30c7a6331aa9c

stockdist and catchdist already contain the required information, but we do not know whether the over argument has been used. As a shortcut I made the function assume that it has if the information are available. For instance, if there are values in the stock column, the plots are split by stocks.

Will this be enough or should we get the variables used in the over argument?

gadget-framework / gadget3

Are variables used in g3l_distribution_sumofsquares(over) visible in the output somewhere? #98