Open MikkoVihtakari opened 1 year ago
over
describes which dimensions are included when forming the total that we're taking the ratio of. So, assuming stock
has age
, area
and length
. g3l_distribution_sumofsquares(over = c('area'))
will do approximately stock__num[age, area, length] / sum(stock_num[, area, ])
. g3l_distribution_sumofsquares(over = c('area', 'length'))
will do approximately stock__num[age, area, length] / sum(stock_num[, area, length])
(the reality is a bit different since we generally work on whole lengthgroups in one go).
I don't quite follow your second paragraph. There isn't a way of reconstituting over
from the model output though. However, I'm guessing what you really want is the value of the sum denominator used reported as we do the obs/model numbers, yes? That's not impossible, but would require a reasonable amount of re-jigging.
I would need it for this function: https://github.com/gadget-framework/gadgetplots/blob/master/R/plot_catchdist.R
It currently compares wrong things if over
is used.
So, if I'm understanding this correctly the nub of the issue is here:-
observed = .data$observed / sum(.data$observed, na.rm = TRUE),
predicted = .data$predicted / sum(.data$predicted, na.rm = TRUE),
gadgetutils::g3_fit
() is assuming anything called "ldist" is using gadget3::g3l_distribution_sumofsquares
(), and semi-replicating it to get the constituent parts, which aren't reported separately. However, it doesn't replicate the over
option (and couldn't at present without you feeding this information to g3_fit
() yourself).
Broad ways to fix this:-
g3_fit
gains it's own over
optiong3l_distribution
for the total values used within sumofsquaresover
option out to g3l_distribution
(), and the existing cdist_sumofsquares_ldist_lln_model__num
report becomes the proportion, if used, and absolute values are no longer reported.cdist_sumofsquaresarealength_ldist_lln_model__num
(1) initially seems most sensible, but I'm loath to add yet more reports (and associated memory costs).
(2) is interesting, and would be the most flexible long-term option. However, it isn't easily backwards-compatible, and would mean that g3_fit
() couldn't report both "observed" (the proportion) and "obs" (the absolute value). No idea if both are useful, but I can't find any instances of the latter being used.
(3) feels like cheating, and results in yet longer variable names, but involves a lot less refactoring and is still useful information even if we then do (1) or (2).
Any thoughts @willbutler42 ?
@lentinj My gut feeling would be to go for option 3 and then calculate the 'over' proportion in g3_fit. This seems like the simplest fix and, as you say, will not get in the way of (1) or (2) if we decide to implement those in the future
@willbutler42 What do you think (1) vs. (2)? It does seem like proportion wants to be a separate setting to whether you use sumofsquares, etc. but I'm a bit worried that only reporting absolute values or proportions will cause problems at some point in the future.
That said, if you're desperate for both you could gather the absolute values separately.
@lentinj Had a chat with Bjarki and we reckon option (2) reporting the modelled and observed proportions is the way to go. Can´t really think of any particular need for the absolute values, and , as you say, they would still be available to gather if required
Okay, so the (fairly fiddly but not impossible) changes needed would be:-
likelihood_data
() model setup (saving some work during model runs, and so the model input observations and report output are directly comparable)if (done_aggregating_f) {
to apply proportion (if required), i.e. the work dist_prop
() is currently doing within sumsofsquaresdist_prop
() from g3l_distribution_sumofsquares
(), although it still needs to set a default for over
so I think that option stays and just gets fed back to g3l_likelihood
().cdist_sumofsquares_proparealength_ldist_lln_model__numprop
, duplicate existing code paths for __num
& __wgt
.Obviously once this is done then g3_fit
will break as it can't find the reports it needs any more, although with the changed names it could potentially look for both to ease the transition.
I didn't have time trying to understand this issue until now. Stockdist already summarises by length (and age) group: https://github.com/gadget-framework/gadgetutils/blob/dfa61d7451835575515ea5120c3a516c9c4a73a7/R/g3_fit.R#L93
As long as likelihood components do not contain unnecessary information (i.e. age data when age grouping is not desired such as sex and maturity proportions), the g3_fit() aggregation should work correctly, given that I understand this correct.
What I find puzzling, though, is that the model results (black lines) fit so poorly to the data (grey lines) even though the fleet-suitabilities do not indicate anything out of the ordinary (figure here: https://pasteboard.co/dsQMP9Vd4BIq.png).
Are you sure that predicted values from this script are correct model values when using the over argument?
I might have partly fixed this issue with this: https://github.com/gadget-framework/gadgetplots/commit/d42ea7ac28667f72dd38d91d85d30c7a6331aa9c
stockdist and catchdist already contain the required information, but we do not know whether the over
argument has been used. As a shortcut I made the function assume that it has if the information are available. For instance, if there are values in the stock column, the plots are split by stocks.
Will this be enough or should we get the variables used in the over argument?
I do
Where
I understand that gadget compares the ratio of
unique(TrawlNor_sexratio$sex)
for each length bin & area combination. When I get my model object, how can I extract the variables used in theg3l_distribution_sumofsquares(over)
argument?I need these for plotting as currently we compare variables across all length groups, not for each length group separately. A bit of a poor example perhaps, but think that there is age in the comparison too. It's these fragmented groups that are compared if I have understood correct?