[Feature]: Sum to 1 constraint on age comps internally or externally?

ChristineStawitz-NOAA commented 1 year ago

Is your feature request related to a problem? Please describe.

The Fleet module has a number of derived quantity vectors:

catch_at_age
expected_catch
age_composition
catch_numbers_at_age

The Population module also has the following vectors:

catch_weight_at_age
catch_numbers_at_age

Where the Population vectors are just their respective Fleet counterparts, summed over all fleets.

We implemented the Fleet NLL (with @Andrea-Havron-NOAA and @msupernaw ) such that catch_numbers_at_age is assumed to be of double or integer types, we sum-to-one within the NLL function, and we never use the age_composition vector.

I have a couple of questions..

@msupernaw do you think there's any performance hit from doing the sum-to-one as part of the NLL?
Does the stock synthesis team (@iantaylor-NOAA @k-doering-NOAA @kellijohnson-NOAA @Rick-Methot-NOAA ) have any opinions on below options based on experience with supporting internal sum-to-one?

Describe the solution you would like.

Delete age_composition since it is never used
rename expected_catch to catch_weight_at_age to improve clarity & consistency
Delete catch_at_age since it is never used

Describe alternatives you have considered

We could do the sum-to-one within the Fleet or Population modules, store it in age_composition, and then use age_composition in the negative log likelihood
We could delete age_composition and do the sum-to-one within the R input functions, such that we know catch_numbers_at_age always sums to one on input

Statistical validity, if applicable

Multinomial distribution doesn't work if the composition data don't sum to one

Describe if this is needed for a management application

No response

Additional context

No response

iantaylor-NOAA commented 1 year ago

A few reasons that occur to me on why it might be better to rescale internally are

you may want to echo the data as input (without rescaling to sum to 1) as confirmation to the user that it was read correctly (and it can be useful to keep input comp data in numbers of fish, not fractions that add to 1.0--see recent discussion at https://github.com/pfmc-assessments/nwfscSurvey/pull/117)
you may want to add an small constant to the observe and expected proportions to avoid zero values and I think that would be better applied internally

Rick-Methot-NOAA commented 1 year ago

I concur with the consolidation within Fleet to just catch_numbers_at_age and catch_weight_at_age, although catch_weight_at_age is numbers * bodywt and need not be stored as an entity. Once you get to length-based models, fleet-specific bodywt will not be an input data type; instead it will be derived from growth and length-selectivity. I also do not see the value in summing catch_at_age across fleets. There are no logL components that depend on it. It is for reporting only. in SS3, sum to zero happens to the data as first step after it is read; per Ian's comment. For the expected composition, the sum to zero is applied one step before passing the vector to the logL fxn. I think the composition vectors must be double, not integer. There are many reasons why the raw composition obs will get weighted in a way that produces fractions of fish.

msupernaw commented 1 year ago

I don't think we'll notice a performance hit for doing the sum-to-one in the NLL.

msupernaw commented 1 year ago

We can always profile it in MQ, but typically it's the life history methods that consume the most CPU cycles.

ChristineStawitz-NOAA commented 1 year ago

From chatting with @Andrea-Havron-NOAA - In TMB Dmultinom the expected values (p) must sum to one and the observations must be integers (x).

Rick-Methot-NOAA commented 1 year ago

obs must be integers seems very limiting. What is rationale? As I noted earlier, the output of raw data processing programs will weight the raw (integer) observations by various scaling factors as the overall composition of a fleet's catch in a season is assembled. Final output of that process can have fractional fish.

ChristineStawitz-NOAA commented 1 year ago

Sorry for lack of explanation, the above was a note I took during our team meeting. The multinomial distribution in TMB is a copy of the R multinomial distribution, neither of which allow non-integer values because the pdf will only integrate to 1 with the normalizing constant when there are integer values.

This doesn't mean we can't find a way to deal with non-integer data; there are transformation functions in the literature that @Andrea-Havron-NOAA is looking into and will update here, and we can also look into what other TMB-based assessment models do.

msupernaw commented 1 year ago

@ChristineStawitz-NOAA I don't think there is any performance hit for summing in the NLL. This calculation will be on the tape and require the same number of operations regardless of where it is calculated.

timjmiller commented 1 year ago

note that TMB requires that data are not "transformed" in any way on the C++ side for the osa residual machinery to work correctly.

NOAA-FIMS / FIMS