Handling missing group sizes in detection function fitting vs density estimation

ericmkeen commented 7 months ago

ABUND's approach: In the event that a sighting occurs during systematic effort but no valid school size estimate is given for that sighting, the ABUND default is to assign a school size of 1. This happens for a handful of sightings in CNP 1986-2020.

LTabundR's approach: In LTabundR, we implemented a new approach that we may want to revise. Currently, the approach is this: (1) sightings with missing school sizes are flagged and excluded from detection function fitting when LnTotSS is specified as a covariate; (2) during abundance estimation, those sightings are given the average school size for their respective survey.

This is probably the wrong way to go. It is problematic to estimate abundance with a detection function that does not include the sightings used to estimate abundance. But it would also be problematic to exclude the sightings from abundance estimation simply because they were missing data needed to be included in the detection function model.

A better option may be to interpolate or otherwise infer the missing school sizes before detection function fitting, so that those sightings can be included.
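
One way to picture "inferring missing school size before fitting" is the minimal Python sketch below. It is only an illustration, not LTabundR code (the package is written in R): the dictionary keys `survey`, `best`, `ss_valid`, and `imputed`, and the function name `fill_missing_group_size`, are hypothetical, and the survey mean is just one possible fill rule.

```python
from statistics import mean


def fill_missing_group_size(sightings):
    """Return copies of the sightings with invalid group sizes replaced
    by the mean of the valid group sizes from the same survey.

    Each sighting is a dict with hypothetical keys:
    'survey', 'best' (group size), and 'ss_valid' (bool).
    """
    # Collect the valid group sizes for each survey.
    valid_by_survey = {}
    for s in sightings:
        if s["ss_valid"]:
            valid_by_survey.setdefault(s["survey"], []).append(s["best"])

    filled = []
    for s in sightings:
        row = dict(s)  # copy, so the input is not modified
        if not row["ss_valid"] and row["survey"] in valid_by_survey:
            row["best"] = mean(valid_by_survey[row["survey"]])
            row["imputed"] = True   # keep a record of which values were filled
        else:
            row["imputed"] = False
        filled.append(row)
    return filled


sightings = [
    {"survey": "A", "best": 2, "ss_valid": True},
    {"survey": "A", "best": 4, "ss_valid": True},
    {"survey": "A", "best": None, "ss_valid": False},
]
filled = fill_missing_group_size(sightings)
```

Because the filled value exists before fitting, the same sighting would enter both the detection function and the abundance estimate, which is the consistency the comment above asks for.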

amandalbradford commented 7 months ago

Fortunately, such sightings have not yet been at play for abundance estimation, but it would be good to work through the best approach. One point of clarification: does the average group size assignment in step 2 override the average ESW assignment described in Issue #11?

amandalbradford commented 7 months ago

@ericmkeen - one other clarifying question. For the sightings column ss_valid, which indicates whether a valid best estimate is available, you previously mentioned: "The current system is: If the best is not available, the low estimate is used. If the low is not available either, best is coerced to 1. Based on our notes here it sounds like we want to keep this system and simply update the data (or use coded edits) if we have specific sightings we wish to correct." That sounds good, with the clarifying question that if the low estimate or coerced value of 1 is used, does ss_valid remain FALSE? Is that how those sightings get triggered for step 1 above? Trying to evaluate how this has all come together.

ericmkeen commented 4 months ago

@amandalbradford: to answer your question, "if the low estimate or coerced value of 1 is used, does ss_valid remain FALSE?": yes, ss_valid is FALSE in the case when the low estimate needs to be used as the best estimate AND in the case that the best estimate value is coerced to 1.

ericmkeen commented 4 months ago

@amandalbradford: a proposed solution to this issue as well as issue #11 (a similar question about how to handle missing Bft values), issue #8 (similar question about mixed species sightings with no percentages), and issue #9 (similar question about missing Bft values and impacts on group size calibration):

(1) Within the lta() function, let's not perform any interpolation of any missing values. If rows have missing data for columns that are being used as covariates in the detection function, then those rows are removed and the function presses on. So, if LnSsTot is a candidate covariate, any sighting with ss_valid==FALSE is removed from both detection function fitting and abundance estimation; if Bft is a covariate, any sighting with Bft==NA is removed from both df fitting and abundance estimation. This is even the case for sightings from the focal year of interest for the abundance estimate.

(2) We add a function (working title lta_checks()) that lets you quickly check for missing data in sightings from your focal year. The function can use the same input lists df_settings, fit_filters, and estimates that are provided to the lta() function. It tells the analyst which sightings have missing data, which allows the analyst to prepare coded edits that fill in gaps within the cruz object before they run lta(). This gives the analyst full discretion for how to fill in missing values (e.g., interpolation or some other solution of their own choice); the vignette could then provide examples for how to do this for missing group size estimates and missing Bft values.
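
The two steps above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the LTabundR implementation (which is R): sightings are plain dicts with hypothetical keys `id`, `year`, `Bft`, and `ss_valid`; `None` stands in for `NA`; and the helper names are made up for the example. The covariate names `LnSsTot` and `Bft` are taken from the comment above.

```python
def missing_covariates(sighting, covariates):
    """List the candidate covariates for which this sighting has no usable value."""
    missing = []
    for cov in covariates:
        if cov == "LnSsTot":
            # Group size counts as missing whenever ss_valid is FALSE.
            if not sighting.get("ss_valid", False):
                missing.append(cov)
        elif sighting.get(cov) is None:
            missing.append(cov)
    return missing


def drop_incomplete(sightings, covariates):
    """Step (1): no interpolation; rows with missing covariate data are removed."""
    return [s for s in sightings if not missing_covariates(s, covariates)]


def check_focal_year(sightings, covariates, focal_year):
    """Step (2): report focal-year sightings that step (1) would remove."""
    report = {}
    for s in sightings:
        if s["year"] == focal_year:
            missing = missing_covariates(s, covariates)
            if missing:
                report[s["id"]] = missing
    return report


sightings = [
    {"id": 1, "year": 2020, "Bft": 3, "ss_valid": True},
    {"id": 2, "year": 2020, "Bft": None, "ss_valid": True},
    {"id": 3, "year": 2017, "Bft": 2, "ss_valid": False},
]
covariates = ["LnSsTot", "Bft"]

kept = drop_incomplete(sightings, covariates)
report = check_focal_year(sightings, covariates, focal_year=2020)
```

Here the report would point the analyst at sighting 2 (missing `Bft`) before the run, so a coded edit can fill the gap instead of the sighting being dropped silently.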

I think this solution will simplify the lta() code (and code in other functions too) and help users feel more in control over how missing data are handled.

What do you think?

amandalbradford commented 3 months ago

@ericmkeen - I agree that your proposed solution is the way to go. We are most worried about treatment of sightings in our focal survey/year, as opposed to "imperfect" sightings from previous surveys/years that could join the sightings pool. I like the check function and allowing the user to specify their own correction method. The user will have to be careful to track the potential for "missing" incomplete sightings, but we can make this clear in the vignette. Thank you!

ericmkeen commented 3 months ago

I have implemented this change. In the process I have improved/streamlined the code for determining whether or not a sighting has ss_valid = TRUE or ss_valid = FALSE. Changes were made in process_sightings(), group_size(), and group_size_calibration().

To make sure the method for assigning the ss_valid status is clear, here is an outline of the workflow for determining it:

For each sighting, loop through each observer's estimate of group size:

- Optional: apply a calibration adjustment to the observer's best estimate; if any part of the calibration adjustment process fails, the original estimates are used (observer ss_valid is still TRUE for the observer-sighting).
- If the observer's best estimate is NA or less than 0, the low estimate is used as the best estimate and the observer's ss_valid becomes FALSE.
- If the low estimate is NA or less than 0, the best estimate is coerced to 1 and the observer's ss_valid remains FALSE.
- After all observer estimates have been processed, filter to estimates where observer ss_valid is TRUE. If at least one observer estimate remains, the sighting's overall ss_valid status is TRUE; if no estimates remain, the overall ss_valid status becomes FALSE and the best size estimate becomes NA.
- If ss_valid is TRUE after this filter, the remaining observer estimates are used to find the geometric mean (or geometric weighted mean) estimate of group size. If that mean estimate is less than 0 or NA for any reason, overall ss_valid becomes FALSE and the geometric (weighted) mean of the raw low estimates is used. If the low estimate has to be used and it is less than 0 or NA, the best estimate becomes 1.0 and overall ss_valid remains FALSE.
- For mixed-species groups, if any species is missing a valid percentage estimate from any of the observers, their best estimate becomes NA and their ss_valid status becomes FALSE.
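
The core of this workflow can be sketched in a few lines. The Python below is a simplified illustration, not the code in process_sightings() or group_size() (those are R functions): it leaves out calibration, weighting, the final fallback to low estimates, and mixed-species handling, and the function names and the use of `None` for `NA` are assumptions of the sketch.

```python
import math


def is_usable(x):
    """An estimate is unusable if it is NA (None/NaN here) or less than 0."""
    return x is not None and not math.isnan(x) and x >= 0


def observer_estimate(best, low):
    """Return (estimate, ss_valid) for one observer's estimate of one sighting."""
    if is_usable(best):
        return best, True
    if is_usable(low):
        return low, False   # low stands in for best; the estimate is flagged
    return 1.0, False       # nothing usable: coerce to 1, still flagged


def geometric_mean(values):
    """Unweighted geometric mean of non-negative values."""
    return math.prod(values) ** (1.0 / len(values))


def sighting_estimate(observers):
    """observers: list of (best, low) pairs. Returns (group_size, ss_valid)."""
    per_observer = [observer_estimate(best, low) for best, low in observers]
    # Keep only the estimates whose observer-level ss_valid is TRUE.
    valid = [est for est, ok in per_observer if ok]
    if not valid:
        return None, False  # no valid estimate remains: size is NA
    return geometric_mean(valid), True


two_valid = sighting_estimate([(4, 2), (9, 5)])
one_valid = sighting_estimate([(None, 3), (8, 4)])
none_valid = sighting_estimate([(None, 3)])
```

In this sketch, an observer who falls back to the low estimate does not contribute to the sighting-level mean, so a sighting is valid only if at least one observer supplied a usable best estimate.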

ericmkeen commented 3 months ago

Working on lta_checks() now; once all changes are made I will test the workflow with the WHICEAS analysis.

amandalbradford commented 3 months ago

Hi @ericmkeen - this looks good, but I want to discuss the last bullet. I don't know if we want to totally discount observer estimates for mixed-species groups if they don't include a percentage. Sometimes observers have a total group size, but they don't have a good feel for percentages by species. Since the process has been to average the best estimates, then average the percentages, and then apply the proportions to the estimates, I still think their estimates could be used. The issue here comes when NO observers provide percentages. What do you think?

ericmkeen commented 3 months ago

Thanks @amandalbradford, I think my language was confusing on that last bullet point, so here's an attempt at clarification:

- If multiple percentage estimates are available for a species in a mixed-species group, the average of those percentage estimates will be used.
- If only a single percentage estimate is available for a species in a mixed-species group, that single percentage estimate will be used (ss_valid will not be changed).
- If no percentage estimate is available for a species in a mixed-species group, then the best estimate for that species will become NA and ss_valid will become FALSE if it is not already.
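
These three rules can be expressed as a short Python sketch. It is an illustration only, not LTabundR's R code: the function name `species_estimates`, the argument layout, and the use of `None` for a missing percentage are all assumptions, and the total group size is taken as already averaged across observers.

```python
from statistics import mean


def species_estimates(total_best, ss_valid, percentages):
    """Split a mixed-species group size among species.

    total_best:  averaged best estimate for the whole group
    ss_valid:    the sighting's ss_valid status before this step
    percentages: dict mapping species -> list of observer percentage
                 estimates (0-100), with None where an observer gave none
    Returns a dict mapping species -> (best_estimate, ss_valid).
    """
    result = {}
    for species, pcts in percentages.items():
        available = [p for p in pcts if p is not None]
        if available:
            # One estimate is used as-is; several are averaged.
            # ss_valid is passed through unchanged.
            result[species] = (total_best * mean(available) / 100.0, ss_valid)
        else:
            # No percentage from any observer: flag the species.
            result[species] = (None, False)
    return result


split = species_estimates(
    100.0,
    True,
    {"A": [60, 80], "B": [30, None], "C": [None, None]},
)
```

Species "A" uses the averaged percentage, "B" uses its single estimate, and "C" is flagged rather than removed.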

Does this clarify? Very possible I am still confused!

amandalbradford commented 3 months ago

Thanks @ericmkeen - that's great and reflects how we've handled percentages in the past, while offering a better way of flagging sightings without them (ABUND used to simply remove them).

ericmkeen commented 3 months ago

Sounds good! I implemented these changes and re-ran the WHICEAS analysis to look for any bugs and discrepancies, and everything looks good. Closing this issue.

PIFSC-Protected-Species-Division / LTabundR

Handling missing group sizes in detection function fitting vs density estimation #10