PIFSC-Protected-Species-Division / LTabundR

R package for design-based line-transect density estimation
https://pifsc-protected-species-division.github.io/LTabundR-vignette/
Creative Commons Zero v1.0 Universal

Handling missing group sizes in detection function fitting vs density estimation #10

Closed: ericmkeen closed 3 months ago

ericmkeen commented 7 months ago

ABUND's approach: In the event that a sighting occurs during systematic effort but no valid school size estimate is given for that sighting, the ABUND default is to assign a school size of 1. This happens for a handful of sightings in CNP 1986-2020.

LTabundR's approach: In LTabundR, we implemented a new approach that we may want to revise. Currently the approach is this: (1) sightings with missing school sizes are flagged and excluded from detection function fitting when LnTotSS is specified as a covariate; (2) during abundance estimation, those sightings are given the average school size for their respective survey.
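To make the two steps concrete, here is a minimal sketch in base R. It is not the LTabundR code: the data frame, its column names (`survey`, `ss_best`, `ss_valid`) and the values are made up for illustration, and the survey average is assumed to be a plain arithmetic mean of the valid school sizes.

```r
# Hypothetical sightings table; column names and values are illustrative
# only and do not reflect the structure of a real cruz object.
sightings <- data.frame(
  survey   = c("A", "A", "A", "B", "B"),
  ss_best  = c(12, NA, 8, 30, NA),
  ss_valid = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Step 1: sightings without a valid school size are left out of
# detection function fitting when school size is a covariate.
df_fit_data <- sightings[sightings$ss_valid, ]

# Step 2: for abundance estimation, each missing school size is replaced
# by the mean of the valid school sizes from the same survey
# (arithmetic mean assumed here).
survey_means <- tapply(df_fit_data$ss_best, df_fit_data$survey, mean)
abund_data <- sightings
missing <- !abund_data$ss_valid
abund_data$ss_best[missing] <-
  as.numeric(survey_means[as.character(abund_data$survey[missing])])
```

In this sketch `df_fit_data` has fewer rows than `abund_data`, which is exactly the mismatch between fitting and estimation at issue here.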

This is probably the wrong way to go. It is problematic to estimate abundance with a detection function that does not include the sightings used to estimate abundance. But it would also be problematic to exclude the sightings from abundance estimation simply because they were missing data needed to be included in the detection function model.

A better option may be somehow interpolating/inferring missing school size before detection function fitting so it can be included.

amandalbradford commented 7 months ago

Fortunately, such sightings have not yet been at play for abundance estimation, but it would be good to work through the best approach. One point of clarification: does the average group size assignment in step 2 override the average ESW assignment described in Issue #11?

amandalbradford commented 7 months ago

@ericmkeen - one other clarifying question. For the sightings column ss_valid, which indicates whether a valid best estimate is available, you previously mentioned: "The current system is: If the best is not available, the low estimate is used. If the low is not available either, best is coerced to 1. Based on our notes here it sounds like we want to keep this system and simply update the data (or use coded edits) if we have specific sightings we wish to correct." That sounds good, with the clarifying question that if the low estimate or coerced value of 1 is used, does ss_valid remain FALSE? Is that how those sightings get triggered for step 1 above? Trying to evaluate how this has all come together.

ericmkeen commented 4 months ago

@amandalbradford: to answer your question, "if the low estimate or coerced value of 1 is used, does ss_valid remain FALSE?": yes, ss_valid is FALSE in the case when the low estimate needs to be used as the best estimate AND in the case that the best estimate value is coerced to 1.
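The fallback rule as described in this exchange can be written out as a small sketch. The function name `resolve_group_size()` and its arguments are hypothetical and are not part of LTabundR; it only restates the logic quoted above for a single sighting.

```r
# Hypothetical helper (not an LTabundR function): choose the group size
# to use for one sighting and record whether it is a valid best estimate.
resolve_group_size <- function(best, low) {
  if (!is.na(best)) {
    # A best estimate is available and is used as-is.
    return(list(size = best, ss_valid = TRUE))
  }
  if (!is.na(low)) {
    # Best is unavailable: fall back to the low estimate.
    return(list(size = low, ss_valid = FALSE))
  }
  # Neither best nor low is available: coerce the size to 1.
  list(size = 1, ss_valid = FALSE)
}

with_best <- resolve_group_size(best = 25, low = 20)
low_only  <- resolve_group_size(best = NA, low = 20)
neither   <- resolve_group_size(best = NA, low = NA)
```

Only `with_best` ends up with `ss_valid` set to TRUE; the other two cases keep a usable size but stay flagged.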

ericmkeen commented 4 months ago

@amandalbradford: a proposed solution to this issue as well as issue #11 (a similar question about how to handle missing Bft values), issue #8 (similar question about mixed species sightings with no percentages), and issue #9 (similar question about missing Bft values and impacts on group size calibration):

(1) Within the lta() function, let's not perform any interpolation of any missing values. If rows have missing data for columns that are being used as covariates in the detection function, then those rows are removed and the function presses on. So, if LnSsTot is a candidate covariate, any sighting with ss_valid==FALSE is removed from both detection function fitting and abundance estimation; if Bft is a covariate, any sighting with Bft==NA is removed from both df fitting and abundance estimation. This is even the case for sightings from the focal year of interest for the abundance estimate.
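A rough sketch of that filtering rule, in base R. The helper `drop_incomplete()` and the example table are hypothetical; the real lta() internals may look quite different. The point is only that a single reduced table would feed both detection function fitting and abundance estimation.

```r
# Hypothetical sightings table with made-up values.
sightings <- data.frame(
  id       = 1:5,
  ss_valid = c(TRUE, FALSE, TRUE, TRUE, TRUE),
  Bft      = c(2, 3, NA, 4, 1)
)

# Hypothetical helper: keep only the sightings that are complete for the
# covariates being considered. No interpolation is attempted.
drop_incomplete <- function(sightings, covariates) {
  covs <- tolower(covariates)
  keep <- rep(TRUE, nrow(sightings))
  if ("lnsstot" %in% covs) keep <- keep & sightings$ss_valid
  if ("bft" %in% covs) keep <- keep & !is.na(sightings$Bft)
  sightings[keep, , drop = FALSE]
}

# The same filtered table is used for fitting and for estimation.
used_both <- drop_incomplete(sightings, c("LnSsTot", "Bft"))
used_bft  <- drop_incomplete(sightings, "Bft")
```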

(2) We add a function (working title lta_checks()) that lets you quickly check for missing data in sightings from your focal year. The function can use the same input lists df_settings, fit_filters, and estimates that are provided to the lta() function. It tells the analyst which sightings have missing data, which allows the analyst to prepare coded edits that fill in gaps within the cruz object before they run lta(). This gives the analyst full discretion for how to fill in missing values (e.g., interpolation or some other solution of their own choice); the vignette could then provide examples for how to do this for missing group size estimates and missing Bft values.
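For discussion, here is one way such a check could behave. This is a sketch only: `check_missing_sketch()` is a hypothetical stand-in, not the eventual lta_checks(), and it takes a plain data frame and a covariate vector rather than the df_settings, fit_filters, and estimates lists. The column names (`year`, `ss_valid`, `Bft`) are assumptions.

```r
# Hypothetical stand-in for the proposed check (not the real lta_checks()).
# Returns focal-year sightings that lack data needed by the covariates,
# with a column naming what is missing.
check_missing_sketch <- function(sightings, covariates, focal_year) {
  focal <- sightings[sightings$year == focal_year, , drop = FALSE]
  covs <- tolower(covariates)
  problems <- rep("", nrow(focal))
  if ("lnsstot" %in% covs) {
    bad <- !focal$ss_valid
    problems[bad] <- paste0(problems[bad], "group size; ")
  }
  if ("bft" %in% covs) {
    bad <- is.na(focal$Bft)
    problems[bad] <- paste0(problems[bad], "Bft; ")
  }
  # Drop the trailing separator and keep only rows with something missing.
  focal$missing <- sub("; $", "", problems)
  focal[problems != "", , drop = FALSE]
}

# Made-up example data.
sightings <- data.frame(
  id       = 1:4,
  year     = c(2017, 2020, 2020, 2020),
  ss_valid = c(FALSE, TRUE, FALSE, TRUE),
  Bft      = c(NA, 2, NA, NA)
)
flagged <- check_missing_sketch(sightings, c("LnSsTot", "Bft"), 2020)
```

The analyst could then use the `id` values in `flagged` to decide which coded edits to prepare before running lta().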

I think this solution will simplify the lta() code (and code in other functions too) and help users feel more in control over how missing data are handled.

What do you think?

amandalbradford commented 3 months ago

@ericmkeen - I agree that your proposed solution is the way to go. We are most worried about treatment of sightings in our focal survey/year, as opposed to "imperfect" sightings from previous surveys/years that could join the sightings pool. I like the check function and allowing the user to specify their own correction method. The user will have to be careful to track the potential for "missing" incomplete sightings, but we can make this clear in the vignette. Thank you!

ericmkeen commented 3 months ago

I have implemented this change. In the process I have improved/streamlined the code for determining whether a sighting has ss_valid = TRUE or ss_valid = FALSE. Changes were made in process_sightings(), group_size(), and group_size_calibration().

To make sure the method for assigning the ss_valid status is clear, here is an outline of the workflow for determining it:

For each sighting, loop through each observer's estimate of group size:

ericmkeen commented 3 months ago

Working on lta_checks() now; once all changes are made I will test the workflow with the WHICEAS analysis.

amandalbradford commented 3 months ago

Hi @ericmkeen - this looks good, but I want to discuss the last bullet. I don't know if we want to totally discount observer estimates for mixed-species groups if they don't include a percentage. Sometimes observers have a total group size, but they don't have a good feel for percentages by species. Since the process has been to average the best estimates, then average the percentages, and then apply the proportions to the estimates, I still think their estimates could be used. The issue here comes when NO observers provide percentages. What do you think?
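A small worked example of that averaging, with made-up numbers. The species labels and values are hypothetical, and a plain arithmetic mean is assumed for both the best estimates and the percentages; the package may average differently.

```r
# Hypothetical estimates from three observers of one mixed-species
# sighting. Observer 3 gave a total group size but no percentages.
best <- c(40, 50, 60)
pct <- rbind(
  obs1 = c(sp_a = 70, sp_b = 30),
  obs2 = c(sp_a = 50, sp_b = 50),
  obs3 = c(sp_a = NA, sp_b = NA)
)

# Average the best estimates across all observers, including obs3.
mean_best <- mean(best)

# Average the percentages across only the observers who provided them.
mean_pct <- colMeans(pct, na.rm = TRUE)

# Apply the averaged proportions to the averaged total.
species_size <- mean_best * mean_pct / 100
```

Here observer 3 still contributes to the total of 50, which is split 30 and 20 using the other two observers' percentages. If every row of `pct` were NA, `mean_pct` would be NaN, which is the no-percentages case raised above.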

ericmkeen commented 3 months ago

Thanks @amandalbradford, I think my language was confusing on that last bullet point, so here's an attempt at clarification:

Does this clarify? Very possible I am still confused!

amandalbradford commented 3 months ago

Thanks @ericmkeen - that's great and reflects how we've handled percentages in the past, while offering a better way for flagging ones without (ABUND used to simply remove them).

ericmkeen commented 3 months ago

Sounds good! I implemented these changes and re-ran the WHICEAS analysis to look for any bugs and discrepancies, and everything looks good. Closing this issue.