afsc-gap-products / gapindex

Calculation of Design-Based Indices of Abundance and Composition for AFSC GAP Bottom Trawl Surveys
https://afsc-gap-products.github.io/gapindex/
Other
4 stars 3 forks source link

AREA_ID = 99902 not returned from calc_biomass_subarea() #42

Closed EmilyMarkowitz-NOAA closed 6 months ago

EmilyMarkowitz-NOAA commented 9 months ago

Issue Description

It appears that the total area calculation for NBS (AREA_ID = 99902) complexes is not being returned by the calc_biomass_subarea function. This works fine for EBS, but just not for NBS. Below I take one fish complex and one invert compex to illustrate following the instructions from the package documentation for preparing complexes.

Is it something that I am doing or something funky with the function? Thanks! Tagging in @sowasser for awareness.

Steps to Reproduce

Sign into oracle

sql_channel <- gapindex::get_connected()

Create complex data

## Pull data. Note the format of the `spp_codes` argument with the GROUP column
library(gapindex)
production_data <- get_data(
  year_set = c(1982:2023),
  survey_set = c("EBS", "NBS"),
  spp_codes = rbind(data.frame(GROUP = "all flatfishes", SPECIES_CODE = c(10000:10399)), 
                    data.frame(GROUP = "neptune welks", SPECIES_CODE = c(71884, 71882))),
  pull_lengths = TRUE, 
  haul_type = 3, 
  abundance_haul = "Y",
  sql_channel = channel)

## Zero-fill and calculate CPUE
production_cpue <- calc_cpue(racebase_tables = production_data)

## Calculate Biomass, abundance, mean CPUE, and associated variances by stratum
production_biomass_stratum <- 
  gapindex::calc_biomass_stratum(racebase_tables = production_data,
                                 cpue = production_cpue)

## Aggregate Biomass to subareas and region
production_biomass_subarea <- 
  calc_biomass_subarea(racebase_tables = production_data, 
                       biomass_strata = production_biomass_stratum)

## Calculate size composition by stratum. Note fill_NA_method == "BS" because
## our region is EBS, NBS, or BSS. If the survey region of interest is AI or 
## GOA, use "AIGOA". See ?gapindex::gapindex::calc_sizecomp_stratum for more
## details. 
production_sizecomp_stratum <- 
  gapindex::calc_sizecomp_stratum(
    racebase_tables = production_data,
    racebase_cpue = production_cpue,
    racebase_stratum_popn = production_biomass_stratum,
    spatial_level = "stratum",
    fill_NA_method = "BS")

## Aggregate size composition to subareas/region
production_sizecomp_subarea <- gapindex::calc_sizecomp_subarea(
  racebase_tables = production_data,
  size_comps = production_sizecomp_stratum)

## rbind stratum and subarea/region biomass estimates into one dataframe
names(x = production_biomass_stratum)[
  names(x = production_biomass_stratum) == "STRATUM"
] <- "AREA_ID"
production_biomass <- rbind(production_biomass_stratum, 
                            production_biomass_subarea)

## rbind stratum and subarea/region biomass estimates into one dataframe
names(x = production_sizecomp_stratum)[
  names(x = production_sizecomp_stratum) == "STRATUM"] <- "AREA_ID"
production_sizecomp <- 
  rbind(production_sizecomp_subarea,
        production_sizecomp_stratum[, names(production_sizecomp_subarea)])

image

Biomass complex data

Here I check the frequency of times a calculation for a SURVEY_DEFINITION_ID, AREA_ID, SPECIES_CODE combination was made for biomass estimates. This should:

  1. Include an entry for 99902 (NBS Total Area), which it does not.
  2. be equal to the number of years there are data for that combination in the resultant data. This appears once duplicated, for some reason. SURVEY_DEFINITION_ID = 98, AREA_ID = 99901, SPECIES_CODE = "all flatfishes" should have 41 years of data, but returns 82. I checked and the duplicates are actually duplicates and removed with a unique() or dplyr::distinct(), as shown in second chunk.
library(dplyr)
production_biomass %>% 
  dplyr::select(SURVEY_DEFINITION_ID, AREA_ID, SPECIES_CODE) %>% 
  table() %>% 
  data.frame() %>% 
  dplyr::filter(Freq != 0) %>% 
  dplyr::arrange(desc(AREA_ID)) %>% 
  head()

image

production_biomass %>% 
  unique() %>% # to address point 2
  dplyr::select(SURVEY_DEFINITION_ID, AREA_ID, SPECIES_CODE) %>% 
  table() %>% 
  data.frame() %>% 
  dplyr::filter(Freq != 0) %>% 
  dplyr::arrange(desc(AREA_ID)) %>% 
  head()

image

Sizecomp complex data

Here I check the frequency of times a calculation for a SURVEY_DEFINITION_ID, AREA_ID, SPECIES_CODE combination was made for sizecomp estimates. I show this because it works as expected:

  1. The resultant table includes AREA_ID = 99902
  2. Rows are not duplicated.
production_sizecomp %>% 
  dplyr::select(SURVEY_DEFINITION_ID, AREA_ID, SPECIES_CODE) %>% 
  table() %>% 
  data.frame() %>% 
  dplyr::arrange(desc(AREA_ID)) %>% 
  head(10)

image

zoyafuso-NOAA commented 9 months ago

Hey you found a bug! Nice. Yes, I can recreate this bug on my side. Short term solution is to loop through regions instead of listing two regions in the get_data call. Medium term solution is the next version of gapindex will have a corrected version of the subarea calculation. After that version is released I will close this issue. Thanks for finding this.

EmilyMarkowitz-NOAA commented 9 months ago

Oh great! Glad it is not me :) Thanks for all of your hard work on this great package and looking forward to the next release!

zoyafuso-NOAA commented 9 months ago

Hey Emily, okay try downloading the dev version of gapindex, rerun the script and let me know if the issues resolves itself. I think I found where the bug lives.

library(devtools)
devtools::install_github("afsc-gap-products/gapindex")

I'll merge it into the main branch if it goes well.

sowasser commented 9 months ago

Hey Zack, I'll do this now. Thank you!!

EmilyMarkowitz-NOAA commented 6 months ago

Fixed!