Can make_idw_stack() create shapefiles limited survey area species are found in each year?

EmilyMarkowitz-NOAA commented 10 months ago

Issue

Thanks so much for making the akgfmaps::make_idw_stack() function()! I think this issue can be considered a follow up to https://github.com/afsc-gap-products/akgfmaps/issues/28. Currently, when the function is run, it assumes that survey region is the same for all unique_groups (in this case, years) of data. While this is typically true and there a species will typically inhabit one survey region over time, there are cases when we need to plot over the EBS and NBS regions and the fish may only sometimes be in one of the two regions in a given year. In those cases (e.g., organism A is in the EBS and the NBS in year 1 and only in the EBS in year 2), the function will inapproraitely plot across the two regions in year 2 when the organism wasn't there (first figure). Below I've constructed a reproducible example of my work around for combining these plots when an organism inconsistently exists across the two regions (second figure).

The Ask: Is there a savvy way to incorporate stacking plots in different regions (something like below) into your akgfmaps::make_idw_stack() function()? If not, I have my work around, but if so, great!

General code to pull and prep data for both examples

# Sign into Oracle -------------------------------------------------------------
library(rstudioapi)
library(RODBC)
channel <- odbcConnect(dsn = "AFSC", 
                       uid = rstudioapi::showPrompt(title = "Username", 
                                                    message = "Oracle Username", default = ""), 
                       pwd = rstudioapi::askForPassword("Enter Password"),
                       believeNRows = FALSE)

# libraries --------------------------------------------------------------------
library(janitor)
library(dplyr)
library(akgfmaps)
library(ggplot2)

# Knowns -----------------------------------------------------------------------
yrs <-  sort(c(2021, 2022, 2023), decreasing = TRUE)
row0 <- 1
spp_print <- "longhead dab"
spp_code <- 10211
SRVY1 <- c(98,143)
map.area0 <- "bs.all"
viridis_palette_option <- "mako"
key.title <- paste0(stringr::str_to_sentence(spp_print),
                    ' Weight CPUE (kg/km²)')

# Plot function ----------------------------------------------------------------
plot0 <- function(reg_dat0, 
                  stars0, 
                  set.breaks, 
                  key.title){
  figure <- ggplot() +
    geom_stars(data = stars0,
               na.rm = FALSE,
               show.legend = TRUE)  +
    ggplot2::scale_fill_manual(
      name = key.title,
      values =  c("gray90",
                  viridis::viridis(
                    option = viridis_palette_option,
                    direction = -1,
                    n = length(set.breaks)-1,
                    begin = 0.20,
                    end = 0.80)),
      na.translate = FALSE, # Don't use NA
      drop = FALSE) +
    ggplot2::facet_wrap( ~ year, nrow = row0) + 
    coord_sf() +
    ggplot2::guides(
      fill = guide_legend(
        order = 1,
        title.position = "top",
        label.position = "bottom",
        title.hjust = 0.5,
        override.aes = list(color = NA),
        nrow = 1),
      color = "none")  +
    ggplot2::theme(
      legend.position = "bottom",
      legend.box = "horizontal")

  return(figure)
}

# Pull data --------------------------------------------------------------------
table_raw0 <- RODBC::sqlQuery(channel = channel, 
                              query = paste0(
                                "SELECT 
cc.CPUE_KGKM2, 
cc.SPECIES_CODE, 
cr.SURVEY_DEFINITION_ID, 
cr.YEAR, 
hh.LATITUDE_DD_START, 
hh.LONGITUDE_DD_START
FROM GAP_PRODUCTS.AKFIN_CPUE cc
LEFT JOIN GAP_PRODUCTS.AKFIN_HAUL hh
ON cc.HAULJOIN = hh.HAULJOIN
LEFT JOIN GAP_PRODUCTS.AKFIN_CRUISE cr
ON hh.CRUISEJOIN = cr.CRUISEJOIN
WHERE cc.SPECIES_CODE IN (",paste0(spp_code, collapse = ", "),")
AND cr.SURVEY_DEFINITION_ID IN (",paste0(SRVY1, collapse = ", "),")
AND cr.YEAR IN (",paste0(yrs, collapse = ", "),");")) %>% 
  janitor::clean_names() %>% 
  dplyr::arrange(desc(year))  %>% 
  dplyr::mutate(common_name = spp_print)

# prep example
table_raw <- table_raw0 %>% 
  dplyr::filter(#include both survey regions in 2023
    !(year == 2021 & survey_definition_id == 98) &  # remove EBS from 2021
      !(year == 2022 & survey_definition_id == 143)) # remove NBS from 2022

(table(table_raw[,c("survey_definition_id", "year")])) # check

dat = table_raw; var = "cpue_kgkm2"
set.breaks <- round(classInt::classIntervals(var = as.numeric(unlist(dat[,var]))[as.numeric(unlist(dat[,var])) != 0], 
                                             n = 5, style = "jenks")$brks)

SRVY	2021	2022	2023
98	0	376	376
143	144	0	116

Test 1: Create IDW stack where the IDW interpolates across all regions for all years

# Create IDW stack using all regions for all data ------------------------------
stars0 <- make_idw_stack(
  x = table_raw %>% 
    dplyr::rename(COMMON_NAME = common_name,
                  LATITUDE = latitude_dd_start,
                  LONGITUDE = longitude_dd_start,
                  CPUE_KGHA = cpue_kgkm2),
  region = "bs.all", 
  grouping.vars = "year", 
  set.breaks = set.breaks)

# This restructuring part isn't stricytly necessary, just makes comparing this step and next easier
stars0 <- data.frame(stars0$extrapolation.stack) %>% 
  tidyr::pivot_longer(cols = c(dplyr::starts_with("X2")), 
                      names_to = "year", values_to = "bin") %>% 
  dplyr::mutate(year = as.numeric(gsub(pattern = "X", replacement = "", x = year))) %>% 
  stars::st_as_stars(df, dims = c("x", "y", "year"))

(fig1 <- plot0(reg_dat0, stars0, set.breaks, key.title))

The problem: the shapes bleed into areas they aren't supposed to.

Test 2: Create IDW stack using proper regions for all data

Here I have to

create IDWs seperately for each relevant region
extract full area points from the outputted region-specific stars object as data.frames
bind all of the data.frames and re-star-ify the shapes
plot

This is messy and not ideal, but works! I bet you've got a clever solution for this. If youv've told me in the past how to work this (a bit but not quite in https://github.com/afsc-gap-products/akgfmaps/issues/28), please remind me and sorry in advance!

# find which years need to be prepared with which region # crude, but here we are
(a <- table_raw %>% 
   dplyr::select(year, survey_definition_id) %>% 
   dplyr::distinct()  %>% 
   dplyr::mutate(survey_definition_id = dplyr::case_when(
     survey_definition_id == 98 ~ "EBS", 
     survey_definition_id == 143 ~ "NBS"
   )) %>% 
   tidyr::pivot_wider(id_cols = year, names_from = survey_definition_id, values_from = survey_definition_id) %>% 
   dplyr::mutate(case = dplyr::case_when(
     !is.na(EBS) & !is.na(NBS) ~ "both", 
     !is.na(EBS) ~ "EBS", 
     !is.na(NBS) ~ "NBS"
   )))

# create IDWs seperately for each relevant region ------------------------------
raster_test <- stars0 <- make_idw_stack(
  x = table_raw %>% 
    dplyr::rename(COMMON_NAME = common_name,
                  LATITUDE = latitude_dd_start,
                  LONGITUDE = longitude_dd_start,
                  CPUE_KGHA = cpue_kgkm2) %>% 
    dplyr::filter(year %in% a$year[1]),
  region = map.area0, 
  grouping.vars = "year", 
  set.breaks = set.breaks)$extrapolation.stack %>% 
  data.frame() %>% 
  dplyr::select(x, y)

print("NEBS")
if ("both" %in% a$case) { # spans across EBS and NBS
  raster <- make_idw_stack(
    x = table_raw %>% 
      dplyr::rename(COMMON_NAME = common_name,
                    LATITUDE = latitude_dd_start,
                    LONGITUDE = longitude_dd_start,
                    CPUE_KGHA = cpue_kgkm2) %>% 
      dplyr::filter(year %in% a$year[a$case == "both"]),
    region = "bs.all", 
    grouping.vars = "year", 
    set.breaks = set.breaks)

  # extract full area points from the outputted region-specific stars object as data.frames
  stars0 <- dplyr::bind_cols(
    stars0,
    raster$extrapolation.stack %>% 
      stars::st_extract(at = as.matrix(raster_test)) %>% 
      data.frame()  )
}

print("EBS")
if ("EBS" %in% a$case) { # only in EBS 
  raster <- make_idw_stack(
    x = table_raw %>% 
      dplyr::rename(COMMON_NAME = common_name,
                    LATITUDE = latitude_dd_start,
                    LONGITUDE = longitude_dd_start,
                    CPUE_KGHA = cpue_kgkm2) %>% 
      dplyr::filter(year %in% a$year[a$case == "EBS"]),
    region = "bs.south", 
    grouping.vars = "year", 
    set.breaks = set.breaks)

  stars0 <- dplyr::bind_cols(
    stars0,
    raster$extrapolation.stack %>% 
      stars::st_extract(at = as.matrix(raster_test)) %>% 
      data.frame() )
}

print("NBS")
if ("NBS" %in% a$case) { # only in NBS 
  raster <- make_idw_stack(
    x = table_raw %>% 
      dplyr::rename(COMMON_NAME = common_name,
                    LATITUDE = latitude_dd_start,
                    LONGITUDE = longitude_dd_start,
                    CPUE_KGHA = cpue_kgkm2) %>% 
      dplyr::filter(year %in% a$year[a$case == "NBS"]),
    region = "bs.north", 
    grouping.vars = "year", 
    set.breaks = set.breaks)

  stars0 <- dplyr::bind_cols(
    stars0,
    raster$extrapolation.stack %>% 
      stars::st_extract(at = as.matrix(raster_test)) %>% 
      data.frame() )
}

# bind all of the data.frames and re-star-ify the shapes -----------------------
stars00 <- stars0 %>%
  tidyr::pivot_longer(cols = c(dplyr::starts_with("X2")),
                      names_to = "year", values_to = "bin") %>%
  dplyr::mutate(year = as.numeric(gsub(pattern = "X", replacement = "", x = year))) %>%
  stars::st_as_stars(df, dims = c("x", "y", "year"))

viridis_palette_option <- "mako"
key.title <- paste0(stringr::str_to_sentence(spp_print),
                    ifelse(report_title == "ptpres", " ", '\n'), 
                    'Weight CPUE (kg/km²)')
# plot -------------------------------------------------------------------------
(fig2 <- plot0(reg_dat0, stars0 = stars00, set.breaks, key.title))

year	EBS	NBS	case
2023	EBS	NBS	both
2022	EBS	NA	EBS
2021	NA	NBS	NBS

Properly plots species distributions limited to the survey areas they were found in.

sean-rohan-NOAA commented 10 months ago

Hi Em,

I'm not entirely clear on what you're asking for in terms of development. In the first set of plots, the shapes "bleed" between regions because of the properties of the data. There are no zeros in the region without data and the closest value to the region along the boundary had catch > 0.

The inverse distance weighting functions aren't supposed to 'decide' which data to use for interpolation. It's expected that the user will make the decisions about which data to include in the interpolation. For example, the built-in 2017 YFS data set only includes the SEBS but you could use the data to predict densities in the NBS if you wanted to (certainly not advisable).

If the issue is that the areas don't align between regions after generating predictions for individual regions, you can interpolate the between the EBS and NBS instead then mask to the region of interest. If that's the issue, I'd recommend setting set.crs = "EPSG:3338" to use Alaska AEA for both regions.

Either way, I would strongly advise using simple features outputs instead of stars for generating plots. Code examples below using the built-in YFS data set.

library(akgfmaps)

ebs_layers <- akgfmaps::get_base_layers(select.region = "ebs", 
                                        set.crs = "EPSG:3338")

sample_locations <- sf::st_as_sf(akgfmaps::YFS2017,
                                 coords = c("LONGITUDE", "LATITUDE"),
                                 crs = "WGS84") |>
  sf::st_transform(crs = "EPSG:3338")

idw <- make_idw_stack(x = dplyr::bind_rows(akgfmaps::YFS2017,
                                           dplyr::mutate(akgfmaps::YFS2017, YEAR = 999)), 
                      grouping.vars = "YEAR",
                      out.crs = "EPSG:3338",
                      region = "ebs",
                      extrapolation.grid.type = "sf")

ggplot() +
  geom_sf(data = idw$extrapolation.stack,
          mapping = aes(fill = var1.pred),
          color = NA) +
  geom_sf(data = sample_locations) +
  scale_fill_viridis_d() +
  facet_wrap(~YEAR)

# Note here that the output already includes the survey region. You could filter this to exclude a region without data and it wouldn't affect estimation, etc because IDW is deterministic.
ggplot() +
  geom_sf(data = idw$extrapolation.stack,
          mapping = aes(fill = var1.pred)) +
  facet_wrap(~SURVEY)

EmilyMarkowitz-NOAA commented 10 months ago

Thanks for this explanation. Will work on implementing your suggestion above and let you know if I have any further questions.

afsc-gap-products / akgfmaps