Issues manually specifying jellyfish cpue bins

EmilyMarkowitz-NOAA commented 10 months ago

Issue

tldr: IDW code works when specifying jellyfish cpue bins at set.breaks <- c(0, 249, 683, 1365, 2794, 7200), but not at set.breaks <- c(0, 0.1, 2, 7, [max value]).

In the 2023 NBS Community Highlights document, we publish distribution plots of jellyfish using the akgfmaps R package:

One of the issues regarding this plot I wasn't able to address before publication this year (but hope to in future years) was:

"Have you tried different breaks for jellyfish? This set of breaks doesn't appear to provide much of a contrast and the 2700-7300 spans less than an order of magnitude. I wonder if jellyfish would appear to be widely distributed with high density hotspots if different breaks were used. For example: >0-50, >50-100, >100-1000, >1000-10000, 10000-59236"

You also sent a helpful paper to reference useful breaks. I can't find the paper you referenced, but the breaks were something like set.breaks <- c(0, 0.01, 0.1, 0.5, 1, 2, 3, 7, [max value]).

The Ask: It's a good idea. However, the script took too long (aka failed) to complete when I tried to apply this new break and I had to put this improvement on the backburner. Why is implementing this set.break 'breaking' the script? I also tried using less breaks (e.g., set.breaks <- c(0, 0.1, 2, 7, [max value])) but that didn't seem to help. Here's an example of what I tried:

Reproducible example

Set knowns and pull data

# libraries --------------------------------------------------------------------
library(janitor)
library(dplyr)
library(akgfmaps)
library(ggplot2)

# Knowns -----------------------------------------------------------------------
yrs <-  sort(c(2022, 2023), decreasing = TRUE)
row0 <- 1
spp_print <- "jellyfish"
spp_code <- c(40500, 40501, 40502, 40503, 40504, 40505, 40506, 40507, 40508, 40509, 40510, 40511, 40512, 40513, 40515, 40519, 40520, 40560, 40561)
SRVY1 <- c(98,143)
map.area0 <- "bs.all"
viridis_palette_option <- "mako"
key.title <- paste0(stringr::str_to_sentence(spp_print),
                    ' Weight CPUE (kg/km²)')
reg_dat0 <- akgfmaps::get_base_layers(
  select.region = "bs.all", 
  set.crs = "EPSG:3338")

# Plot function ----------------------------------------------------------------
plot0 <- function(reg_dat0, 
                  stars0, 
                  set.breaks, 
                  key.title){
  figure <- ggplot() +
    geom_stars(data = stars0,
               na.rm = FALSE,
               show.legend = TRUE)  +
    ggplot2::scale_fill_manual(
      name = key.title,
      values =  c("gray90",
                  viridis::viridis(
                    option = viridis_palette_option,
                    direction = -1,
                    n = length(set.breaks)-1,
                    begin = 0.20,
                    end = 0.80)),
      na.translate = FALSE, # Don't use NA
      drop = FALSE) +
    ggplot2::facet_wrap( ~ year, nrow = row0) + 
    coord_sf() +
    ggplot2::guides(
      fill = guide_legend(
        order = 1,
        title.position = "top",
        label.position = "bottom",
        title.hjust = 0.5,
        override.aes = list(color = NA),
        nrow = 1),
      color = "none")  +
    ggplot2::theme(
      legend.position = "bottom",
      legend.box = "horizontal")

  return(figure)
}

# Sign into Oracle -------------------------------------------------------------
library(rstudioapi)
library(RODBC)
channel <- odbcConnect(dsn = "AFSC", 
                       uid = rstudioapi::showPrompt(title = "Username", 
                                                    message = "Oracle Username", default = ""), 
                       pwd = rstudioapi::askForPassword("Enter Password"),
                       believeNRows = FALSE)

# Pull data --------------------------------------------------------------------
## Pull data. Note the format of the `spp_codes` argument with the GROUP column
library(gapindex)
production_data <- get_data(
  year_set = 2022:2023,
  survey_set = c("EBS", "NBS"),
  spp_codes = data.frame(GROUP = 1, SPECIES_CODE = spp_code),
  pull_lengths = FALSE, 
  haul_type = 3, 
  abundance_haul = "Y",
  sql_channel = channel)

# Wrangle data ----------------------------------------------------------------
# Zero-fill and calculate CPUE
production_cpue <- calc_cpue(racebase_tables = production_data)

table_raw <- production_cpue %>% 
  janitor::clean_names() %>% 
  dplyr::mutate(year = substr(x = cruise, start = 1, stop = 4)) %>% 
  dplyr::select(survey_definition_id, year, latitude_dd_start, longitude_dd_start, cpue_kgkm2) %>% 
  dplyr::arrange(desc(year))  %>% 
  dplyr::mutate(common_name = spp_print)

Create IDW stack and plot

Test 1: (auto-prepared jenks)

This code runs quickly and without issue.

EDIT: By working through this issue's reproducible example, I found a different error in my code (um, yay?) so now the plot naturally, using the auto-prepared set.breaks, makes much more sense and provides much better contrast (see figure). Needless to say, I'll be replacing the jellyfish figure that is in the community report. Regardless, I still have the same issue discussed above with implementing the paper-inspired breaks.

# Set Breaks # Option 1: WORKS ------------------------------------------------
# Option 1: WORKS
dat = table_raw; var = "cpue_kgkm2"
(set.breaks <- round(classInt::classIntervals(var = as.numeric(unlist(dat[,var]))[as.numeric(unlist(dat[,var])) != 0], 
                         n = 5, style = "jenks")$brks))
# [1]    0  249  683 1365 2794 7200

# Run make_idw_stack -----------------------------------------------------------
start.time <- Sys.time()
stars0 <- make_idw_stack(
  x = table_raw %>% 
    dplyr::rename(COMMON_NAME = common_name,
                  LATITUDE = latitude_dd_start,
                  LONGITUDE = longitude_dd_start,
                  CPUE_KGHA = cpue_kgkm2),
  region = "bs.all", 
  grouping.vars = "year", 
  set.breaks = set.breaks)
end.time <- Sys.time()
time.taken <- round(end.time - start.time,2)
time.taken

# This restructuring part isn't stricytly necessary, just makes comparing this step and next easier
stars0 <- data.frame(stars0$extrapolation.stack) %>% 
  tidyr::pivot_longer(cols = c(dplyr::starts_with("X2")), 
                      names_to = "year", values_to = "bin") %>% 
  dplyr::mutate(year = as.numeric(gsub(pattern = "X", replacement = "", x = year))) %>% 
  stars::st_as_stars(df, dims = c("x", "y", "year"))

plot0(reg_dat0, stars0, set.breaks, key.title)

Time difference of 42.64 secs

Test 2: (bins specified from paper)

I ran this script over night (several times after restarting) and I can't get it to run past the below point. I don't know why it is getting hung up. Ideas?

# Set Breaks # Option 2: Doesn't work :(  ------------------------------------------------
set.breaks <- c(0, #0.01, 
                  0.1,
                  0.5, 
                  1,#2,3,
                  7,
                  set.breaks[length(set.breaks)]) 

# Run make_idw_stack -----------------------------------------------------------
stars0 <- make_idw_stack(
  x = table_raw %>% 
    dplyr::rename(COMMON_NAME = common_name,
                  LATITUDE = latitude_dd_start,
                  LONGITUDE = longitude_dd_start,
                  CPUE_KGHA = cpue_kgkm2),
  region = "bs.all", 
  grouping.vars = "year", 
  set.breaks = set.breaks)

[inverse distance weighted interpolation] [inverse distance weighted interpolation] |

sean-rohan-NOAA commented 10 months ago

Having trouble connecting to the VPN and will revisit. At first glance, I'd recommend setting make_idw_stack(extrapolation.grid.type = "sf") or make_idw_stack(extrapolation.grid.type = "sf.simple")

Then switch from geom_stars() to geom_sf() in your plot0 function.

sean-rohan-NOAA commented 10 months ago

The breaks in the second figure look way better than the breaks from the community report and I think do a much better job of representing spatial variability. I'm not sure why it's so slow using the small breaks, but the breaks from the paper were for kg/ha, not kg/km2 so your values are off by a factor of 100.

Got my VPN working and this takes ~45 seconds:


start_time <- Sys.time()
jellyfish <- table_raw |>
  dplyr::rename(LONGITUDE = longitude_dd_start,
                LATITUDE = latitude_dd_start,
                CPUE_KGHA = cpue_kgkm2) |>
  make_idw_stack(region = "ebs", 
                 grouping.vars = "year", 
                 set.breaks = c(0, 1, 10, 100, 1000, max(table_raw$cpue_kgkm2)),
                 extrapolation.grid.type = "sf")
end_time <- Sys.time()
end_time-start_time

ggplot() +
  geom_sf(data = jellyfish$extrapolation.stack,
          mapping = aes(fill = var1.pred),
          color = NA) +
  facet_wrap(~year) +
  scale_fill_viridis_d(direction = -1)

EmilyMarkowitz-NOAA commented 10 months ago

Oh! Good catch on converting to kg/ha - can't believe I missed that. Just to be sure, do you still suggest using the c(0, 1, 10, 100, 1000, max(table_raw$cpue_kgkm2)) breaks over the default breaks presented in the issue c(0, 249, 683, 1365, 2794, 7200)? Or is your vote for the latter/default breaks?

sean-rohan-NOAA commented 10 months ago

I'd vote for c(0, 1, 10, 100, 1000, max(table_raw$cpue_kgkm2)).

In general, I would advise against using 'default' breaks because they may not provide a good representation of distribution for a species. The default setting for set.breaks is Jenks simply because that's the default setting for breaks in ArcMap. I think it can be quite hit-or-miss whether Jenks or any other spatially-agnostic method of break selection provides a good representation of density distributions.

For all species, my general recommendation has been to manually specify breaks for each species and use the same breaks every year to facilitate comparison among years, although I understand that can be rather time consuming to do for every species. However, I suspect the breaks from the old tech memos may work reasonably well for quite a few species.

EmilyMarkowitz-NOAA commented 10 months ago

Unfortunately, I don't have the bandwidth right now to find and implement new manual bins for each species for this year's report. Worst case, we can add this to the to-do list for next year's reports. I appreciate that we have the old reports to look back on for these manual bins, but I suspect that those breaks also need review. If you currently have specific recommendations of what bins we should use for a taxon, can you add them here in the dist_bin column? I'll incorporate manual bins if they're there and use jenks if they are not. I already added the new jellyfish bins to the spreadsheet.

EmilyMarkowitz-NOAA commented 10 months ago

Thanks for your help and insight on this!

afsc-gap-products / akgfmaps