Climate zones: check the better data classification based on Koppen-Geiger

OndrejMottl commented 1 year ago

The current classification of 15 climate zones does not work as some zones are still very large.

we need to

[x] find a better way to classify data in order to make the classes have enough data and cover a relatively small area
[ ] rerun all analyses, where classification is used
- general predictor trends
- H1 summary
- H2 analyses

OndrejMottl commented 1 year ago

This is relevant to #98

OndrejMottl commented 1 year ago

I have been looking into the possible ways to make a new classification based on the Koopen-Geiger climate zones.

The GH classification has 3 levels (see wiki), of those we are using 1st (5 classes) and 3rd (30 classes). In addition, John made some alterations to the 2nd (now 15 classes).

I decided to review the process, here is a reproducible script:

#----------------------------------------------------------#
# 0. Setup -----
#----------------------------------------------------------#

library(here)

# Load configuration
source(
  here::here(
    "R/00_Config_file.R"
  )
)

maximum_n_records_to_run_cluster <- 200
optimum_max_n_records_to_run_cluster <- 100
minumum_n_records_to_run_cluster <- 5

#----------------------------------------------------------#
# 2. load data -----
#----------------------------------------------------------#

data_meta <-
  targets::tar_read(
    name = "data_meta",
    store = paste0(
      data_storage_path,
      "_targets_h1"
    )
  )

# we are not interested
data_work <-
  data_meta %>%
  # we are not interested in Africa
  dplyr::filter(region != "Africa") %>%
  # we are want to filter out weird islands in the ocean
  dplyr::mutate(
    is_oceania_outlier = ifelse(region == "Oceania" & long < -100, TRUE, FALSE)
  ) %>%
  dplyr::filter(is_oceania_outlier == FALSE)

#----------------------------------------------------------#
# 3. helper functions -----
#----------------------------------------------------------#

get_classification_plot <- function(data_source) {
  data_source %>%
    dplyr::group_by(region, sel_classification) %>%
    dplyr::summarise(
      .groups = "drop",
      count = n()
    ) %>%
    dplyr::mutate(
      dplyr::across(
        tidyselect::where(
          is.character
        ),
        as.factor
      )
    ) %>%
    ggplot2::ggplot(
      mapping = ggplot2::aes(
        x = reorder(sel_classification, -count),
        y = count,
        fill = sel_classification
      )
    ) +
    ggplot2::facet_wrap(~region) +
    ggplot2::geom_bar(
      stat = "identity"
    ) +
    ggplot2::geom_hline(
      yintercept = maximum_n_records_to_run_cluster
    ) +
    ggplot2::geom_hline(
      yintercept = optimum_max_n_records_to_run_cluster,
      lty = "dashed"
    ) +
    ggplot2::geom_hline(
      yintercept = minumum_n_records_to_run_cluster,
      lty = "dotted"
    ) +
    ggplot2::scale_y_continuous(
      trans = "log",
      breaks = c(0, 1, 3, 5, 10, 20, 50, 100, 150, 200, 300, 500)
    ) +
    ggplot2::theme_bw() +
    ggplot2::theme(
      legend.position = "none",
      axis.text.x = ggplot2::element_text(angle = 90, hjust = 1, vjust = 0.5),
      axis.title.x = ggplot2::element_blank()
    ) +
    ggplot2::labs(
      y = "N records"
    )
}

get_map_plot <- function(data_source) {
  ggplot2::ggplot(
    data = data_source,
    mapping = ggplot2::aes(
      x = long,
      y = lat,
      col = sel_classification
    )
  ) +
    ggplot2::borders(
      fill = "gray90",
      col = NA
    ) +
    ggplot2::geom_point() +
    ggplot2::theme(legend.position = "bottom") +
    ggplot2::coord_quickmap(
      xlim = range(data_source$long),
      ylim = range(data_source$lat)
    ) +
    ggplot2::theme_void()
}

get_classification_map_by_continent <- function(data_source) {
  data_source %>%
    split(.$region) %>%
    purrr::map(
      .f = ~ ggpubr::ggarrange(
        get_classification_plot(.x),
        get_map_plot(.x),
        widths = c(0.5, 1),
        common.legend = TRUE,
        legend = "none"
      )
    ) %>%
    ggpubr::ggarrange(
      plotlist = .,
      common.legend = TRUE,
      legend = "none"
    )
}

There I will decide to select 3 important values:

maximum_n_records_to_run_cluster (200) is the max we can MAYBE run before the cluster cuts us off
optimum_max_n_records_to_run_cluster (100) should run without any issue on the supercomputer
minumum_n_records_to_run_cluster (5) is the minimal number of records we need to sum any results (and build a general trend of the predictor

My goal was to start with the most coarse and only go more fine in climate zones, which are too large.

#----------------------------------------------------------#
# 4. Classification test -----
#----------------------------------------------------------#

# 4.1 --- Koppen-Geiger classification - 5 levels
data_work %>%
  dplyr::mutate(
    sel_classification = ecozone_koppen_5
  ) %>%
  get_classification_map_by_continent()

# 4.2 --- Split the "cold" into:
# - Dry winter
# - Dry summer
# - without dry season
data_work %>%
  dplyr::mutate(
    sel_classification = dplyr::case_when(
      ecozone_koppen_5 == "Cold" ~ ecozone_koppen_15,
      .default = ecozone_koppen_5
    )
  ) %>%
  get_classification_map_by_continent()

# 4.3 --- Split the "cold - without dry season" into:
# - hot summer
# - warm summer
# - cold summer
data_work %>%
  dplyr::mutate(
    sel_classification = dplyr::case_when(
      ecozone_koppen_15 == "Cold_Without_dry_season" ~ ecozone_koppen_30,
      ecozone_koppen_5 == "Cold" ~ ecozone_koppen_15,
      .default = ecozone_koppen_5
    )
  ) %>%
  get_classification_map_by_continent()

I am very happy about this as we split it into manageable classes while only losing a few records in too small climate zones. Thoughts @Vivian-Felde, @SFlantua ?

Vivian-Felde commented 1 year ago

Based on how it looks in Europe and North America, I really like the 4.3 - to me that makes a lot more sense to break up with these smaller zones.

OndrejMottl commented 1 year ago

I am proposing 4.3. The others are just a way of how I got there.

Vivian-Felde commented 1 year ago

I agree with your proposal, to me it looks much improved from what we currently use.

OndrejMottl commented 1 year ago

Here is an overview of how the proposed classification (sel_classiifcation) lays in between KG5 and KG30:

SFlantua commented 1 year ago

I also like 4.3 most. However, all continents except LA have the clusters quite well spatially organized, but in LA the "temperate" cluster is all over the place. Is there anything that can be done there?

OndrejMottl commented 1 year ago

If we want to keep with the GH, then I do not think so. I try to also split the temperate and it does not help with LA. However, it may be a better solution for EU and NA

# 4.4 --- Split also the "temperate" into:
# - Dry winter
# - Dry summer
# - without dry season
data_work %>%
  dplyr::mutate(
    sel_classification = dplyr::case_when(
      ecozone_koppen_15 == "Cold_Without_dry_season" ~ ecozone_koppen_30,
      ecozone_koppen_5 == "Cold" ~ ecozone_koppen_15,
      ecozone_koppen_5 == "Temperate" ~ ecozone_koppen_15,
      .default = ecozone_koppen_5
    )
  ) %>%
  get_classification_map_by_continent()

SFlantua commented 1 year ago

Agree, it's not very helpful for LA. The whole climate zone spatial delimitation is not really working for LA. We will just have to go with it then and hope that it doesnt become a critical points for reviewers and/or readers.

OndrejMottl commented 1 year ago

@Vivian-Felde do you agree with the 4.4? Or should we stick with 4.3?

Vivian-Felde commented 1 year ago

I like 4.4 even better.

HOPE-UIB-BIO / PastHumanImpact

Climate zones: check the better data classification based on Koppen-Geiger #133