b-cubed-eu / indicator-uncertainty

Scripts to explore calculation, interpretation and visualisation of uncertainty related to indicators based on biodiversity data cubes
MIT License
0 stars 0 forks source link

LOOCV for species #15

Open wlangera opened 11 hours ago

wlangera commented 11 hours ago

In this test script, we have (at least) one species that heavily influences the indicator (eveness). We can calculate a measure for each species of how much it influences the indicator. Like Leave-One-Out Cross-Validation (LOOCV), but where we leave out one species each time.

We need to think of a good error measure:

Here is some example code to do the Leave-One-Species-Out Cross-Validation. We see that species 3 has a big influence on indicator calculation.

# Load necessary libraries
library(dplyr)
library(ggplot2)
library(purrr)

# Create a more extreme example dataframe with species, year, grid_cell, and value (presence/absence)
# One species (species3) will have consistently higher values
df <- data.frame(
  species = c("species1", "species1", "species2", "species2", "species3", "species3", 
              "species1", "species2", "species3"),
  year = c(2001, 2002, 2001, 2002, 2001, 2002, 2003, 2003, 2003),
  grid_cell = c(1, 1, 2, 2, 3, 3, 1, 2, 3),
  value = c(1, 0, 1, 1, 10, 8, 1, 0, 15)  # Species3 has much higher values (10, 8, 15)
)

# Sample indicator calculation function
calculate_indicator <- function(data) {
  # For example: sum the presence/abundance values for each year
  aggregate(value ~ year, data = data, FUN = sum)
}

# Calculate the "true" indicator (based on all species)
true_indicator <- calculate_indicator(df) %>%
  rename("true_value" = "value")

# Function to perform leave-one-species-out cross-validation
leave_one_species_out_cv <- function(df, f, true_indicator) {
  species_list <- unique(df$species)

  results <- map(species_list, function(species_left_out) {
    # Filter out the data for the species to leave out
    df_subset <- df %>% filter(species != species_left_out)

    # Calculate the indicator for the remaining data
    indicator_result <- f(df_subset) %>%
      rename("indicator_value" = "value")

    # Merge with the true indicator to enable comparison
    comparison_result <- indicator_result %>%
      left_join(true_indicator, by = "year") %>%
      mutate(difference = true_value - indicator_value)

    # Return the result with the species left out
    cbind(species_left_out, comparison_result)
  })

  out_df <- do.call(rbind.data.frame, results) %>%
    arrange(year, species_left_out)

  return(out_df)
}

# Perform leave-one-species-out cross-validation and compare with the true value
cv_results <- leave_one_species_out_cv(df, calculate_indicator, true_indicator)

# Print the results, including the comparison
cv_results
#>   species_left_out year indicator_value true_value difference
#> 1         species1 2001              11         12          1
#> 2         species2 2001              11         12          1
#> 3         species3 2001               2         12         10
#> 4         species1 2002               9          9          0
#> 5         species2 2002               8          9          1
#> 6         species3 2002               1          9          8
#> 7         species1 2003              15         16          1
#> 8         species2 2003              16         16          0
#> 9         species3 2003               1         16         15

# Visualise the results
ggplot(cv_results,
       aes(x = as.factor(year), y = difference, colour = species_left_out)) +
  geom_point(position = position_dodge(width = 0.5), size = 3) +
  labs(x = "Year", y = "Error measure", colour = "Species",
       title = "Influence of Each Species on the Indicator") +
  theme_minimal()

Created on 2024-10-23 with reprex v2.1.1