How to automate the script across all species

RetoSchmucki / rbms

A home for the latest rbms R package

https://retoschmucki.github.io/rbms/

Other

4 stars 4 forks source link

How to automate the script across all species #19

Open SarahVray opened 1 year ago

SarahVray commented 1 year ago

Dear Reto @RetoSchmucki,

I am trying to automate your script from https://retoschmucki.github.io/rbms/articles/Get_Started_1.html and https://retoschmucki.github.io/rbms/articles/Get_Started_2.html to run it across all species at once.

For this I am creating a "for" loop at each step of the script, e.g (with LUBMS_count = the count table): species.names <- unique(LUBMS_count[order(SPECIES), SPECIES]) for(i in 1:length(species.names)){ ts_season_count <- rbms::ts_monit_count_site(ts_season_visit, LUBMS_count, sp = species.names[i]) saveRDS(ts_season_count, file.path("K:/BIODIV/BIOLYS/3_Working area/WP2 - butterfly trends/Working directory/1-ALL transects/ts_season_count", paste(gsub(" ", "_", species.names[i]), "ts_season_count.rds", sep = "_"))) } This command works fine but not for all steps. For example, when it comes to plot the flight curve, the NAs from the field NM are causing an error message ending the loop, and I don't know how to handle this. for(i in 1:length(species.names)){ ts_flight_curve <- readRDS(file.path("K:/BIODIV/BIOLYS/3_Working area/WP2 - butterfly trends/Working directory/1-ALL transects/flight curve", paste(gsub(" ", "_", species.names[i]), "pheno.rds", sep = "_"))) pheno <- ts_flight_curve$pheno yr <- unique(pheno[order(M_YEAR), as.numeric(as.character(M_YEAR))]) if("trimWEEKNO" %in% names(pheno)){ plot(pheno[M_YEAR == yr[1], trimWEEKNO], pheno[M_YEAR == yr[1], NM], type = 'l', ylim = c(0, max(pheno[, NM])), xlab = 'Monitoring Week', ylab = 'Relative Abundance', lwd=3, main=paste(species.names[i], "flight curve")) } else { plot(pheno[M_YEAR == yr[1], trimDAYNO], pheno[M_YEAR == yr[1], NM], type = 'l', ylim = c(0, max(pheno[, NM])), xlab = 'Monitoring Day', ylab = 'Relative Abundance', lwd=3, main=paste(species.names[i], "flight curve")) } if(length(yr) > 1) { j <- 2 for(y in yr[-1]){ if("trimWEEKNO" %in% names(pheno)){ points(pheno[M_YEAR == y , trimWEEKNO], pheno[M_YEAR == y, NM], type = 'l', col = i, lwd=3) } else { points(pheno[M_YEAR == y, trimDAYNO], pheno[M_YEAR == y, NM], type = 'l', col = i, lwd=3) } j <- j + 1 }} legend('topright', legend = c(yr), col = c(seq_along(c(yr))), lty = 1, bty = 'o', lwd = 3, title="Year") } => Error in plot.window(...) : need finite 'ylim' values

There are two cases:

For species with only some years with no flight curve, then the script should ignore these years and plot only the years with a flight curve available. There would be a simple (but not optimal) solution, just by creating a new "pheno" where rows with NAs are removed (see issue #18 ).
For species where not a single year has a flight curve, then the script should just ignore this species and go to the next species. => this is where I am stuck.

I have this issue at three steps in the script:

To plot the annual flight curves
To calculate the species trend (from "workshop_functions.R", https://butterfly-monitoring.github.io/bms_workshop/species_trends.html)
To plot the annual collated indices with trend line

Would you know how to adapt the script to resolve this issue? I guess the statement "if next" would be helpful here but I have no clue on how to correctly apply it as my loop is built on species names and not on NM values, I cannot find examples on the web similar to the case here.

Thank you very much in advance for your help !

Sarah

RetoSchmucki commented 1 year ago

@SarahVray Here is a solution that will deal with both cases, using the function next() to skip when if(length(yr)<1) is TRUE. I also added some !is.na(NM) in the plot to avoid the error of the undefined y limits. You can use the same approach for all your loops.

yr <- unique(pheno[order(M_YEAR),][!is.na(NM), as.numeric(as.character(M_YEAR))])

if(length(yr) < 1) next()  ## skip species[i] and move to next

if("trimWEEKNO" %in% names(pheno)){
plot(unique(pheno[M_YEAR == yr[1], .(trimWEEKNO, NM)]), type = 'l', ylim = c(0, max(pheno[!is.na(NM), ][, NM])), xlab = 'Monitoring Week', ylab = 'Relative Abundance', lwd=3, main=paste(species.names[i], "flight curve"))
} else {
plot(unique(pheno[M_YEAR == yr[1], .(trimDAYNO, NM)]), type = 'l', ylim = c(0, max(pheno[!is.na(NM), ][, NM])), xlab = 'Monitoring Day', ylab = 'Relative Abundance', lwd=3, main=paste(species.names[i], "flight curve"))
}
if(length(yr) > 1) {
j <- 2
for(y in yr[-1]){
if("trimWEEKNO" %in% names(pheno)){
pointsunique(pheno[M_YEAR == y , .(trimWEEKNO, NM)]), type = 'l', col = i, lwd=3)
} else {
points(unique(pheno[M_YEAR == y , .(trimDAYKNO, NM)]), type = 'l', col = i, lwd=3)
}
j <- j + 1
}}
legend('topright', legend = c(yr), col = c(seq_along(c(yr))), lty = 1, bty = 'o', lwd = 3, title="Year")
}

SarahVray commented 1 year ago

Thank you very much @RetoSchmucki, it works for plotting the curves. However I still have an issue when calculating the species trends (from "workshop_functions.R", https://butterfly-monitoring.github.io/bms_workshop/species_trends.html). It efficiently skips species without any year available, but it gets stuck at some point, when it gets to the species Argynnis adippe_collated index.csv for which only the flight curve of 2021 could be fitted. Here is my code:

species.names <- unique(LUBMS_count[order(SPECIES), SPECIES])
for(i in 1:length(species.names)){
  co_index <- readRDS(file.path("K:/BIODIV/BIOLYS/3_Working area/WP2 - butterfly trends/Working directory/1-ALL transects/co_index_boot", paste( gsub(" ", "_", species.names[i]), "co_index_boot.rds", sep="_")))
  yr <- unique(co_index[order(M_YEAR),][!is.na(NSITE), as.numeric(as.character(M_YEAR))])
  if(length(yr) < 1) next()  ## skip species[i] and move to next

# Convert to log 10 collated indices (TRMOBS):
  co_index[, LOGDENSITY:= log(COL_INDEX)/log(10)][, TRMOBS := LOGDENSITY - mean(LOGDENSITY) + 2, by = .(BOOTi)]
  # add the trims information
 # Estimate and classify species trends with a CI based on the bootstraps:
sp_trend <- estimate_boot_trends(co_index)
trend_path <- file.path("K:","BIODIV","BIOLYS","3_Working area","WP2 - butterfly trends","Working directory","1-ALL transects", "trend",paste(species.names[i], "trend.csv", sep="_"))
write.csv(sp_trend, file=trend_path, row.names=T)
}

And the error message I get: Error in if (sum(trend_ci[, low] > 1) > 0) trend_ci[trend_ci[, low] > : missing value where TRUE/FALSE needed Called from: trend_class_func(trends_ci, tcol = "TrendClass_lt")

Is it because we need a minimum number of years (3 if I remember well) with a flight curve to be able to calculate the trend? EDIT: Actually I don't think the error message comes from this, because for Aporia crataegi I only have a flight curve for 2020 and the script still calculates a trend without an error message. But the difference is that for this species a COL_INDEX was calculated for each year (not a single null COL_INDEX), but for A. adippe COL_INDEX is null from 2010 to 2014.

RetoSchmucki commented 1 year ago

Hi @SarahVray Thank you for bringing this case to my attention. You are dealing with a species that was not recorded before it colonised your monitoring sites and probably started to spread. The result is a series of zeros followed by increasing counts.

If you run

co_index[, LOGDENSITY:= log(COL_INDEX)/log(10)][, TRMOBS := LOGDENSITY - mean(LOGDENSITY) + 2, by = .(BOOTi)]

You will have NA or -Inf in your TRMOBS, since log(0) = -Inf. There are several solutions, but the simplest and probably the most useful in calculating trends is to estimate the trend from the year of colonisation, removing all trailing zeros before colonisation. You need to be aware that the trend after a colonisation event will be remarkably high, simply because of the nature of the colonisation process. Some might say that you could add a small number to the zeros to avoid log(0) = -Inf. However, I think it makes more sense to exclude the trailing zeros and calculate the trend of an existing (established) population.

Here is your fix:

source("workshop_functions.R")

co_index <- data.table::fread("Argynnis.adippe_collated.index.csv")

## Only use COL_INDEX larger then 0 for calculation or logdensity and trmobs
co_index[COL_INDEX > 0, LOGDENSITY := log(COL_INDEX) / log(10)][
               COL_INDEX > 0, TRMOBS := LOGDENSITY - mean(LOGDENSITY, rm.na = TRUE) + 2, 
               by = .(BOOTi)]

## ==============
##  trend_func()
## find first and last year without zeros
## collind_df[!is.na(TRMOBS), min(M_YEAR)]
## collind_df[!is.na(TRMOBS), max(M_YEAR)]
## ==============

# Underlying function for (linear) trend calculation
trend_func <- function(collind_df, bycols = NULL, minyear = NULL, maxyear = NULL) {
    setDT(collind_df)
    if (is.null(minyear)) minyear <- collind_df[!is.na(TRMOBS), min(M_YEAR)]
    if (is.null(maxyear)) maxyear <- collind_df[!is.na(TRMOBS), max(M_YEAR)]
    if (!is.null(minyear)) collind_df <- collind_df[M_YEAR >= minyear]
    if (!is.null(maxyear)) collind_df <- collind_df[M_YEAR <= maxyear]

    if (nrow(collind_df) > 0) {
        # Fit linear model
        lm_obj <- try(lm(TRMOBS ~ M_YEAR, collind_df), silent = TRUE)

        trend_df <- data.frame(
            BOOTi = collind_df$BOOTi[1],
            data_from = collind_df[!is.na(TRMOBS), min(M_YEAR)],
            data_to = collind_df[!is.na(TRMOBS), max(M_YEAR)],
            data_nyears = length(collind_df$M_YEAR),
            minyear = minyear,
            maxyear = maxyear,
            n = length(minyear:maxyear),
            rate_lt = ifelse(class(lm_obj) != "try-error",
                exp(coef(lm_obj)[2] * 2.303), NA
            ),
            pc1_lt = ifelse(class(lm_obj) != "try-error",
                100 * (exp(coef(lm_obj)[2] * 2.303) - 1), NA
            ),
            pcn_lt = ifelse(class(lm_obj) != "try-error",
                100 * (exp(coef(lm_obj)[2] * 2.303)^(
                    maxyear - minyear) - 1), NA
            )
        )

        # Fit for last 10 years only if data have more than 10 years
        if (length(minyear:maxyear) > 10) {
            collind_df10 <- collind_df[M_YEAR >= (maxyear - 9)]
            lm_obj10 <- try(lm(TRMOBS ~ M_YEAR, collind_df10), silent = TRUE)

            trend_df$rate_10y <- ifelse(class(lm_obj10) != "try-error",
                exp(coef(lm_obj10)[2] * 2.303), NA
            )
            trend_df$pc1_10y <- ifelse(class(lm_obj10) != "try-error",
                100 * (exp(coef(lm_obj10)[2] * 2.303) - 1), NA
            )
            trend_df$pcn_10y <- ifelse(class(lm_obj10) != "try-error",
                100 * (exp(coef(lm_obj10)[2] * 2.303)^(
                    9) - 1), NA
            )
        }

        if (!is.null(bycols)) {
            for (i in 1:length(bycols)) {
                trend_df[, bycols[i]] <- collind_df[1, get(bycols[i])]
            }
        }

        return(trend_df)
    } else {
        return(NULL)
    }
}

## calculate trend
sp_trend <- estimate_boot_trends(co_index, minyear = 2017)

Try this out and let me know if it works. I will update workshop_functions.R to include this fix later this week.

SarahVray commented 1 year ago

Hi @RetoSchmucki, thanks very much for your reply. Indeed this species was not recorded before 2017 (not because it colonized our monitoring sites and started to spread but because we started new transects in dry grasslands in 2016 and then in 2021, so it is clearly a bias that we will have to fix). I ran your code (with the modification in workshop_function) and it seems to work, although it gives me an "uncertain" trend, potentially because 6 years are not enough (?):

  data_from data_to data_nyears minyear maxyear n nboot_lt  rate_lt rate_lt_low rate_lt_upp   pcn_lt TrendClass_lt
0      2017    2022           6    2017    2022 6     1000 1.139927   0.7021053     19.6853 92.47969     Uncertain

But then as it creates NAs in co_index, the rest of the script to plot the species log collated index with bootstrapped 95% CI and linear trend line is not working anymore so I guess I will need to remove the NAs somehow.

RetoSchmucki commented 1 year ago

@SarahVray my mistake, I copied a code with 2017 as the first year sp_trend <- estimate_boot_trends(co_index, minyear = 2017) it should be sp_trend <- estimate_boot_trends(co_index)

and for the plot, you will need to add na.rm = TRUE for the mean and quantile, something like this

co_index0_mean <- mean(co_index[BOOTi == 0]$LOGDENSITY, na.rm = TRUE)
# Derive interval from quantiles
co_index_ci <- merge(co_index[BOOTi == 0, .(M_YEAR, TRMOBS)],
    co_index[BOOTi != 0, .(
        LOWER = quantile(LOGDENSITY - co_index0_mean + 2, 0.025, na.rm = TRUE),
        UPPER = quantile(LOGDENSITY - co_index0_mean + 2, 0.975, na.rm = TRUE)
    ), by = .(M_YEAR)],
    by = c("M_YEAR")
)

spp <- unique(co_index[,SPECIES])

ggplot(co_index_ci, aes(M_YEAR, TRMOBS)) +
    theme(text = element_text(size = 14)) +
    geom_line() +
    geom_point() +
    geom_ribbon(aes(ymin = LOWER, ymax = UPPER), alpha = .3) +
    geom_smooth(method = "lm", se = FALSE, color = "red") +
    xlab("Year") +
    ylab(expression("log "["(10)"] * " Collated Index")) +
    ggtitle(paste("Collated index for", gsub("_", " ", spp)))

Looking at this figure, you see why the trend from 2017 to 2022 was classified as uncertain.

SarahVray commented 1 year ago

@RetoSchmucki, thanks it works :-). I get:

  data_from data_to data_nyears minyear maxyear n nboot_lt  rate_lt rate_lt_low rate_lt_upp     pcn_lt   TrendClass_lt
0      2015    2022           8    2015    2022 8     1000 13.68359    5.504991    34.75508 8982546761 Strong increase

Yes indeed I can see the plot now. Thanks again!

SarahVray commented 12 months ago

Hi @RetoSchmucki, testing this new code on a bigger dataset, I get another error message at the same step, when estimating the species trends with:

source("workshop_functions.R")
species.names <- unique(LUBMS_count[order(SPECIES), SPECIES])
for(i in 1:length(species.names)){
  co_index <- readRDS(file.path("K:/BIODIV/BIOLYS/3_Working area/WP2 - butterfly trends/Working directory/1-ALL transects/co_index_boot", paste( gsub(" ", "_", species.names[i]), "co_index_boot.rds", sep="_")))
  # BOOTi: 0 when raw data

  yr <- unique(co_index[order(M_YEAR),][!is.na(NSITE), as.numeric(as.character(M_YEAR))])
  if(length(yr) < 1) next()  ## skip species[i] and move to next

  ## Only use COL_INDEX larger then 0 for calculation of logdensity and trmobs
  co_index[COL_INDEX > 0, LOGDENSITY := log(COL_INDEX) / log(10)][
    COL_INDEX > 0, TRMOBS := LOGDENSITY - mean(LOGDENSITY, rm.na = TRUE) + 2, 
    by = .(BOOTi)]

  # Estimate and classify species trends with a CI based on the bootstraps:
  # fitting all the regressions of bootstraps separately and aggregate them
  sp_trend <- estimate_boot_trends(co_index)

  trend_path <- file.path("K:","BIODIV","BIOLYS","3_Working area","WP2 - butterfly trends","Working directory","1-ALL transects", "trend",paste(species.names[i], "trend.csv", sep="_"))
  write.csv(sp_trend, file=trend_path, row.names=T)

# Plot the species log collated index with bootstrapped 95% CI and linear trend line (in red):
# Calculate mean log index for original data
co_index0_mean <- mean(co_index[BOOTi == 0]$LOGDENSITY, na.rm = TRUE)
# Derive interval from quantiles
co_index_ci <- merge(co_index[BOOTi == 0, .(M_YEAR, TRMOBS)],
                     co_index[BOOTi != 0, .(LOWER = quantile(LOGDENSITY - co_index0_mean + 2, 0.025, na.rm = TRUE),
                                            UPPER = quantile(LOGDENSITY - co_index0_mean + 2, 0.975, na.rm = TRUE)), 
                     by = .(M_YEAR)],
                     by=c("M_YEAR"))

# Save plot as jpeg:
collated_index_path <- file.path("K:","BIODIV","BIOLYS","3_Working area","WP2 - butterfly trends","Working directory","1-ALL transects","collated index","Plot version 2",paste(species.names[i], "collated index.jpg", sep="_"))
jpeg(file=collated_index_path, width=800, height=600,units="px",bg="white",res=100,pointsize = 12)
col_plot <- ggplot(co_index_ci, aes(M_YEAR, TRMOBS))+
  theme(text = element_text(size = 14))+
  geom_line()+
  geom_point()+
  geom_ribbon(aes(ymin = LOWER, ymax = UPPER), alpha = .3)+
  geom_smooth(method="lm", se=FALSE, color="red")+
  xlab("Year")+ylab(expression('log '['(10)']*' Collated Index'))+
  ggtitle(paste("Collated index for", gsub("_", " ", species.names[i]), "in LUBMS"))
print(col_plot + scale_x_continuous(limits = c(2010, 2022), n.breaks=7))
dev.off()
}

And with the new code you made and that I copy-pasted in workshop_functions.R. It stops at A. aglaja (for which a flight curve could be computed only in 2022): Argynnis aglaja_collated index.csv Argynnis aglaja_pheno.csv

I get this error message:

Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match
In addition: Warning messages:
1: Removed 5 rows containing non-finite values (stat_smooth). 
2: Removed 5 row(s) containing missing values (geom_path). 
3: Removed 5 rows containing missing values (geom_point).

Do you see where it could come from? Thank you again for your help.

RetoSchmucki commented 11 months ago

Hi @SarahVray I was away and only started investigating this today. I fixed the issue you had above and that was caused by a different number of years across bootstrap iterations. I fixed the function, and it is available in workshop_functions.R. Fingers crossed!

SarahVray commented 10 months ago

Hi @RetoSchmucki, Sorry for my late reply, I was away for a long time... Thank you very much for your reply and the fix. There is still two "weird" things for some species in some years:

Sometimes there is no collated index at all shown on the graph (no point and no confidence interval), despite the presence of occurrences. Here is the example of C. argiolus: For 2017, the collated index is always equal to 0 in the output file but there are occurrences for that year: Celastrina argiolus_collated index.csv
Sometimes there is a point (collated index) but no confidence interval (here in 2022):

Do you know where these issues could come from? Thanks in advance for your help.

RetoSchmucki commented 8 months ago

I don't know what is causing these issues but will have a look and try to reply soon.