ESHackathon / CiteSource

http://www.eshackathon.org/CiteSource/
GNU General Public License v3.0

Inaccurate sensitivity calculation on phase_counts table #150

Closed: TNRiley closed this issue 1 year ago

TNRiley commented 1 year ago

Sensitivity should be calculated as the number of final included records found by a source divided by the total number of final included records, multiplied by 100. For CAB, for example, that is 4/242 × 100 = 1.65.

However, the sensitivity is currently calculated (again using CAB as an example) as 4/363 × 100 = 1.10. This happens because the function sums the final included counts across all sources (0 + 4 + 3 + 4 + 162 + 67 + 123 = 363) and uses that sum as the denominator.

This is incorrect because the final included count in each row includes records that were found in multiple databases, so the sum of the rows overstates the true total. The total shown in the table is actually pulled from elsewhere, not from this sum.
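The arithmetic can be checked with a short R sketch. It uses only the figures quoted in this issue; CAB is the only source named here, so the other labels (and CAB's position among them) are hypothetical placeholders.

```r
# Per-source "final" counts quoted in the issue. Only CAB (4) is named there;
# the remaining labels are hypothetical placeholders for illustration.
final_by_source <- c(source_a = 0, CAB = 4, source_c = 3, source_d = 4,
                     source_e = 162, source_f = 67, source_g = 123)

total_included <- 242                  # true number of final included records
summed_final <- sum(final_by_source)   # double-counts records shared across sources

# Current (incorrect) calculation: denominator is the sum of the rows
recall_wrong <- round(final_by_source[["CAB"]] / summed_final * 100, 2)

# Intended calculation: denominator is the true total of included records
recall_right <- round(final_by_source[["CAB"]] / total_included * 100, 2)

print(summed_final)   # 363
print(recall_wrong)   # 1.1
print(recall_right)   # 1.65
```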

An example can be seen in the "source analysis across screening phases" vignette (screenshot: captures_chrome-capture-2023-5-8).

TNRiley commented 1 year ago

I believe I've fixed the function, but want to confirm with Alison and the team before implementing it.


calculate_phase_count <- function(unique_citations, citations, db_colname) {

  count_source_phase <- function(source_phase_df, db_colname) {
    source_phase_df <- source_phase_df %>%
      tidyr::separate_rows(!!rlang::sym(db_colname), sep = ",") %>%
      tidyr::separate_rows(cite_label, sep = ",") %>%
      dplyr::mutate(!!rlang::sym(db_colname) := stringr::str_trim(!!rlang::sym(db_colname)),
                    cite_label = stringr::str_trim(cite_label)) %>%
      dplyr::filter(!!rlang::sym(db_colname) != "unknown") %>%
      dplyr::mutate(screened = ifelse(cite_label == "screened", 1, 0),
                    final = ifelse(cite_label == "final", 1, 0)) %>%
      dplyr::group_by(!!rlang::sym(db_colname)) %>%
      dplyr::summarise(screened = sum(screened),
                       final = sum(final),
                       .groups = "drop") %>%
      dplyr::rename(Source = !!rlang::sym(db_colname))

    return(source_phase_df)
  }

  source_phase <- count_source_phase(unique_citations, db_colname)

  # count_sources() is not defined in this snippet; it is assumed to return
  # one row per source together with its distinct-record count
  distinct_count <- count_sources(unique_citations, db_colname)
  colnames(distinct_count) <- c("Source", "Distinct Records")

  distinct_count$`Distinct Records` <- as.numeric(distinct_count$`Distinct Records`)
  distinct_count$Source <- as.character(distinct_count$Source)

  combined_counts <- dplyr::left_join(distinct_count, source_phase, by = "Source")
  combined_counts[is.na(combined_counts)] <- 0

  combined_counts <- combined_counts %>%
    dplyr::mutate(Precision = ifelse(`Distinct Records` != 0, round((final / `Distinct Records`) * 100, 2), 0))

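  # Denominator for recall: the total number of final included records, taken
  # from the citation labels rather than by summing the per-source counts,
  # which would double-count records found in more than one source
  total_final <- sum(citations$cite_label == "final", na.rm = TRUE)

  # Recall per source, computed for the whole column at once
  combined_counts$Recall <- round((combined_counts$final / total_final) * 100, 2)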

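  # Totals row; c() coerces every element to character, so this row is text
  totals <- c("Total",
              sum(combined_counts$`Distinct Records`, na.rm = TRUE),
              sum(citations$cite_label == "screened", na.rm = TRUE),
              total_final, # record-level total, not the sum of the per-source counts
              "-",
              "-")
  combined_counts <- rbind(combined_counts, totals)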

  return(combined_counts)
}
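To illustrate why the denominator has to come from a record-level count, here is a minimal base-R sketch on toy data. The data frame, the `cite_source` column name, and the source labels A, B, C are all made up for illustration and are not taken from CiteSource; the sketch only mirrors the idea of splitting a comma-separated source column and comparing the two candidate denominators.

```r
# Toy de-duplicated citations (hypothetical data): each row is one unique
# record, and cite_source lists every source that returned it.
unique_citations <- data.frame(
  cite_source = c("A, B", "A", "B, C", "C"),
  cite_label  = c("final", "final", "final", "screened"),
  stringsAsFactors = FALSE
)

# Keep only records included at the final stage
final_records <- unique_citations[unique_citations$cite_label == "final", ]

# Split the comma-separated source lists and count final records per source
sources_per_record <- strsplit(final_records$cite_source, ",\\s*")
final_by_source <- table(unlist(sources_per_record))

total_final <- nrow(final_records)     # 3 unique final records
summed_final <- sum(final_by_source)   # 5, because shared records count twice

# Recall per source with the record-level denominator
recall <- round(as.numeric(final_by_source) / total_final * 100, 2)
names(recall) <- names(final_by_source)
print(recall)
```

With three unique final records, summing the per-source counts gives 5, so using that sum as the denominator would understate every source's recall, which is the same effect reported above.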