Adding climatic properties to summary data

rdstern commented 5 months ago

@MeSophie the new Walter-Leith dialog (and others) would be easier to use if summary climatic data kept the information of what it was. Here is the daily data from Dodoma:

Now when there are summaries I suggest the default summary should keep the same name. So when we get monthly or annual mean for temp_max, the resulting summary variable is still temp_max. Same for temp_min. And the monthly or annual rainfall total is still of type rain. And so is the mean of the monthly rainfall totals
Other summaries could add that summary. So min_temp_min for the minimum of temp_min, etc. Also max_rain. When we get percentiles, I suggest we can simply have 10_rain, 25_rain. (That will faciliate the new seasonal graph.)

If that is all getting complicated, then just keep the names of the default variables for now.

rdstern commented 5 months ago

@MeSophie are you very busy with other things. Adding this, before too long, should be quick I hope and will make some of your other climatic dialogs run more smoothly?

MeSophie commented 5 months ago

@rdstern I'm on this issue since unfortunately it is not that easy. I realize that the name of the result is defined in the calculate_summary function, in general for all statistical operations. For example, changing max_tmax to tmax will change max_size to size. And in this case I wonder how the user will be able to easily detect the minimum, maximum, average median column since they will all have the same name. I could be wrong, which is why I would also like to hear @lilyclements point of view, as she understands the code better than I do.

DataBook$set("public", "calculate_summary", function(data_name, columns_to_summarise = NULL, summaries, factors = c(), store_results = TRUE, drop = TRUE, return_output = FALSE, summary_name = NA, result_names = NULL, percentage_type = "none", perc_total_columns = NULL, perc_total_factors = c(), perc_total_filter = NULL, perc_decimal = FALSE, perc_return_all = FALSE, include_counts_with_percentage = FALSE, silent = FALSE, additional_filter, original_level = FALSE, signif_fig = 2, sep = "_", ...) {
  if(original_level) type <- "calculation"
  else type <- "summary"
  include_columns_to_summarise <- TRUE
  if(is.null(columns_to_summarise) || length(columns_to_summarise) == 0) {
    # temporary fix for doing counts of a data frame
    # dplyr cannot count data frame groups without passing a column (https://stackoverflow.com/questions/44217265/passing-correct-data-frame-from-within-dplyrsummarise)
    # This is a known issue (https://github.com/tidyverse/dplyr/issues/2752)
    if(length(summaries) != 1 || summaries != count_label) {
      mes <- "When there are no columns to summarise can only use count function as summary"
      if(silent) {
        warning(mes, "Continuing summaries by using count only.")
        columns_to_summarise <- self$get_column_names(data_name)[1]
        summaries <- count_label
      }
      else {
        stop(mes)
      }
    }
    else columns_to_summarise <- self$get_column_names(data_name)[1]
    include_columns_to_summarise <- FALSE
  }
  if(!percentage_type %in% c("none", "factors", "columns", "filter")) stop("percentage_type: ", percentage_type, " not recognised.")
  if(percentage_type == "columns") {
    if(!(length(perc_total_columns) == 1 || length(perc_total_columns) == length(columns_to_summarise))) stop("perc_total_columns must either be of length 1 or the same length as columns_to_summarise")
  }
  if(!store_results) save <- 0
  else save <- 2
  summaries_display <- as.vector(sapply(summaries, function(x) ifelse(startsWith(x, "summary_"), substring(x, 9), x)))
  if(percentage_type == "factors") {
    manip_factors <- intersect(factors, perc_total_factors)
  }
  else manip_factors <- factors
  if(length(manip_factors) > 0) {
    calculated_from <- as.list(manip_factors)
    names(calculated_from) <- rep(data_name, length(manip_factors))
    calculated_from <- as.list(calculated_from)
    factor_by <- instat_calculation$new(type = "by", calculated_from = calculated_from)
    manipulations <- list(factor_by)
  }
  else manipulations <- list()
  if(percentage_type == "factors") {
    value_factors <- setdiff(factors, manip_factors)
    if(length(value_factors) > 0) {
      calculated_from <- as.list(value_factors)
      names(calculated_from) <- rep(data_name, length(value_factors))
      calculated_from <- as.list(calculated_from)
      factor_by <- instat_calculation$new(type = "by", calculated_from = calculated_from)
      value_manipulations <- list(factor_by)
    }
    else value_manipulations <- list()
  }
  sub_calculations <- list()

  i <- 0
  for(column_names in columns_to_summarise) {
    i <- i + 1
    # In the case of counting without columns, the first column column will be the "calculated from"
    # which will add unwanted column metadata
    calculated_from <- list(column_names)
    names(calculated_from) <- rep(data_name, length(calculated_from))
    j <- 0
    for(summary_type in summaries) {
      j <- j + 1
      function_exp <- ""
      # if(!is.null(weights)) {
      #   function_exp <- paste0(function_exp, ", weights = ", weights)
      # }
      extra_args <- list(...)
      for(i in seq_along(extra_args)) {
        function_exp <- paste0(function_exp, ", ", names(extra_args)[i], " = ", extra_args[i])
      }
      function_exp <- paste0(function_exp, ")")
      # function_exp <- paste0(function_exp, ", na.rm =", na.rm, ")")
      if(is.null(result_names)) {
        result_name = summaries_display[j]
        if(include_columns_to_summarise) result_name = paste0(result_name, sep, column_names)
      }
      #TODO result_names could be horizontal/vertical vector, matrix or single value
      else result_name <- result_names[i,j]
      if(percentage_type == "none") {
        summary_function_exp <- paste0(summary_type, "(x = ", column_names, function_exp)
        summary_calculation <- instat_calculation$new(type = type, result_name = result_name,
                                                      function_exp = summary_function_exp,
                                                      calculated_from = calculated_from, save = save)
      }
      else {
        values_calculation <- instat_calculation$new(type = type, result_name = result_name,
                                                     function_exp = paste0(summary_type, "(x = ", column_names, function_exp),
                                                     calculated_from = calculated_from, save = save)
        if(percentage_type == "columns") {
          if(length(perc_total_columns) == 1) perc_col_name <- perc_total_columns
          else perc_col_name <- perc_total_columns[i]
          totals_calculation <- instat_calculation$new(type = type, result_name = paste0(summaries_display[j], sep, perc_total_columns, "_totals"),
                                                       function_exp = paste0(summary_type, "(x = ", perc_col_name, function_exp),
                                                       calculated_from = calculated_from, save = save)
        }
        else if(percentage_type == "filter") {
          #TODO
        }
        else if(percentage_type == "factors") {
          values_calculation$manipulations <- value_manipulations
          totals_calculation <- instat_calculation$new(type = "summary", result_name = paste0(result_name, "_totals"),
                                                       function_exp = paste0(summary_type, "(x = ", column_names, function_exp),
                                                       calculated_from = calculated_from, save = save)
        }
        function_exp <- paste0(values_calculation$result_name, "/", totals_calculation$result_name)
        if(!perc_decimal) {
          function_exp <- paste0("(", function_exp, ") * 100")
        }
        perc_result_name <- paste0("perc_", result_name)
        summary_calculation <- instat_calculation$new(type = "calculation", result_name = perc_result_name,
                                                      function_exp = function_exp,
                                                      calculated_from = list(), save = save, sub_calculations = list(totals_calculation, values_calculation))
      }
      sub_calculations[[length(sub_calculations) + 1]] <- summary_calculation
    }
  }
  if(self$filter_applied(data_name)) {
    curr_filter <- self$get_current_filter(data_name)
    curr_filter_name <- curr_filter[["name"]]
    curr_filter_calc <- self$get_filter_as_instat_calculation(data_name, curr_filter_name)
    manipulations <- c(curr_filter_calc, manipulations)
  }
  if(!missing(additional_filter)) {
    manipulations <- c(additional_filter, manipulations)
  }
  combined_calc_sum <- instat_calculation$new(type="combination", sub_calculations = sub_calculations, manipulations = manipulations)
  out <- self$apply_instat_calculation(combined_calc_sum)
  # relocate so that the factors are first still for consistency    
  if (percentage_type != "none"){   
    out$data <- (out$data %>% dplyr::select(c(tidyselect::all_of(factors), tidyselect::all_of(manip_factors)), tidyselect::everything()))   
  }
  if(return_output) {
    dat <- out$data
    if(percentage_type == "none" || perc_return_all) return(out$data)
    else {
      #This is a temp fix to only returning final percentage columns.
      #Depends on result name format used above for summary_calculation in percentage case
      if (percentage_type != "none" && include_counts_with_percentage){
        dat <- dat %>% dplyr::mutate(dplyr::across(where(is.numeric), round, signif_fig))
        dat <- dat %>% dplyr::mutate(perc_count = paste0(count, " (", perc_count, "%)")) %>% dplyr::select(-c("count", "count_totals"))
      } else {
        dat[c(which(names(dat) %in% factors), which(startsWith(names(dat), "perc_")))]
      }
    }
  }
}
)

lilyclements commented 5 months ago

@rdstern I see. I just want to check I understand.

If we do a summary of a variable, you want us to still have that "Climatic Type" holding for the metadata? E.g.,

Name	Label	Class	Climatic Type
sum_rain		numeric	rain

@MeSophie Good exploration of the summary_calculation function. If what I've posted above is the case, then I don't think the default name of it should affect the type. Instead this is a change in the metadata?

MeSophie commented 5 months ago

If we do a summary of a variable, you want us to still have that "Climatic Type" holding for the metadata

@lilyclements Great I also think this way is better than changing the default name.

rdstern commented 5 months ago

@MeSophie and @lilyclements just to add that Station and year etc remain climatic in summary data. I was hoping the climatic summaries (sum_rain, mean_tmax) could do the same.

MeSophie commented 4 months ago

@lilyclements Sorry I am coming back to you with this problem. After several investigations, I still don't know how to add climatic properties to the summary data, even passing through the metadata. Your help here (R code or any guidance) will be most welcome. Thank you.

MeSophie commented 3 months ago

Please @lilyclements anything new about this isssue?

lilyclements commented 2 months ago

If I'm understanding correctly, I think this requires some larger changes in the metadata side of things as it turns out. I'm not sure where we want to have this on the priority list?

IDEMSInternational / R-Instat

Adding climatic properties to summary data #8934