Open rdstern opened 5 months ago
@MeSophie are you very busy with other things. Adding this, before too long, should be quick I hope and will make some of your other climatic dialogs run more smoothly?
@rdstern I'm on this issue since unfortunately it is not that easy. I realize that the name of the result is defined in the calculate_summary function
, in general for all statistical operations. For example, changing max_tmax
to tmax
will change max_size
to size
. And in this case I wonder how the user will be able to easily detect the minimum, maximum, average median column since they will all have the same name. I could be wrong, which is why I would also like to hear @lilyclements point of view, as she understands the code better than I do.
DataBook$set("public", "calculate_summary", function(data_name, columns_to_summarise = NULL, summaries, factors = c(), store_results = TRUE, drop = TRUE, return_output = FALSE, summary_name = NA, result_names = NULL, percentage_type = "none", perc_total_columns = NULL, perc_total_factors = c(), perc_total_filter = NULL, perc_decimal = FALSE, perc_return_all = FALSE, include_counts_with_percentage = FALSE, silent = FALSE, additional_filter, original_level = FALSE, signif_fig = 2, sep = "_", ...) {
if(original_level) type <- "calculation"
else type <- "summary"
include_columns_to_summarise <- TRUE
if(is.null(columns_to_summarise) || length(columns_to_summarise) == 0) {
# temporary fix for doing counts of a data frame
# dplyr cannot count data frame groups without passing a column (https://stackoverflow.com/questions/44217265/passing-correct-data-frame-from-within-dplyrsummarise)
# This is a known issue (https://github.com/tidyverse/dplyr/issues/2752)
if(length(summaries) != 1 || summaries != count_label) {
mes <- "When there are no columns to summarise can only use count function as summary"
if(silent) {
warning(mes, "Continuing summaries by using count only.")
columns_to_summarise <- self$get_column_names(data_name)[1]
summaries <- count_label
}
else {
stop(mes)
}
}
else columns_to_summarise <- self$get_column_names(data_name)[1]
include_columns_to_summarise <- FALSE
}
if(!percentage_type %in% c("none", "factors", "columns", "filter")) stop("percentage_type: ", percentage_type, " not recognised.")
if(percentage_type == "columns") {
if(!(length(perc_total_columns) == 1 || length(perc_total_columns) == length(columns_to_summarise))) stop("perc_total_columns must either be of length 1 or the same length as columns_to_summarise")
}
if(!store_results) save <- 0
else save <- 2
summaries_display <- as.vector(sapply(summaries, function(x) ifelse(startsWith(x, "summary_"), substring(x, 9), x)))
if(percentage_type == "factors") {
manip_factors <- intersect(factors, perc_total_factors)
}
else manip_factors <- factors
if(length(manip_factors) > 0) {
calculated_from <- as.list(manip_factors)
names(calculated_from) <- rep(data_name, length(manip_factors))
calculated_from <- as.list(calculated_from)
factor_by <- instat_calculation$new(type = "by", calculated_from = calculated_from)
manipulations <- list(factor_by)
}
else manipulations <- list()
if(percentage_type == "factors") {
value_factors <- setdiff(factors, manip_factors)
if(length(value_factors) > 0) {
calculated_from <- as.list(value_factors)
names(calculated_from) <- rep(data_name, length(value_factors))
calculated_from <- as.list(calculated_from)
factor_by <- instat_calculation$new(type = "by", calculated_from = calculated_from)
value_manipulations <- list(factor_by)
}
else value_manipulations <- list()
}
sub_calculations <- list()
i <- 0
for(column_names in columns_to_summarise) {
i <- i + 1
# In the case of counting without columns, the first column column will be the "calculated from"
# which will add unwanted column metadata
calculated_from <- list(column_names)
names(calculated_from) <- rep(data_name, length(calculated_from))
j <- 0
for(summary_type in summaries) {
j <- j + 1
function_exp <- ""
# if(!is.null(weights)) {
# function_exp <- paste0(function_exp, ", weights = ", weights)
# }
extra_args <- list(...)
for(i in seq_along(extra_args)) {
function_exp <- paste0(function_exp, ", ", names(extra_args)[i], " = ", extra_args[i])
}
function_exp <- paste0(function_exp, ")")
# function_exp <- paste0(function_exp, ", na.rm =", na.rm, ")")
if(is.null(result_names)) {
result_name = summaries_display[j]
if(include_columns_to_summarise) result_name = paste0(result_name, sep, column_names)
}
#TODO result_names could be horizontal/vertical vector, matrix or single value
else result_name <- result_names[i,j]
if(percentage_type == "none") {
summary_function_exp <- paste0(summary_type, "(x = ", column_names, function_exp)
summary_calculation <- instat_calculation$new(type = type, result_name = result_name,
function_exp = summary_function_exp,
calculated_from = calculated_from, save = save)
}
else {
values_calculation <- instat_calculation$new(type = type, result_name = result_name,
function_exp = paste0(summary_type, "(x = ", column_names, function_exp),
calculated_from = calculated_from, save = save)
if(percentage_type == "columns") {
if(length(perc_total_columns) == 1) perc_col_name <- perc_total_columns
else perc_col_name <- perc_total_columns[i]
totals_calculation <- instat_calculation$new(type = type, result_name = paste0(summaries_display[j], sep, perc_total_columns, "_totals"),
function_exp = paste0(summary_type, "(x = ", perc_col_name, function_exp),
calculated_from = calculated_from, save = save)
}
else if(percentage_type == "filter") {
#TODO
}
else if(percentage_type == "factors") {
values_calculation$manipulations <- value_manipulations
totals_calculation <- instat_calculation$new(type = "summary", result_name = paste0(result_name, "_totals"),
function_exp = paste0(summary_type, "(x = ", column_names, function_exp),
calculated_from = calculated_from, save = save)
}
function_exp <- paste0(values_calculation$result_name, "/", totals_calculation$result_name)
if(!perc_decimal) {
function_exp <- paste0("(", function_exp, ") * 100")
}
perc_result_name <- paste0("perc_", result_name)
summary_calculation <- instat_calculation$new(type = "calculation", result_name = perc_result_name,
function_exp = function_exp,
calculated_from = list(), save = save, sub_calculations = list(totals_calculation, values_calculation))
}
sub_calculations[[length(sub_calculations) + 1]] <- summary_calculation
}
}
if(self$filter_applied(data_name)) {
curr_filter <- self$get_current_filter(data_name)
curr_filter_name <- curr_filter[["name"]]
curr_filter_calc <- self$get_filter_as_instat_calculation(data_name, curr_filter_name)
manipulations <- c(curr_filter_calc, manipulations)
}
if(!missing(additional_filter)) {
manipulations <- c(additional_filter, manipulations)
}
combined_calc_sum <- instat_calculation$new(type="combination", sub_calculations = sub_calculations, manipulations = manipulations)
out <- self$apply_instat_calculation(combined_calc_sum)
# relocate so that the factors are first still for consistency
if (percentage_type != "none"){
out$data <- (out$data %>% dplyr::select(c(tidyselect::all_of(factors), tidyselect::all_of(manip_factors)), tidyselect::everything()))
}
if(return_output) {
dat <- out$data
if(percentage_type == "none" || perc_return_all) return(out$data)
else {
#This is a temp fix to only returning final percentage columns.
#Depends on result name format used above for summary_calculation in percentage case
if (percentage_type != "none" && include_counts_with_percentage){
dat <- dat %>% dplyr::mutate(dplyr::across(where(is.numeric), round, signif_fig))
dat <- dat %>% dplyr::mutate(perc_count = paste0(count, " (", perc_count, "%)")) %>% dplyr::select(-c("count", "count_totals"))
} else {
dat[c(which(names(dat) %in% factors), which(startsWith(names(dat), "perc_")))]
}
}
}
}
)
@rdstern I see. I just want to check I understand.
If we do a summary of a variable, you want us to still have that "Climatic Type" holding for the metadata? E.g.,
Name | Label | Class | Climatic Type |
---|---|---|---|
sum_rain | numeric | rain |
@MeSophie Good exploration of the summary_calculation
function. If what I've posted above is the case, then I don't think the default name of it should affect the type. Instead this is a change in the metadata?
If we do a summary of a variable, you want us to still have that "Climatic Type" holding for the metadata
@lilyclements Great I also think this way is better than changing the default name.
@MeSophie and @lilyclements just to add that Station and year etc remain climatic in summary data. I was hoping the climatic summaries (sum_rain, mean_tmax) could do the same.
@lilyclements Sorry I am coming back to you with this problem. After several investigations, I still don't know how to add climatic properties to the summary data, even passing through the metadata. Your help here (R code or any guidance) will be most welcome. Thank you.
Please @lilyclements anything new about this isssue?
If I'm understanding correctly, I think this requires some larger changes in the metadata side of things as it turns out. I'm not sure where we want to have this on the priority list?
@MeSophie the new Walter-Leith dialog (and others) would be easier to use if summary climatic data kept the information of what it was. Here is the daily data from Dodoma:
Now when there are summaries I suggest the default summary should keep the same name. So when we get monthly or annual mean for temp_max, the resulting summary variable is still temp_max. Same for temp_min. And the monthly or annual rainfall total is still of type rain. And so is the mean of the monthly rainfall totals
Other summaries could add that summary. So min_temp_min for the minimum of temp_min, etc. Also max_rain. When we get percentiles, I suggest we can simply have 10_rain, 25_rain. (That will faciliate the new seasonal graph.)
If that is all getting complicated, then just keep the names of the default variables for now.