darwin-eu / CDMConnector

A pipe friendly way to interact with an OMOP Common Data Model
https://darwin-eu.github.io/CDMConnector/
Apache License 2.0
12 stars 10 forks source link

BUG: `summarise_quantile()` fails without a compute beforehand. #24

Closed RWParsons closed 1 week ago

RWParsons commented 3 months ago

In my analyses process, I have a couple aggregation steps and want the summarize_quantile to be used later on in the pipeline. Currently, I get an error produced when trying to get quantiles: image

It works if I include a compute() after the ungroup() that comes beforehand but I don't want to have to include this step if I can avoid it.

Below is a reprex including the code which produces the error and another version with the compute() step.

library(CDMConnector)

con <- dbConnect(duckdb::duckdb(), "tempdb")

N <- 10000

d <- data.frame(
  claim = sample(1:100, size = N, replace = TRUE),
  code = sample(1:50, size = N, replace = TRUE),
  value = rnorm(n = N, mean = 1000, sd = 100)
)

dplyr::copy_to(con, d, name = "tmp", overwrite = TRUE, temporary = TRUE)

# fails
tbl(con, "tmp") |> 
  dplyr::group_by(claim, code) |> 
  dplyr::summarize(net_claim_code = sum(value, na.rm = TRUE)) |> 
  dplyr::ungroup() |> 
  dplyr::group_by(claim) |> 
  summarise_quantile(net_claim_code, probs = c(0, 0.2, 0.4, 0.6, 0.8, 1))

# works
tbl(con, "tmp") |> 
  dplyr::group_by(claim, code) |> 
  dplyr::summarize(net_claim_code = sum(value, na.rm = TRUE)) |> 
  dplyr::ungroup() |> 
  dplyr::compute()
  dplyr::group_by(claim) |> 
  summarise_quantile(net_claim_code, probs = c(0, 0.2, 0.4, 0.6, 0.8, 1))
ablack3 commented 3 weeks ago

Thank you for the reprex and my apologies for the late reply. I looked into it and I think this might be possible to fix but I'm not sure yet.

Here is a modified version of summarize_quantile that works on your example. However I mostly just removed checks and I'm not sure if this will cause problems for other cases.

summarise_quantile <- function(.data, x = NULL, probs, name_suffix = "value") {
  checkmate::assertClass(.data, "tbl_sql")
  checkmate::assert_double(probs, min.len = 1, lower = 0, upper = 1)
  checkmate::assert_character(name_suffix, null.ok = TRUE)

  selection_context <- .data$lazy_query$select_operation

  # if (!is.null(selection_context) && selection_context == 'summarise') {
  #   rlang::abort("Cannot estimate quantiles in summarise context.
  #                Try using `mutate()` function instead of `summarise()`")
  # }

  vars_context <- NULL
  x_context <- NULL
  x_arg <- rlang::enexpr(x)
  if (!is.null(selection_context)) {
    vars_context <- .data$lazy_query$select %>%
      dplyr::filter(unlist(purrr::map(.data$expr, rlang::is_quosure)))
    if (nrow(vars_context) > 0) {
      vars_context <- vars_context %>%
        # dplyr::mutate(x_var = purrr::map(purrr::map(.data$expr, rlang::get_expr), ~ if (length(.x) >= 2) {.x[[2]]} else {NULL}))
        dplyr::mutate(x_var = purrr::map(.data$expr, ~if(length(rlang::get_expr(.x)) >= 2) {rlang::get_expr(.x)[[2]]} else {NULL}))
      x_context <- unique(vars_context$x_var)[[1]]
    }
  }

  # if (!is.null(x_context) && !is.null(x_arg) && x_context != x_arg) {
  #     msg <- paste0("Confilicting quantile variables: `", x_context, "` (from context) and `", x_arg, "` (passed argument)")
  #     rlang::abort(msg)
  # }

  if (is.null(x_context) & is.null(x_arg)) {
    msg = "Quantile variable is not specified"
    rlang::abort(msg)
  }

  # x <- x_context %||% x_arg
  x <- x_arg

  group_by_vars <- .data$lazy_query$group_vars
  group_1 <- rlang::syms(c(group_by_vars, x))

  funs = list()
  if (!is.null(selection_context)) {
    funs <- purrr::map(vars_context$name, ~ rlang::expr(max(!!rlang::sym(.x), na.rm = TRUE)))
    names(funs) <- vars_context$name
  }

  group_2 <- rlang::syms(c(group_by_vars, names(funs)))

  probs = sort(unique(probs))
  quant_expr <- purrr::map(probs, ~ rlang::expr(min(ifelse(accumulated >= !!.x * total, !!x, NA), na.rm = TRUE)))
  names(quant_expr) <- paste0('p', as.character(probs * 100), '_', name_suffix)

  query <- rlang::expr(
    .data %>%
      dplyr::group_by(!!!group_1) %>%
      dplyr::summarise(..n = dplyr::n(), !!!funs, .groups = "drop") %>%
      dplyr::group_by(!!!group_2) %>%
      dbplyr::window_order(!!x) %>%
      dplyr::mutate(accumulated = cumsum(.data$..n),
                    total = sum(.data$..n, na.rm = TRUE)) %>%
      dplyr::summarize(!!!quant_expr, .groups = "drop")
  )
  eval(query)
}
RWParsons commented 3 weeks ago

Awesome - thanks Adam!

PRijnbeek commented 1 week ago

@ablack3 if this is something to prioritise please add in our backlog. Closing this here since @RWParsons seems to be helped by this provided solution. Please reopen if necessary