function for calculating scales

wibeasley commented 1 year ago

inputs:

[x] vector of column names
[x] minimum count of nonmissing columns
[ ] weights vector

wibeasley commented 1 year ago

@genevamarshall, @yutiantang, and others are using sjstats::mean_n(). It doesn't support nonuniform weights, and (at least currently) it uses a slow approach that involves casting the data.frame to a matrix.
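To make the unchecked "weights vector" requirement concrete, here is a minimal base-R sketch of a weighted row mean that also enforces a minimum count of nonmissing columns. The function name `weighted_row_mean` and its arguments are hypothetical (they are not part of sjstats or OuhscMunge), and the sketch still converts to a matrix, so it illustrates the weighting logic only and does not address the speed concern.

```r
# Hypothetical helper (an assumption, not an existing API): weighted row mean
# that returns NA when a row has fewer than `min_nonmissing` observed items.
weighted_row_mean <- function(d, columns, weights, min_nonmissing = 1L) {
  stopifnot(length(columns) == length(weights))

  m        <- as.matrix(d[, columns, drop = FALSE])
  observed <- !is.na(m)

  # Repeat the weights down the rows, then zero them wherever a cell is missing,
  # so each row is renormalized over the items it actually has.
  w <- matrix(weights, nrow = nrow(m), ncol = ncol(m), byrow = TRUE) * observed
  m[!observed] <- 0

  result <- rowSums(m * w) / rowSums(w)
  result[rowSums(observed) < min_nonmissing] <- NA_real_
  result
}

d <- data.frame(a = c(1, 2, NA), b = c(3, NA, NA), c = c(5, 6, NA))
res <- weighted_row_mean(d, c("a", "b", "c"), weights = c(1, 2, 1), min_nonmissing = 2L)
```

With uniform weights this reduces to an ordinary mean of the nonmissing items, so the weights argument could default to a vector of ones.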

wibeasley commented 11 months ago

I've been working on something that meets all these requirements except for the nonuniform weights. https://github.com/LiveOak/vasquez-border-reentry-1

row_sum <- function(
    d,
    columns_to_average        = character(0),
    pattern,
    new_column_name           = "row_sum",
    threshold_proportion      = .75,
    verbose                   = FALSE
) {

  if (length(columns_to_average) == 0L) {
    columns_to_average <-
      d |>
      colnames() |>
      grep(
        x         = _,
        pattern   = pattern,
        value     = TRUE,
        perl      = TRUE
      )

    if (verbose) {
      message(
        "The following columns will be summed:\n- ",
        paste(columns_to_average, collapse = "\n- ")
      )
    }
  }

  d |>
    dplyr::mutate(
      # Temporary columns are dot-prefixed so they can't collide with
      # `new_column_name`, whose default is "row_sum".
      .row_sum = # Finding the sum (used by m4)
        rowSums(
          dplyr::across(!!columns_to_average),
          na.rm = TRUE
        ),
      .nonmissing_count =
        rowSums(
          dplyr::across(
            !!columns_to_average,
            .fns = \(x) { !is.na(x) }
          )
        ),
      .nonmissing_proportion = .nonmissing_count / length(columns_to_average),
      # Keep the sum only when enough items are nonmissing; otherwise NA.
      {{new_column_name}} :=
        dplyr::if_else(
          threshold_proportion <= .nonmissing_proportion,
          .row_sum,
          # .row_sum / .nonmissing_count,
          NA_real_
        )
    ) |>
    dplyr::select(
      -.row_sum,
      -.nonmissing_count,
      -.nonmissing_proportion
    )
  # Alternatively, return just the new columns
  # dplyr::pull({{new_column_name}})
}

DavidBard commented 1 week ago

@wibeasley Feature request and questions: FR: Would be nice to have a row_mean function as well, which averages across all nonmissing items. Q1: For row_sum, should 'columns_to_average' argument be 'columns_to_sum' instead? Q2: Can you provide an example of how this function might be used inside a dplyr::mutate statement?

wibeasley commented 1 week ago

@DavidBard,

[x] sure: #142
[x] good catch: #141

see https://ouhscbbmc.github.io/OuhscMunge/reference/row_sum.html#examples
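For readers who want the idea behind the row-mean request without following the links, here is a minimal base-R sketch. It is not the implementation tracked in #142: the name `row_mean_vec` and its arguments are hypothetical, and it returns a plain vector (rather than a data.frame) so it can be assigned to a column directly.

```r
# Hypothetical vector-returning helper (an assumption, not the package's API):
# averages the nonmissing items in each row, and returns NA when the
# proportion of nonmissing items falls below `threshold_proportion`.
row_mean_vec <- function(d, columns, threshold_proportion = 0.75) {
  items                 <- d[, columns, drop = FALSE]
  nonmissing_count      <- rowSums(!is.na(items))
  nonmissing_proportion <- nonmissing_count / length(columns)
  means                 <- rowSums(items, na.rm = TRUE) / nonmissing_count

  ifelse(threshold_proportion <= nonmissing_proportion, means, NA_real_)
}

d <- data.frame(
  id     = 1:3,
  item_1 = c(1, 2, NA),
  item_2 = c(3, NA, NA),
  item_3 = c(5, 6, 7),
  item_4 = c(7, 8, NA)
)
item_columns <- paste0("item_", 1:4)

# Base-R assignment; the second row keeps its mean because 3 of 4 items
# (0.75) meets the threshold, and the third row becomes NA.
d$scale_mean <- row_mean_vec(d, item_columns)

# Inside a dplyr pipeline, something like this should be equivalent
# (assumes dplyr >= 1.1.0 for pick()):
# d |> dplyr::mutate(scale_mean = row_mean_vec(dplyr::pick(dplyr::everything()), item_columns))
```

Returning a vector keeps the helper usable from both base R and `dplyr::mutate()`; the data.frame-in, data.frame-out design of `row_sum` above instead fits directly into a pipeline.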

OuhscBbmc / OuhscMunge

function for calculating scales #126