gforge / Gmisc

An R package for creating tables, some plots and other useful utilities too small to merit their own package
49 stars 14 forks source link

getDescriptionStatsBy behavior for empty factor levels with a custom descriptive function #21

Closed lemna closed 7 years ago

lemna commented 7 years ago

Hi Max,

I regularly run into the following situation with clinical trial data:

There are two problems with this solution:

  1. when a cell in the first column of the table is missing an error is thrown. But when a cell in another column is missing, the values are substituted by 0.
  2. when all observations for a factor level are missing, a single empty row is returned. This makes it hard to predict what n.rgroup should be, and causes inconsistent table layout.

The first case is relatively easy. I've written some code to deal with it that won't change anything in existing calls to getDescriptionStatsBy. Would you accept a PR for this?

I am not sure how you would like the function to react in case 2. IMO, it should return a named vector with indicators for empty values. This named vector would be a new argument to getDescriptionStatsBy. The default value of this empty vector would be NULL so as not to disturb existing code. What do you think?

Thanks, Peter.

Toy example:

# datasets
 trial <- data.frame(visit = sort(rep(c("randomisation", "week1", "week2", "week3"), 5)),
                     arm = sort(rep(c("control", "treatment"))),
                     outcome = rnorm(40))
 trial_missing_first <- trial[!((trial$visit == "randomisation") & (trial$arm == "control")),]
 trial_missing_second <- trial[!((trial$visit == "randomisation") & (trial$arm == "treatment")),]
 trial_missing_both <- trial[!(trial$visit == "week3"),]

# toy function
 descriptive_function <- function(x, ...) {
   result <- c(describeMean(x, ...),
               describeMedian(x, ...))
   return(result)
 }

# no missingness - no problems
 (out <- mergeDesc(lapply(levels(trial$visit), function(x)
   getDescriptionStatsBy(x = trial$outcome[trial$visit == x],
                         by = trial$arm[trial$visit == x],
                         continuous_fn = descriptive_function)),
   htmlTable_args = list(rgroup = levels(trial$visit),
                         n.rgroup = rep(2, 4))))

# missingness in first column -> error.
 (out <- mergeDesc(lapply(levels(trial_missing_first$visit), function(x)
   getDescriptionStatsBy(x = trial_missing_first$outcome[trial_missing_first$visit == x],
                         by = trial_missing_first$arm[trial_missing_first$visit == x],
                         continuous_fn = descriptive_function)),
   htmlTable_args = list(rgroup = levels(trial_missing_first$visit),
                         n.rgroup = rep(2, 4))))

# missingness in second column -> no error
 (out <- mergeDesc(lapply(levels(trial_missing_second$visit), function(x)
   getDescriptionStatsBy(x = trial_missing_second$outcome[trial_missing_second$visit == x],
                         by = trial_missing_second$arm[trial_missing_second$visit == x],
                         continuous_fn = descriptive_function)),
   htmlTable_args = list(rgroup = levels(trial_missing_second$visit),
                         n.rgroup = rep(2, 4))))

# missingness in both columns -> problems with n.rgroup, inconsistent table layout
 (out <- mergeDesc(lapply(levels(trial_missing_both$visit), function(x)
   getDescriptionStatsBy(x = trial_missing_both$outcome[trial_missing_both$visit == x],
                         by = trial_missing_both$arm[trial_missing_both$visit == x],
                         continuous_fn = descriptive_function)),
   htmlTable_args = list(rgroup = levels(trial_missing_both$visit),
                         n.rgroup = rep(2, 4))))
gforge commented 7 years ago

Hi Peter,

interesting bug. It seems that by() returns a NULL when a variable is missing. I guess the easiest solution is to add a wrapper to the by() function that replaces the NULL with whatever it should be. I tried adding just:

    if (any(sapply(t, is.null))){
      for (i in which(sapply(t, is.null)))
        t[[i]] <- NA
    }

and it does take care of the fn_name error. In addition I think the addEmptyValuesToMakeListCompatibleWithMatrix should fill the matrix with NA and not 0 - not sure why I designed it in this way. This still leaves the last issue - this is on a higher design level, I would use an alternative design:

descs <- 
  lapply(levels(trial_missing_both$visit), function(x)
    getDescriptionStatsBy(x = trial_missing_both$outcome[trial_missing_both$visit == x],
                          by = trial_missing_both$arm[trial_missing_both$visit == x],
                          continuous_fn = descriptive_function))
mergeDesc(descs,
          htmlTable_args = list(rgroup = levels(trial_missing_second$visit),
                                n.rgroup = sapply(descs, nrow)))

This does leave you with an ugly rowname :-( but you could have a look at this section:

  if (is.null(rownames(results)) && nrow(results) == 1)
    rownames(results) <- name

A PR is of course always welcome. Thanks for the excellent report.

lemna commented 7 years ago

Hi Max,

I would prefer the following for the last issue:

The advantage is that in a script you don't have to guess or check what n.rgroup will be, and the correct rgroup will be used. What do you think?

Peter.

gforge commented 7 years ago

Sure, I don't have a strong opinion about it. The naming of argument allows ridiculously many parameters and why no use it?

gforge commented 7 years ago

Thanks! Merged. Will upload to CRAN this week.