GEMINI-Medicine / Rgemini

A custom R package that provides a variety of functions to perform data analyses with GEMINI data
https://gemini-medicine.github.io/Rgemini/

Add p-value function in cell_suppression.R as another feature for table1? #79

Closed shijiaSMH closed 5 months ago

shijiaSMH commented 6 months ago

New Feature Request

Please describe in detail the functionality you would like to have added to the package. Describe inputs and outputs, and provide test examples if possible:

pvalue <- function(x, ...) {
  # Drop the last element, which is assumed to be the "overall" group
  x <- x[-length(x)]
  # Construct vectors of data y, and groups (strata) g
  y <- unlist(x)
  g <- factor(rep(seq_along(x), times = lengths(x)))
  if (is.numeric(y)) {
    # For numeric variables, perform an ANOVA
    p <- summary(aov(y ~ g))[[1]][["Pr(>F)"]][1]
  } else {
    # For categorical variables, perform a chi-squared test of independence
    p <- chisq.test(table(y, g))$p.value
  }
  # Format the p-value, using an HTML entity for the less-than sign.
  # The initial empty string places the output on the line below the variable label.
  c("", sub("<", "&lt;", format.pval(p, digits=3, eps=0.001)))
}

Please mention (if applicable), any research projects for which this functionality was needed:

If you have an idea about how the logic of this function should work, please write some pseudocode below to help the developer:

strata <- c(list(Total=data), split(data, data$level))

labels <- list(
    variables=list(level='Level', sex='Sex')
)

table1(strata, labels, groupspan=c(1, 1, 2), extra.col=list(`P-value`=pvalue) )

If you have already begun developing this function for your own use, please open a merge request with your in-progress code and link this issue to it.

Any suggested developers?

Any suggested reviewers?

shijiaSMH commented 6 months ago

Hi @loffleraSMH, I discussed this with Surain today and wrote up a summary of our thoughts and suggestions.

P-value

- It's an inferential statistic, usually used for hypothesis testing in more complex statistical analyses. In table 1, however, our aim is to describe whether characteristics are balanced across exposure groups, so a p-value may be overkill there.
- The more p-value tests we run, the higher the chance of a false positive, which makes the p-value results less reliable.
- P-values are commonly misinterpreted, e.g. people tend to use 0.05 as a threshold for everything, and that is not necessarily scientific.
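The multiple-testing point can be made concrete with a little arithmetic. This is only a sketch assuming independent tests at a fixed threshold; `fwer` is a hypothetical helper name, not something in Rgemini or table1.

```r
# Probability of at least one false positive among k independent tests,
# each run at significance level alpha (hypothetical helper for illustration).
fwer <- function(k, alpha = 0.05) {
  1 - (1 - alpha)^k
}

# A table 1 with 1, 5, 10, or 20 tested variables
round(fwer(c(1, 5, 10, 20)), 3)
# 0.050 0.226 0.401 0.642

# One common (conservative) remedy is a Bonferroni adjustment of the raw p-values
p_raw <- c(0.004, 0.03, 0.20)
p.adjust(p_raw, method = "bonferroni")
```

With 20 variables in a table, the chance of at least one spurious "significant" difference is already above 60% under these assumptions.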

SMD

- It's usually used to describe whether a characteristic is balanced across groups, which is in line with the purpose of a descriptive table 1.
- SMD is potentially less prone to misinterpretation, since it's commonly used in clinical trials and people are used to this one type of interpretation.
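For reference, a minimal base-R sketch of a maximum pairwise SMD for a continuous variable might look like the following. The function names are hypothetical and the pooled-variance formula is one common convention, so this is not necessarily how Rgemini computes it.

```r
# Absolute standardized mean difference between two numeric vectors,
# using the average of the two group variances as the pooled variance.
# (Hypothetical helper for illustration only.)
smd_pair <- function(a, b) {
  abs(mean(a) - mean(b)) / sqrt((var(a) + var(b)) / 2)
}

# Largest SMD over all pairs of groups in g (requires at least two groups).
max_pairwise_smd_sketch <- function(y, g) {
  groups <- split(y, g)
  pairs <- combn(names(groups), 2, simplify = FALSE)
  max(vapply(
    pairs,
    function(p) smd_pair(groups[[p[1]]], groups[[p[2]]]),
    numeric(1)
  ))
}

# Toy data: groups "a" and "c" are identical, "b" is shifted by 10
y <- c(0, 2, 10, 12, 0, 2)
g <- rep(c("a", "b", "c"), each = 2)
max_pairwise_smd_sketch(y, g)
```

Unlike a p-value, this quantity does not shrink toward "significance" as the sample size grows, which is part of why it suits a descriptive table.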

Conclusion

Given the above, if we decide to add p-value options, it's important to write clear documentation about when to use each measure.

Ex.

Please use maximum pairwise SMD for demographics, since our aim is to describe baseline characteristics of the cohort. This approach is also consistent across GEMINI publications. If the goal is inferential statistics, p-values are a common option. Specifically, the chi-squared test is used for categorical variables. For continuous variables, ANOVA is used for variables that are normally distributed, and the Kruskal-Wallis test is used for variables that are not normally distributed.
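The test selection described above could be sketched roughly as follows. `group_pvalue` and its `normal` argument are hypothetical names introduced for illustration; whether a variable is treated as normally distributed is left to the caller here rather than tested automatically.

```r
# Pick a test based on variable type (hypothetical helper, base R only):
#   categorical            -> chi-squared test of independence
#   numeric, normal = TRUE -> one-way ANOVA
#   numeric, normal = FALSE -> Kruskal-Wallis rank sum test
group_pvalue <- function(y, g, normal = TRUE) {
  g <- factor(g)
  if (is.numeric(y)) {
    if (normal) {
      summary(aov(y ~ g))[[1]][["Pr(>F)"]][1]
    } else {
      kruskal.test(y ~ g)$p.value
    }
  } else {
    chisq.test(table(y, g))$p.value
  }
}

# Continuous variable, three groups (built-in iris data)
p_anova <- group_pvalue(iris$Sepal.Length, iris$Species, normal = TRUE)
p_kw <- group_pvalue(iris$Sepal.Length, iris$Species, normal = FALSE)

# Categorical variable, perfectly balanced across two groups
sex <- rep(c("F", "M"), times = 50)
arm <- rep(c("a", "b"), each = 50)
p_chisq <- group_pvalue(sex, arm)

format.pval(c(p_anova, p_kw, p_chisq), digits = 3, eps = 0.001)
```

Making the normality decision an explicit argument keeps the choice visible to the analyst instead of hiding it behind an automatic check.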

vaakesan-SMH commented 6 months ago

Thanks @shijiaSMH for this thorough review and implementation! This is a great example of how to extend the existing functionality.

I believe the goal of Rgemini is to "standardize" our research and make common "analysis" easier to do.

I don't necessarily see an issue with exporting this, but whether to release it as part of the package depends on whether we see it being used "commonly" or whether we want to "encourage" its use. Based on some of the discussion above, this might be a point of contention. I think this is something that should probably be discussed with the larger group.

If not, as a niche use case, this could easily be a discussion board post too.

Looping in @loffleraSMH for thoughts.

loffleraSMH commented 6 months ago

Thanks @shijiaSMH and @vaakesan-SMH! Agree we should discuss this at research roundtable. Lots of things to consider here - thanks Jessica for the summary above, that's very helpful. I agree the decision comes down to whether we think this is a common approach we want to encourage people to use. If not, I like the idea of posting this on the discussion board.

loffleraSMH commented 5 months ago

Closing this issue since we decided to add this as a discussion post instead.