ddsjoberg / gtsummary

Presentation-Ready Data Summary and Analytic Result Tables
http://www.danieldsjoberg.com/gtsummary

Feature request: A few changes in the tests (add_p) #1336

Closed · Generalized closed this issue 2 years ago

Generalized commented 2 years ago

Dear Authors, First of all, I would like to express my respect and gratitude for your work and efforts to facilitate automated reporting.

As I come from the clinical trials industry, where such tables are typically used, I hope it won't be tactless of me to propose, for your consideration, a few changes relevant to that industry.

Please don't take this as criticism; these are only a few suggestions to improve the quality of reporting. This post is going to be very long, but it is a technical discussion, so the details have to be provided along with the justification.

If you find some of my proposals reasonable to implement, you don't have to break your existing API. They could be handled by "options" or "switches" to keep things backward compatible. People don't like breaking changes... especially in pharma. In this light, the changes below don't create a mess; they only extend the current functionality.

Let's start!

1) Mann-Whitney(-Wilcoxon) test. I know it's a de facto standard in reporting, but the test suffers in the case of unequal dispersions. The Brunner-Munzel test handles this elegantly and is available in R, so it can be used easily. I was happy to find a good summary titled:

Karch JD. Psychologists Should Use Brunner-Munzel’s Instead of Mann-Whitney’s U Test as the Default Nonparametric Procedure. Advances in Methods and Practices in Psychological Science. April 2021. doi:10.1177/2515245921999602 https://journals.sagepub.com/doi/full/10.1177/2515245921999602

It can start as the more precise permutation test (you can fix the seed for reproducibility and publish it for transparency, say 1000 permutations) and automatically switch to the asymptotic (non-permutation) version when the sample size is sufficient.
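
A hedged sketch of what the calls could look like, assuming the brunnermunzel package (lawstat::brunner.munzel.test is an alternative; exact argument names may differ):

```r
library(brunnermunzel)

set.seed(1000)  # fixed seed, published for reproducibility as proposed above
x <- rnorm(12, mean = 0.0, sd = 1)  # made-up samples with unequal dispersions
y <- rnorm(12, mean = 0.5, sd = 3)

# Permutation version for small samples...
brunnermunzel.permutation.test(x, y)

# ...and the asymptotic version once the sample size is sufficient
brunnermunzel.test(x, y)
```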

2) t-test. For exactly the same reason as above, the classic t-test could be replaced by the Welch t-test by default. Again, a good summary by prof. Daniel Lakens, aka "The 20% Statistician": http://daniellakens.blogspot.com/2015/01/always-use-welchs-t-test-instead-of.html
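
Worth noting: base R's stats::t.test() already defaults to the Welch version, so in practice the change mostly means not forcing var.equal = TRUE anywhere:

```r
set.seed(1000)
x <- rnorm(20, mean = 0, sd = 1)  # made-up samples with unequal dispersions
y <- rnorm(20, mean = 0.5, sd = 3)

t.test(x, y)                    # Welch t-test (the default: var.equal = FALSE)
t.test(x, y, var.equal = TRUE)  # classic Student t-test must be requested explicitly
```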

3) lme4 for logistic regression (or, more generally, for any kind of GLM with a non-identity link). If your package is used in clinical research, for example to compare treatment arms in RCTs with a binary primary endpoint (e.g. the % of successes), then a very important issue arises.

There are 2 kinds of models accounting for repeated/clustered data: conditional (subject-specific) models, e.g. the mixed models fit by lme4, and marginal (population-average) models, e.g. GEE.

That's, by the way, also the reason why the default covariance structure in such an analysis is always "unstructured". Only when it fails to converge can one use simpler ones, like compound symmetry or Toeplitz, in the order pre-specified in the SAP (Statistical Analysis Plan).
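
As an illustration of that pre-specified fallback order, a minimal sketch with nlme::gls(); the data frame dat and columns y, arm, visit (a factor), id are hypothetical:

```r
library(nlme)

# Try the unstructured within-subject correlation first; if it fails to
# converge, fall back to compound symmetry, per the order in the SAP.
fit <- tryCatch(
  gls(y ~ arm * visit, data = dat,
      correlation = corSymm(form = ~ 1 | id),
      weights = varIdent(form = ~ 1 | visit)),  # heterogeneous variances per visit
  error = function(e)
    gls(y ~ arm * visit, data = dat,
        correlation = corCompSymm(form = ~ 1 | id))
)
```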

While the interpretations of conditional and population-average models are equivalent in the case of the general linear model (Gaussian conditional response, identity link), they differ a lot for non-identity links. The population-average models typically have attenuated coefficients; there is even a closed-form approximation for the mixed logistic regression with random intercept, but there is no such thing for any other model (AFAIR).
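
For reference, the approximation I have in mind (Zeger, Liang & Albert, 1988) for the random-intercept logistic model with intercept variance σ²_b attenuates the conditional coefficients roughly as:

β_PA ≈ β_C / sqrt(1 + 0.346 · σ²_b), where 0.346 ≈ (16√3 / (15π))².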

Note: this doesn't mean that either of the 2 models is better than the other. By no means. They are both valid, but they answer different questions. And for typical clinical comparisons we need the population-average one.

OK, now, to the point: instead of employing lme4::glmer for the repeated logistic regression with random intercept (which already introduces some constraint, and it's not equal to compound symmetry in general, please see below), I'd strongly propose replacing it with GEE with unstructured covariance, and when that fails, falling back to compound symmetry. In the presence of missing data (of any kind, without even testing for it) I'd employ IPW (Inverse-Probability Weighting) via the "ipw" package (e.g. Kim, KyungMann; Bretz, Frank; Hampson, Lisa V.; Cheung, Ying Kuen, Handbook of Statistical Methods for Randomized Controlled Trials) or CRTgeeDR.

Please find 2 examples of how to use them together: https://rstudio-pubs-static.s3.amazonaws.com/415697_d6fed7f7b9754c0999cc2b4075013c81.html https://www.andrewheiss.com/blog/2021/01/15/msm-gee-multilevel/

Last but not least, there is a problem with the random-intercept LMM. It is equivalent to compound symmetry only if the within-subject correlation (over time) is non-negative. While GLS or GEE will handle negative correlations, the LMM cannot, as variance components cannot be negative, so it will zero them out. The impact depends on the case. Such a case isn't frequent, but it does happen. Below I put an excerpt from one of my SAPs explaining this.

[image: excerpt from a SAP explaining the negative within-subject correlation limitation]

I know your goal was simplicity, so let's keep it simple. How about replacing glmer with geeglm (geepack) with compound symmetry? It's almost guaranteed to converge; see the sketch below.
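
A minimal sketch of that replacement (the data frame dat and columns id, visit, arm, success are hypothetical):

```r
library(geepack)

# Marginal (population-average) logistic regression for repeated binary data.
dat <- dat[order(dat$id), ]  # geeglm() expects rows grouped by cluster id

fit <- geeglm(
  success ~ arm * visit,
  id     = id,
  data   = dat,
  family = binomial,
  corstr = "exchangeable"  # compound symmetry; "unstructured" is the stricter first choice
)
summary(fit)  # robust (sandwich) standard errors by default
```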

What about reporting the main (and interaction, if we speak about 2+ factors) effects rather than just p-values for the model coefficients? emmeans::joint_tests (I noticed you already use emmeans) will provide Wald's joint test for each entire categorical factor, as sketched below.
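
A small sketch of what I mean, again with hypothetical columns:

```r
library(emmeans)

# Per-factor Wald tests instead of per-coefficient p-values; with 2+ factors
# the interaction gets its own row too.
fit <- glm(success ~ arm * region, family = binomial, data = dat)
joint_tests(fit)
```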

(BTW, in R quantile regression can also handle random effects, but the problem is as described before.)

Another proposal: the Max-Combo test (a combination of weighted log-rank tests) as an alternative to the plain log-rank test in survival analyses. (There are also other implementations, e.g. simtrial (Merck) on GitHub, or the maxcombo package, which has the weirdest syntax I've ever seen but works. I recommend the nph package.)

This test can handle early effects, late effects, crossing curves/effects, and diminishing effects. It's not well suited to a confirmatory RCT (where we need to anticipate the kind of effect in advance), but it is a life-saver in exploratory analyses, where the log-rank test fails due to masked piecewise hazard ratios. It is gaining popularity; Google for: site:clinicaltrials.gov SAP + "max-combo".

A few resources, if you'd like to read about it:

I've been successfully using these methods (except Max-Combo, which is quite new) on a regular basis at work with non-RCTs for about 7 years.

ddsjoberg commented 2 years ago

Hello @Generalized! Thank you for the detailed post! And of course, thank you for the kind words about the package!

I will need some time to digest all the nuance and details included, but I wanted to respond to let you know that I've seen it and appreciate it :)

A couple of things to keep in mind about any tests we incorporate:

ddsjoberg commented 2 years ago

There is a lot here, and I am going to go ahead and close this issue for now. If you're available to assist in implementing any of these methods, please open a separate issue for each single method. Please review the code for implementing custom add_p() and add_difference() methods, and include code implementing the method you're requesting.

Generalized commented 2 years ago

Dear @ddsjoberg , Thank you for your explanations.

I found this file: https://github.com/ddsjoberg/gtsummary/blob/3c63f7200630e22d1496e9c4b0e8bb50e2fa5807/R/utils-add_p_tests.R and it's a brilliant starter, as you have already handled a number of important challenges (e.g. completeness of paired data). I can reuse this code to stay consistent with your current approach.

Since I don't know/use GitHub, I will follow your advice on creating a separate issue per method with the code based on the existing examples.

I have just 3 technical questions:

1) I noticed that the methods comparing several groups are 1-way, am I right? I guess it's because this "naturally fits" the layout of a table with group means in subsequent rows plus an overall (global) test. For n-way designs, deciding how to present the results would be tricky and the implementation very challenging (bearing in mind type 2/3 tests, various methods like LRT vs. Wald's, and different engines via emmeans, car::Anova, stats::anova, lmerTest and lmtest, clubSandwich), which is not the goal of this package (table summaries with several helpers, not a general testing engine).

2) I fully concur with you that multiplying dependencies makes the whole package unstable. But the issue is that several key methods, like gls() (for clustered/repeated data with a selected covariance structure), geeglm() (for the marginal logistic regression for repeated data), brunnermunzel.test(), the Anderson-Darling k-sample test (for 2 numerical variables), or a test for comparing quantiles (medians or any other quantile; typically via quantile regression and the Harrell-Davis method, e.g. in WRS2::qcomhd), would need references to their packages (nlme, geepack, brunnermunzel, kSamples, WRS2, respectively).
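
For instance, a hedged sketch of the quantile comparison mentioned above (WRS2::qcomhd, Harrell-Davis estimates with percentile-bootstrap CIs), on made-up data:

```r
library(WRS2)

set.seed(1000)
x <- rlnorm(40)                 # made-up skewed samples
y <- rlnorm(40, meanlog = 0.4)

# Compare the quartiles; any quantiles can be requested via q =
qcomhd(x, y, q = c(0.25, 0.5, 0.75))
```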

I know full well the pain with R packages! Today a package exists, and tomorrow it's nothing but a memory... But at least nlme, geepack and WRS2 have been stable for about a decade, so I guess relying on them should be "no less safe" than using the others. I will propose a few solutions (I hope still this year) so you can finally decide whether it's worth integrating them into the package, right?

3) For several of these methods, confidence intervals are available. If a test gives a CI, where should I put it? As a "slot" returned by add_p()? For example, add_p_test_ancova_lme4() or add_p_test_cohens_d() return it this way. Should I follow this approach?
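
To make question 3 concrete, the shape I believe those functions return (values below are made up just to illustrate the columns):

```r
# Hypothetical result tibble mirroring the CI-returning tests in
# utils-add_p_tests.R: conf.low/conf.high alongside p.value.
dplyr::tibble(
  estimate  = 0.62,  # e.g. the Brunner-Munzel relative effect
  conf.low  = 0.51,
  conf.high = 0.73,
  p.value   = 0.031,
  method    = "Brunner-Munzel test"
)
```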

ddsjoberg commented 2 years ago