Feature request: A few changes in the tests (add_p)

Generalized commented 2 years ago

Dear Authors, First of all, I would like to express my respect and gratitude for your work and efforts to facilitate automated reporting.

As I come from the clinical trials industry, where such tables are typically used, I think it won't be tactless from my side, if I propose, for your consideration, a few changes relevant in the industry?

Please, don't find it as a criticism, only a few suggestions to improve quality of reporting. This post is going to be very long, but this is a technical discussion, so the details have to be provided along with justification.

If you will find some of my proposal reasonable to implement, you don't have to break your existing API. This could be handled by "options" or "switches", to make it backward compatible. People don't like breaking changes... especially in pharma. In this light - the changes below don't create a mess, only extend the current functionality.

Let's start!

1) Mann-Whitney (-Wilcoxon). I know it's a kind of standard in reporting, but the test suffers in case of unequal dispersions. The Brunner-Munzel handles this elegantly and is available in R, so can be used easily. I was happy to find a good summary titled:

Karch JD. Psychologists Should Use Brunner-Munzel’s Instead of Mann-Whitney’s U Test as the Default Nonparametric Procedure. Advances in Methods and Practices in Psychological Science. April 2021. doi:10.1177/2515245921999602 https://journals.sagepub.com/doi/full/10.1177/2515245921999602

It can start as a more precise permutation test (you can fix the seed for reproducibility and publish it for transparency, say 1000) and it will automatically switch to the non-permutation version when the sample size is sufficient.

2) t-test. For exactly the same reason as above, the t-test could be replaced by default by the Welch t-test. And again, a good summary by prof. Daniel Lakens aka "The 20% Statistician" http://daniellakens.blogspot.com/2015/01/always-use-welchs-t-test-instead-of.html

3) lme4 for logistic regression (or more generally - for any kind of the GLM with non-identity link). If your package will be used in clinical research for example to compare treatment arms in RCTs with binary primary endpoint (e.g. the % of successes), then a very important issues comes.

There are 2 kinds of models accounting for repeated/clustered data:

conditional to a random-effect (e.g. a patient) - handles by mixed models, requiring one to specify the structure of the random effects (translating to the G-side covariance matrix). They are used typically for prediction.
population-average aka marginal, averaging over all possible random effects, ignoring any specific structure of them. This is handled by either the GLS estimation (Generalized Least Squares), like SAS does for the "REPEATED" clause, or GEE (weighted or doubly robust in presence of data not MCAR). They are used typically for confirmation (e.g. "does the drug under investigation affect the measured outcome?"). If we assess the effect of compared drugs (study arms), according to the FDA/EMA regulations we don't care about any specific "sub-groups", characteristics" and so on. We want to see if it works globally, on average. Why? Well, that's simple - because it will be administered to many different people, each with own specifics and organism response (so complicated, so we cannot even imagine to what extend should we - and CAN - account for it!), so we don't want to artificially introduce any specific "constraints" other than the trial inclusion/exclusion criteria (defining the target group for which the drug is dedicated). We do such comparisons via so-called MMRM models (contrary to its name, it contains only fixed effects), with specified residual covariance structure (rather than via random effects).

That's, by the way, also the reason for which the default covariance structure in such analysis is always "unstructured". Only when it fails to converge, one can use simpler ones, like compound symmetry or Topelitz, in the order pre-specified in the SAP (Statistical Analysis Plan).

While the interpretation of conditional and population-average models is equivalent in case of the general linear model (Gaussian conditional response, identity link), it differs a lot for non-identity links. The population-average models typically have attenuated coefficients, there are even closed-form formulas for the mixed logistic regression with random intercept, but there is no such thing for any other model (AFAIR).

Note: It doesn't mean that either of the 2 models is better than the other. By no means. They are both valid, but answering different questions. And for typical clinical comparisons we need the population-average one.

OK, now, to the point: instead of employing the lme4 glmer for the repeated logistic regression with random intercept (which already introduces some constraint - and it's not equal to compound symmetry in general, please see below), I'd strongly propose replacing it with GEE with unstructured covariance. And when it fails, to fall back to the compound symmetry. In presence of missing data (any kind, without even testing it) - I'd employ the IPS (Inverse-Probability Weighting) via "ipw" package (e.g. Kim, KyungMann; Bretz, Frank; Hampson, Lisa V.; Cheung, Ying Kuen, Handbook of Statistical Methods for Randomized Controlled Trials) or CRTgeeDR.

Please find 2 examples on how to use it together: https://rstudio-pubs-static.s3.amazonaws.com/415697_d6fed7f7b9754c0999cc2b4075013c81.html https://www.andrewheiss.com/blog/2021/01/15/msm-gee-multilevel/

Last, but not least, there is a problem with the random-intercept LMM. It is equivalent to the compound symmetry only, if the within correlation (over time) is non-negative. While the GLS or GEE will handle negative correlations, the LMM cannot do that, as variance components cannot be negative, so will zero it. The impact will depend on case. Below I put an excerpt from one of my SAPs explaining this. Such case isn't frequent, but it does happen. obraz

I know your goal was simplicity, so let's keep it simple. How about replacing the glmer with geeglm (geepack) with compound symmetry? It's almost guaranteed to converge.

What about obtaining the main (and interaction, if we spoke about 2+ factors) effect rather than just p-values for the model coefficients? The emmeans::joint_tests (I noticed you use the emmeans) will provide the Wald's joint test for the entire categorical factor(s).

Mood's test of medians. This test if known to be rather weak (but I cannot find the source right now, just remembered it). Some people probably would believe, that Mann-Whitey does it, but of course it doesn't, unless strong (and often unrealistic) assumptions are met about the distribution of compared groups (IID: same dispersion, same direction of shape, symmetric around their medians). There is, however, an analysis, that can give a powerful answer to this question: the linear quantile regression. If you choose the q=0.5, then it will give true difference in medians (not median difference, pseudomedian difference and so on). Obtaining the p-value for the entire categorical variable - via emmeans::joint_tests over the regression.

(BTW, in R the quantile regression can handle also random effects, but the problem is a described before).

For the survival analysis I'd propose the Max-Combo test in presence of non-proportional hazards. It's a test proposed by the Non-Prop. working group initiated by the FDA a few years ago. Nothing very complicated - it employs the Fleming-Harrington test statistic for 4 combinations of parameters (one of them is equal to the logrank), takes maximum and calculates the p-value based on the the multivariate normal distribution. It's implemented in the nph package (I validated the result with the Author and it works well).

/ There are also other implementations, e.g. SimTrial (Merck) at Github or maxcombo package - has the weirdest syntax I've ever seen, but works. I recommend the nph. /

This test can handle early effects, late effects, crossing curves/effects, diminishing effects. It's not good for RCT (where we need to anticipate the kind of effect), but is a life-saver in exploratory analyses, where the log-rank fails due to masked piecewise hazard ratios and it gains popularity, Google for: site:clinicaltrials.gov SAP + "max-combo".

A few resources, if you'd like to read about it:

http://onbiostatistics.blogspot.com/2021/04/non-proportional-hazards-how-to-analyze.html
https://healthpolicy.duke.edu/sites/default/files/2020-03/oncology_trials_workshop_meeting_summary_0.pdf
Freidlin B, Korn EL. Methods for Accommodating Nonproportional Hazards in Clinical Trials: Ready for the Primary Analysis?. J Clin Oncol. 2019;37(35):3455-3459. doi:10.1200/JCO.19.01681 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7001779/
https://ww2.amstat.org/meetings/biopharmworkshop/2018/onlineprogram/ViewPresentation.cfm?file=300795.pdf
https://tinyurl.com/2ejlobky
In SAS: https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2020/5062-2020.pdf
https://pubmed.ncbi.nlm.nih.gov/33759337/
for a 2-way ANOVA, I'd allow for the type-2 or type-3 one in case of unbalanced designs. In such case, for the type-1, the interpretation of main effects depends on the order they entered the model (e.g. y ~ sex age != y ~ age sex). In my field only type 3 and 2 are of interest. Of course, I'm aware of the (valid!) criticism of the type-3 ANOVA and interpretation of the main effects in presence of (dis-ordinal - mainly) interaction, no need to argument. But the industry standards are clear and one cannot deny it. Moreover, the choice of ANOVA should be decided by the author, not chosen "blindly". If there are no interactions, type 2 can be obtained via car::Anova(... type=2) (and AFAIR emmeans::joint_tests() can do it for some models too). One only has to remember to set contr.sum for the categorical variable and make sure (when using emmeans) to test "at zero" rather than "at mean value". One can also achieve type-2 via classic stats::anova() by shuffling the order of variables and taking the first p-value from each fit. But definitely not type-1, unless it's either balanced or single-variable (one-way). In this case the problem of interpretation doesn't occur. If your goal was just to provide one-way ANOVA, then indeed, aov (wrapper for lm()) will be fine.
Last, but not least, if you would like to allow users to test non-parametrically for 2+ factors + interactions (which cannot be handled by Kruskal-Wallis or the weak Friedman), please let me know. I don't want to make this post giant. Just a few keywords: ATS (ANOVA-Type Statistic), WTS (Wald-Type Statistic), ART (Aligned-Rank Transform) ANOVA, ordinal logistic regression (generalization of Mann-Whitney for 2+ factors + interactions adjusted for numerical covariates), GEE (including GAM via smoothers). Packages: nparLD, GFD , rankFD and ARTool. More proposals and references can be found here: https://www.quora.com/Is-there-any-reliable-non-parametric-alternative-to-two-way-ANOVA-in-biostatistics/answer/Adrian-Olszewski-1?ch=10&share=2dada943&srid=MByz

I've been successfully using these methods (except Max-Combo, which is quite new) on regular basis at work with non-RCTs for about 7 years.

ddsjoberg commented 2 years ago

Hello @Generalized ! thank you for the detailed post! and of course, thank you for the kind words about the package!

I will need some time to digest all the nuance and details included, but I wanted to respond to let you know that I've seen it and appreciate it :)

A couple of things to keep in mind about any tests we incorporate:

I don't want to include any additional dependencies in the package. every dependency increases the risk of something going wrong upstream, and often creates a maintenance nightmare.
That doesn't mean we can't land on a good solution to make these tests available in gtsummary. Maybe we just create a new package called gtsummaryTests (or hopefully something more catchy!), and we are free to drop any test in there from any package? I have had other requests to add tests from various R packages that could also be added
Tests/Methods used in add_p() and add_difference() are pretty easy to extend FYI https://www.danieldsjoberg.com/gtsummary/reference/tests.html#custom-functions-1
for each of the tests listed above, we'll need example code on how they are run, what the expected data structure is, what the results look like. (no need to work on this now)
FYI Welch's t-test is the default method in t.test(). To assume a common variance, you must specify var.equal=TRUE (which you can do with add_p(test.args=)
final thought for the moment: I am not sure i have the bandwidth to take on these additions. do you have experience as an R package dev? could we collaborate?

ddsjoberg commented 2 years ago

There is a lot here, and I am going to go ahead and close this issue for now. If you're available to assist in implementing any of these methods, please open a separate issue for a single method/ Please review the code for implementing custom add_p() and add_difference() methods, and include codes to implementing the method you're requesting.

Generalized commented 2 years ago

Dear @ddsjoberg , Thank you for your explanations.

I found this file: https://github.com/ddsjoberg/gtsummary/blob/3c63f7200630e22d1496e9c4b0e8bb50e2fa5807/R/utils-add_p_tests.R and it's a brilliant starter, as you have already managed a number of important challenges (e.g. completeness of paired data). I can reuse this code to make it consistent with your current way.

Since I don't know/use GitHub, I will follow your advice on creating a separate issue per method with the code based on the existing examples.

I have just 3 technical questions: 1) I noticed that the methods comparing several groups are 1-way, am I right? I guess it's because it "naturally fits" the layout of a table with group means in subsequent rows + overall (global) test of them. For n-way the representation would be tricky how to present it and very challenging (bearing in mind type 2/3, various methods like LRT vs. Wald's, different engines via emmeans, car::Anova, stats::anova, lmTest and lmtest, clubSandwich), which is not the goal of this package (table summaries with several helpers, not a general testing engine).

2) I fully concur with your that multiplying dependencies makes the whole package unstable. But the issue is that several key methods, like the gls() (for clustered/repeated data with selected covariance structure), like geeglm (for the marginal logistic regression for repeated data), or brunnermunzel.test, anderson.darling (for 2 numerical variables), a test for comparing quantiles (medians, any other quantile; typically via quantile regression and Harrell-Davis method, e.g. in WRS2::qcomhd) would need reference to their packages (nlme, geepack, brunnermunzel, kSamples, WRS2) respectively.

I fully know the pain with R packages! Today a package exists, and tomorrow it's nothing but a memory... But at least nlme, geepack and WRS2 are stable for abut a decade, so I guess relying on then should be "no less safe" than using the others. I will just propose a few solutions (I hope still this year) so you can finally decide if it's worth integrating with the package, right?

3) For several of these methods there are confidence intervals available. If a test gives a CI, where should I put it? As a "slot" returned by the add_p()? For example, the add_p_test_ancova_lme4() or add_p_test_cohens_d() gives it this way. Should I follow this approach?

ddsjoberg commented 2 years ago

Yes, everything is one-way comparisons. Those fit nicely with add_p() and add_difference(). You can use add_stat() to add custom statistics that are multiple columns/rows.
Please review the document for writing custom methods. The output must match the standard "tidy" output from the {broom} pkg, e.g. estimate is the difference, conf.low lower bound, conf.high upper bound, p.value is the p-value, etc.

ddsjoberg / gtsummary

Feature request: A few changes in the tests (add_p) #1336