aphalo / ggpmisc

R package ggpmisc is an extension to ggplot2 and the Grammar of Graphics
https://docs.r4photobiology.info/ggpmisc
94 stars 6 forks source link

FR: show exact p-value instead of "P<0.001" for small p-values in `stat_correlation()`, `stat_poly_eq()` and `stat_ma_eq()` and `stat_multcomp()` #52

Closed wbvguo closed 3 months ago

wbvguo commented 5 months ago

Dear ggpmisc developer,

I was wondering if we could provide a option to show the exact p-values in scientific form for the very small p-values. For example when p=0.000000001, it's great to see p = 1e-9

Currently the ggpmisc only shows the label P < 0.001. Users can change p.digits to indicate how many digits to show, but it doesn't look good if there are too many 0s

Thanks!

wbvguo commented 5 months ago

This can be achieved by manually setting the label with sprintf()

mpg %>% 
    ggplot(aes(x = cty, y = hwy)) +
    geom_point() + 
    geom_smooth() + 
    stat_correlation(aes(label = sprintf("R=%.4g, P=%.3g", after_stat(cor), after_stat(p.value))), output.type="numeric")

image

but I guess it would be nice to have an option like p.label.format="scientific" in the stat_correlation() function to do it

aphalo commented 5 months ago

Thanks for the suggestion! This is a very special use case as in most circumstances differences among very small p-values are not informative. I see the readymade labels as a convenience for the most frequent use cases, and the use of sprintf() the normal approach for other cases. Sorry, but in this case I will avoid "feature creep" and not implement this feature.

wbvguo commented 5 months ago

Hi, thanks for the comment, this is understandable. Although I have a different opinion on the following comment

"in most circumstances differences among very small p-values are not informative."

In most circumstances, it's still much better to have a higher resolution of p-values than p<0.001 (sacrificing precision should be generally avoided unless there is a compelling reason to). E.g. for many case, it's probably desirable to see p < 1e-16 instead of p<0.001 or p<0.0000000000000001

The difference is more pronounced under model comparison/selection settings when p-values need to be compared: if 2 or more models all did not perform bad in fitting the data, which one is better?

but I understand that such feature request might have small marginal revenue. For users who are seeking a work-around solution, here is an easy way to achieve it using ggpubr::stat_cor()

mpg %>% 
    ggplot(aes(x = cty, y = hwy)) +
    geom_point() + 
    geom_smooth() + 
    stat_cor(p.accuracy = 0.000001, r.accuracy = 0.01)

image

Hope this helps.

aphalo commented 5 months ago

@wbvguo My view is that differences among very small p-values are irrelevant in the case of tests of sinificance, and that there are better criteria for model selection than p-values of individual model fits. When computing correlations, this may not always apply. Still, I would rather prefer to use confidence intervals. However, what I want to avoid are changes the user interface or behaviour for the cases that are already suportted. Maybe I misundertood your suggestion when I first read your message. Changing the behaviour to use exponential notation when p.digits is set to a large number is doable, so I am reopening this issue. p.digits == Inf could be also handled as a special case using exponential notation to show the actual value. Your suggestion is also relevant for stat_poly_eq() and stat_ma_line() and stat_multcomp().

aphalo commented 5 months ago

Here is an example with a CI (I removed the smooth line given by the fitted spline, to avoid confusions, as the R value is for a linear correlation, not for the smooth line displayed in the examples above.)

library(ggpmisc)

mpg |>
  ggplot(aes(x = cty, y = hwy)) +
  geom_point() + 
  stat_correlation(r.conf.level = 0.95,
                   use_label(c("R", "R.CI")),
                   r.digits = 3)

Created on 2024-03-23 with reprex v2.1.0

aphalo commented 5 months ago
aphalo commented 5 months ago

@wbvguo A preliminary implementation of your request is now done. In the case of labels as R expressions, the e-notation is replaced by actual powers of 10. With labels as text, the e-notation is used. You can install this version from GitHub, or more easily from the repository at https://aphalo.r-universe.dev.

Please, let me know if this change addresses your suggestion correctly.

wbvguo commented 5 months ago

Hi @aphalo ,

Thank you very much for your attention and efforts on this Feature request! The current implementation should be adequate for most cases :+1:


what I want to avoid are changes the user interface or behaviour for the cases that are already suportted.

I echo your point on this matter. Indeed, we should avoid versions breaking user's code, thanks for keeping it in this way.


p.digits == Inf could be also handled as a special case using exponential notation to show the actual value.

This is a good idea! Considering the current p.digits is integer, it's worth considering to use of the non-integer part for a more powerful expression if it doesn't hurt. For example condition action example (e.g. p=0.000425 )
p.digits==Inf show the exact value* p.digits==Inf will show p = 0.000425
p.digits=a>=1 is integer show if p-value is smaller than e^-a p.digits=3 will show p < 0.001
p.digits=a.b>=1 is not integer** show if p-value is smaller than b*e^-a p.digits=2.5 will show p < 0.05
1 > p.digits > 0 show if p-value is smaller than the given value p.digits = 0.0005 will show p < 0.0005***
otherwise throw an warning the p.digits is not
legitimate, use default option instead

: a sensible default option is to present the exact value : a=as.integer(p.digits); b=p.digits-a ; by checking b< 1e-16 we can tell if p.digits represents integer or not : the default behavior is to show in scientific form if p < 1e-5, we could also find a way to control the threshold of using scientific form and how many digits to show for the fraction part of the scientific form

These suggestions might be cumbersome, but it should offer a comprehensive solution. Utilizing the decimal part enables users with more customization/control on what's to show. (with more customization comes with more expressive power)

I should have implemented it myself and pull a request, but I am currently too occupied to make code contributions. Honestly, I treat the PRs as a way to provide feedback (fulfill responsibility as a tool user if any part of them is constructive). In the end, I did not really expect any of them to be accepted or implemented, what you have done so far is impressive and really appreciated! :-)

Regards,

aphalo commented 5 months ago

@wbvguo Many thanks for your suggestions! The idea with p.digits is to avoid showing too many digits. Although smart, the idea of using the fractional part of numbers may be too complex for some users. The use of the 0 <= p.digits < 1 range seems more intuitive and could be implemented in the future, but I need some time to think about it. Because of this I will keep this issue open, but with a later milestone for this other feature.

aphalo commented 4 months ago

I think that the best approach would be to add formal parameters named p.critical and p.highlight.action. This would make it possible to show values like P < 0.050 for example in bold.

At the moment the label formating code is repeated, almost unchanged across multiple statistics. It needs to be factored out into separate functions (#57). Once refactored, I will come back to this issue.

aphalo commented 3 months ago

@wbvguo Hi Wenbin, I did some additional changes to the logic of how nurmber formatting is applied to P-values. These changes, I hope address a bit better your use case.

The formatting, as before is done with sprintf(). When p.digits=Inf is encountered, %e is used with a width of 4 digits in the mantisa. If p.digits is <= 4, then as before %f is used, but if p.digits > 4, %e is used instead. In addition, now the same format is used for displaying the numbers within range and the boundary for those too small to fit in the format (in other words, no longer P < 0.0000001 or similar are generated).

I just pushed this commit to GitHub. It may take some hours before it is built at the R-Universe repository.

As always, your comments are very welcome.

Thanks a lot for raising this issue, it has lead to an important improvement to the package!

aphalo commented 3 months ago

Support for most of these features is now implemented and documented, unit tests written, and testing completed.