0todd0000 / spm1d

One-Dimensional Statistical Parametric Mapping in Python
GNU General Public License v3.0

p-values and interpretation #191

Closed Lucia-Scheffler closed 2 years ago

Lucia-Scheffler commented 2 years ago

Dear Todd,

We have been discussing the application and interpretation of SPM, and some open questions remain that we would like to ask you:

Question 1: Due to the stepwise comparison of the trajectories, it is not possible to determine a p-value for curves that do not show a statistically significant difference over their entire course. However, if two curves show a statistically significant difference over their entire course, it is possible to determine a p-value over the entire course with your toolbox. What is the difference between the two cases, so that a p-value can be calculated once and not in the other case?

Question 2: In one of your recent papers (Honert, E. C., & Pataky, T. C. (2021). Timing of gait events affects whole trajectory analyses: A statistical parametric mapping sensitivity analysis of lower limb biomechanics. Journal of Biomechanics, 119, 110329) you describe the poor localizing power of time-continuous analyses. This led us to the question of how we should interpret SPM results. As an example, an SPM plot shows a significant difference between two trajectories around 50% of the movement. This should indicate that there is a significant difference between the two trajectories. However, can we say that this difference occurs around 50% (in the middle of the movement), or do we have to say that we cannot localize the difference over the time course?

Thank you very much for your understanding and your support in advance.

Best regards, Cagla, Lucia and Felix

0todd0000 commented 2 years ago

Question 1: Due to the stepwise comparison of the trajectories, it is not possible to determine a p-value for curves that do not show a statistically significant difference over their entire course.

spm1d does not directly report such a p-value, but it is possible to calculate p-values for p>alpha. A relevant p-value is the one associated with the maximum test statistic value: P(z_max > u). This is the probability used to calculate the critical threshold: set "P(z_max > u) = alpha", then solve for u given the user-specified alpha. This probability is valid for small p-values (approximately p<0.4), but it becomes increasingly inaccurate as p gets larger. Thus there is no problem for hypothesis testing, which typically uses alpha values much smaller than 0.4. Although p-values for p>0.5 are generally not valid, they are also not terribly inaccurate.
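To make the relationship between P(z_max > u), alpha and the critical threshold concrete, here is a minimal sketch that estimates the distribution of the maximum t value by simulation. This is not how spm1d computes it (spm1d uses random field theory); simulation is swapped in only because it needs nothing beyond NumPy. All names (`smooth_noise`, `t_curve`, `simulate_tmax`), the group size, the smoothing width and the "observed" maximum of 2.5 are assumptions made for illustration, and the test is one-tailed.

```python
import numpy as np

def smooth_noise(rng, n_curves, n_nodes, kernel_sd):
    """Smooth, unit-variance Gaussian curves (one per row) from filtered white noise."""
    half = int(4 * kernel_sd)
    x = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (x / kernel_sd) ** 2)
    kernel /= np.sqrt(np.sum(kernel ** 2))  # keeps unit variance after filtering
    white = rng.standard_normal((n_curves, n_nodes + 2 * half))
    return np.array([np.convolve(w, kernel, mode="valid") for w in white])

def t_curve(ya, yb):
    """Pointwise two-sample t statistic (pooled variance) at every node."""
    na, nb = len(ya), len(yb)
    pooled = ((na - 1) * ya.var(axis=0, ddof=1)
              + (nb - 1) * yb.var(axis=0, ddof=1)) / (na + nb - 2)
    return (ya.mean(axis=0) - yb.mean(axis=0)) / np.sqrt(pooled * (1 / na + 1 / nb))

def simulate_tmax(n_iter=2000, n_per_group=10, n_nodes=101, kernel_sd=8.0, seed=0):
    """Maximum t value over the whole curve for many datasets generated under H0."""
    rng = np.random.default_rng(seed)
    tmax = np.empty(n_iter)
    for i in range(n_iter):
        y = smooth_noise(rng, 2 * n_per_group, n_nodes, kernel_sd)
        tmax[i] = t_curve(y[:n_per_group], y[n_per_group:]).max()
    return tmax

tmax = simulate_tmax()
alpha = 0.05

# Critical threshold: the u that solves P(t_max > u) = alpha
u_crit = np.quantile(tmax, 1 - alpha)

# The same distribution gives a p-value for any observed maximum,
# including one below the threshold (hypothetical observed value)
t_max_observed = 2.5
p_observed = np.mean(tmax > t_max_observed)

print(f"critical threshold u = {u_crit:.3f}")
print(f"P(t_max > {t_max_observed}) = {p_observed:.3f}")
```

The point of the sketch is that the threshold and the p-value come from the same distribution, so a p-value exists whether or not the threshold is crossed.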

However, if two curves show a statistically significant difference over their entire course, it is possible to determine a p-value over the entire course with your toolbox.

Yes.

What is the difference between the two cases, so that a p-value can be calculated once and not in the other case?

There is no difference. The former is conventionally not reported in the SPM literature, mainly because (a) the precise p-value is irrelevant when the goal is hypothesis testing, and (b) large p-values are not terribly accurate. spm1d follows this convention. But I see from your comment that it would be useful to include this p-value when the null is not rejected, especially for cases where the maximum test statistic value approaches the critical threshold. In this case I can see that it would be useful to report "p=0.121", for example. Please note that this feature is now at the top of the list of spm1d feature requests to be included in future releases.




Question 2: In one of your recent papers (Honert, E. C., & Pataky, T. C. (2021). Timing of gait events affects whole trajectory analyses: A statistical parametric mapping sensitivity analysis of lower limb biomechanics. Journal of Biomechanics, 119, 110329) you describe the poor localizing power of time-continuous analyses. This led us to the question of how we should interpret SPM results. As an example, an SPM plot shows a significant difference between two trajectories around 50% of the movement. This should indicate that there is a significant difference between the two trajectories. However, can we say that this difference occurs around 50% (in the middle of the movement), or do we have to say that we cannot localize the difference over the time course?

This is a very good question, and an explanation is subtle but important.

As a short answer: I think the former is fine for reporting purposes. I would recommend against saying that effects cannot be localized because (a) "cannot" slightly overstates the localization issue, and (b) poor localization is implicit in the method.

A longer answer: A significant effect at 50% does not imply that there is a real effect at 50% and no real effect elsewhere. This interpretation is tempting, but is also not entirely correct. It is certainly possible that there is indeed a real effect at 50%, and none elsewhere, but this is just one possibility. Another possibility is that there is a real effect everywhere, and that the sample size was inadequate to detect a real effect everywhere. Yet another possibility is that there is a real effect at 75%, and no real effect at 50%, but due to random sampling an effect was observed at 50% and not at 75%.
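One of these possibilities (a real effect everywhere, detected only in part of the curve) can be illustrated with a small simulation. This is a sketch under stated assumptions, not an spm1d analysis: the helper names, the group size of 10, the smoothing width and the uniform effect of one standard deviation are all hypothetical, and the critical threshold is estimated by simulation rather than by random field theory.

```python
import numpy as np

def smooth_noise(rng, n_curves, n_nodes=101, kernel_sd=8.0):
    """Smooth, unit-variance Gaussian curves (one per row) from filtered white noise."""
    half = int(4 * kernel_sd)
    x = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (x / kernel_sd) ** 2)
    kernel /= np.sqrt(np.sum(kernel ** 2))  # keeps unit variance after filtering
    white = rng.standard_normal((n_curves, n_nodes + 2 * half))
    return np.array([np.convolve(w, kernel, mode="valid") for w in white])

def t_curve(ya, yb):
    """Pointwise two-sample t statistic (pooled variance) at every node."""
    na, nb = len(ya), len(yb)
    pooled = ((na - 1) * ya.var(axis=0, ddof=1)
              + (nb - 1) * yb.var(axis=0, ddof=1)) / (na + nb - 2)
    return (ya.mean(axis=0) - yb.mean(axis=0)) / np.sqrt(pooled * (1 / na + 1 / nb))

rng = np.random.default_rng(1)
n, n_iter, alpha = 10, 1000, 0.05
effect = 1.0  # hypothetical true difference, identical at EVERY node

# Step 1: one-tailed critical threshold from datasets with no true effect
null_max = [t_curve(*np.split(smooth_noise(rng, 2 * n), 2)).max()
            for _ in range(n_iter)]
u_crit = np.quantile(null_max, 1 - alpha)

# Step 2: datasets with a true effect everywhere; record how much of the
# curve actually crosses the threshold whenever H0 is rejected
fractions = []
for _ in range(n_iter):
    ya, yb = np.split(smooth_noise(rng, 2 * n), 2)
    t = t_curve(ya + effect, yb)
    if t.max() > u_crit:
        fractions.append(np.mean(t > u_crit))

power = len(fractions) / n_iter
mean_fraction = float(np.mean(fractions))
print(f"H0 rejected in {power:.0%} of datasets")
print(f"mean fraction of nodes above threshold when rejected: {mean_fraction:.0%}")
```

In this setup the suprathreshold region typically covers only part of the curve even though the true effect is present at every node, so the location of the crossing says little about where the real effect is.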

The key is that we can never know the truth about real effects vs. no effects. In the world of hypothesis testing, all we can do is make an assumption, then quantify the likelihood of observing an effect as large as the observed effect when that assumption is true. The relevant assumption is embodied in the null hypothesis (H0), and in the two-sample case H0 is: equivalent population mean trajectories. A cluster-level SPM result of p=0.02, for example, is interpreted as follows: if H0 were true (so that there were truly no differences between population mean trajectories, and thus truly just a single population with the observed smoothness characteristics), then the probability that random sampling would produce a suprathreshold cluster as large as or larger than the observed cluster is p=0.02.
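The cluster-level interpretation can be sketched the same way: generate many datasets for which H0 is true, threshold each t curve, and ask how often the largest suprathreshold cluster is at least as large as the observed one. This is an illustration only, not spm1d's random field theory calculation. The threshold of 3.0, the observed extent of 12 nodes and all helper names are hypothetical, and cluster extent is simplified here to a count of consecutive nodes.

```python
import numpy as np

def smooth_noise(rng, n_curves, n_nodes=101, kernel_sd=8.0):
    """Smooth, unit-variance Gaussian curves (one per row) from filtered white noise."""
    half = int(4 * kernel_sd)
    x = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (x / kernel_sd) ** 2)
    kernel /= np.sqrt(np.sum(kernel ** 2))  # keeps unit variance after filtering
    white = rng.standard_normal((n_curves, n_nodes + 2 * half))
    return np.array([np.convolve(w, kernel, mode="valid") for w in white])

def t_curve(ya, yb):
    """Pointwise two-sample t statistic (pooled variance) at every node."""
    na, nb = len(ya), len(yb)
    pooled = ((na - 1) * ya.var(axis=0, ddof=1)
              + (nb - 1) * yb.var(axis=0, ddof=1)) / (na + nb - 2)
    return (ya.mean(axis=0) - yb.mean(axis=0)) / np.sqrt(pooled * (1 / na + 1 / nb))

def max_cluster_extent(z, u):
    """Length (in nodes) of the longest run of consecutive nodes with z > u."""
    longest = run = 0
    for above in z > u:
        run = run + 1 if above else 0
        longest = max(longest, run)
    return longest

rng = np.random.default_rng(2)
n, n_iter = 10, 2000
u = 3.0               # hypothetical threshold
observed_extent = 12  # hypothetical extent of the observed cluster, in nodes

# Largest suprathreshold cluster in each dataset generated under H0
extents = np.empty(n_iter, dtype=int)
for i in range(n_iter):
    y = smooth_noise(rng, 2 * n)
    extents[i] = max_cluster_extent(t_curve(y[:n], y[n:]), u)

# Probability, under H0, of a cluster at least as large as the observed one
p_cluster = np.mean(extents >= observed_extent)
print(f"P(extent >= {observed_extent} | H0) = {p_cluster:.3f}")
```

Nothing in this calculation uses the position of the cluster along the curve, only its size, which is consistent with the point that the p-value carries no localizing information.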

Note that H0 embodies no localizing information. There are no details regarding where one expects to observe effects. Thus rejecting H0 also embodies no localizing information. Rejecting H0 at alpha implies only that H0 was a relatively poor prediction, but the time point(s) which cross the critical threshold are irrelevant to that H0 rejection decision.

If you were to alter H0 to focus only on time=50%, then this new H0 would have high localizing power, but all other data would be irrelevant to this new H0, so would have to be ignored.

So yes, you can say that the observed difference occurs at around 50%, but the reality is that this is incidental to the H0 rejection decision.

This apparent dilemma between localizing vs. non-localizing is resolved if you regard H0 as an experimental prediction. Rejecting H0 implies that the prediction was poor, and that a new H0 should be devised. From this perspective it is clearer why an effect at 50% is incidental: rejecting H0 anywhere implies that H0 should be revised.

Lucia-Scheffler commented 2 years ago

Hello Todd,

Thank you very much for your support. Your answers were clear and helped us a lot.

Best regards, Cagla, Felix, Lucy