0todd0000 / spm1d

One-Dimensional Statistical Parametric Mapping in Python
GNU General Public License v3.0

P values and effect sizes #48

Closed Kneerav closed 8 years ago

Kneerav commented 8 years ago

Hi Todd,

Just had a couple of questions:

  1. Is it possible to return a p-value for a test statistic (e.g. F in repeated measures ANOVA) where there are no clusters? I only ask because generally speaking, I would prefer to report the actual p value wherever possible e.g. p=0.58 vs p >= 0.05. Is that possible?
  2. What is the best way to go about returning/reporting an effect size? Obviously, the SPM graphs show the test statistic across the waveform, but from a publication standpoint, these SPM graphs may be moved to supplementary material, and I would like to include some measure of the effect magnitude (in some way) within the main manuscript. I have thought of the following options:

a) Take the peak F or T statistic for a given cluster
b) Take the average F or T statistic for a given cluster
c) Report the F or T range (the lower end being the critical threshold, the upper end being the peak)
d) Compute a paired Cohen's d (for my post-hoc paired t-tests) for each node across the waveform, and then take either the peak, average, or range (within the suprathreshold cluster period) as described previously.
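To make the options concrete, here is a rough sketch of the kind of summaries I mean (purely illustrative: the data, the threshold value and all variable names are placeholders, not spm1d output):

    import numpy as np

    np.random.seed(0)
    yA = np.random.randn(10, 101)          #placeholder data: 10 subjects x 101 nodes
    yB = np.random.randn(10, 101) + 0.5
    d  = yB - yA                           #paired differences

    t      = d.mean(axis=0) / (d.std(axis=0, ddof=1) / np.sqrt(d.shape[0]))  #node-wise paired t
    tstar  = 3.0                           #placeholder critical threshold (would come from SPM inference)
    inside = t > tstar                     #boolean mask of the suprathreshold cluster

    if inside.any():
        peak_t   = t[inside].max()                           #option (a)
        mean_t   = t[inside].mean()                          #option (b)
        t_range  = (tstar, peak_t)                           #option (c)
        cohens_d = d.mean(axis=0) / d.std(axis=0, ddof=1)    #option (d): node-wise paired Cohen's d
        peak_d   = cohens_d[inside].max()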

Is there any particular method you see as the most valid and/or acceptable?

Cheers, Nirav

0todd0000 commented 8 years ago

Hi Nirav,

Excellent questions.

  1. Yes, precise p values for p > alpha can indeed be computed, but there are two caveats. (Caveat 1) Since there are no supra-threshold clusters the p value would pertain only to the maximum test statistic value across the whole 1D continuum. (Caveat 2) Random Field Theory (RFT) solutions are generally validated only for low probabilities (usually p<0.4 or p<0.3). The RFT solutions may indeed be valid for all probabilities between 0 and 1, but since they generally have not been validated a convention of "p > alpha" emerged in the SPM literature. Both caveats aside, the p values we're talking about can roughly be computed using the rft1d.teststat.sf functions like this:

    import rft1d
    
    maximum_f_value = 5.0  #maximum f value across the 1D continuum
    df = (2, 14)  #degrees of freedom
    nodes = 101   #number of continuum nodes
    fwhm = 10.0   #1D smoothness
    
    p_value = rft1d.f.sf(maximum_f_value, df, nodes, fwhm)  #in this case: p = 0.391
    

    Note that, while the rft1d.teststat.sf results are approximate, I've done some partial validations of large p values using t statistics, and they generally seem quite good, with perhaps some minor discrepancies for very large p values like p=0.9. In the next version of spm1d we'll try to include some more details about this, preferably with validated p values when p > alpha.
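    If you'd like to check large p values yourself, here is a minimal sketch of the kind of validation I mean, using rft1d's smooth Gaussian field generator (the iteration count, smoothness and threshold height are arbitrary):

        import numpy as np
        import rft1d

        np.random.seed(0)
        nIterations = 5000
        nodes, fwhm = 101, 10.0

        #simulate smooth Gaussian fields and record the maximum of each:
        maxima = np.array([rft1d.randn1d(1, nodes, fwhm).max() for i in range(nIterations)])

        u = 2.0                                       #a fairly low height, corresponding to a large p value
        p_simulated = (maxima > u).mean()             #empirical survival probability
        p_rft       = rft1d.norm.sf(u, nodes, fwhm)   #RFT prediction
        print(p_simulated, p_rft)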

    For now, if you wish to report these p values in a paper, I'd recommend being cautious. For completeness you could write something like: "The null hypothesis was not rejected at alpha=0.05, and the probability that 1D Gaussian continua with the same smoothness would produce an F continuum with the same maximum value is approximately p = 0.391. This solution is approximate pending numerical validations of large p values (Friston et al. 2007) which have not yet appeared in the literature."

  2. I think the best way is, unfortunately, with the SPM graphs. The fundamental problem with reporting F or t values, and even with reporting Cohen's d, is that sample sizes (degrees of freedom) affect their probabilistic meaning. Thus t=3.0 in one dataset may be probabilistically equivalent to t=4.0 in another dataset. With the SPM graphs the critical threshold anchors the results in probability theory: the critical threshold has the same probabilistic meaning across all experiments. That being said, I think all of the possibilities you mention are fine, provided they are used to compare datasets ONLY when the degrees of freedom are identical or at least quite similar.
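As a rough illustration of this df-dependence (the numbers are hypothetical), the same maximum t value yields quite different field-level probabilities for different degrees of freedom:

    import rft1d

    nodes, fwhm = 101, 10.0

    #the same maximum t value, evaluated for two different sample sizes:
    p_small = rft1d.t.sf(3.0, 8,  nodes, fwhm)   #df = 8
    p_large = rft1d.t.sf(3.0, 28, nodes, fwhm)   #df = 28
    print(p_small, p_large)

Equivalently, the critical thresholds at alpha=0.05 (via rft1d.t.isf) differ with df, so it is the threshold, not the raw t value, that carries a fixed probabilistic meaning.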

As an aside, I fully appreciate the problem of how to describe the results. My main advice, based on the Neuroimaging literature, is that SPM results transcend the convention of text descriptions to a certain extent, because the graphs depict results that text alone can never fully convey. In the Neuroimaging literature effect sizes and most other statistics are reported at the continuum level (in that case as a 3D continuum within the brain) and are described qualitatively in the text, often augmented with cluster-specific p values.

Cheers,

Todd

Kneerav commented 8 years ago

Hi Todd,

Thanks for that thorough reply. It all makes sense to me. As a quick follow-up: normally an ANOVA is what provides the justification for post-hoc pairwise comparisons. So my question is, if an ANOVA returns a suprathreshold cluster at 10-30% of a gait cycle, for example, and a post-hoc t-test shows a cluster at 5-40% for a pairwise comparison (even if Bonferroni corrected), are we to ignore everything below 10% and above 30%?

Cheers, Nirav

0todd0000 commented 8 years ago

Hi Nirav,

A somewhat related discussion arose elsewhere regarding interpreting the temporal bounds of suprathreshold clusters (Issue #41). To summarize that discussion in the context of your question: disagreement between ANOVA and post hoc results can arise from simplistic post hoc procedures, and we hope that future versions of spm1d will provide closer agreement between the two. Nevertheless, even with more accurate post hoc procedures I think the interpretation would be the same: SPM and all other classical hypothesis testing procedures describe the behavior of random data, not true effects. Thus I wouldn't recommend ignoring disagreements between ANOVA and post hoc analyses; I'd instead recommend regarding both as approximations to potentially true effects.

Todd

Kneerav commented 8 years ago

Hi Todd,

Again, a very thorough response, so thanks for that. Somewhat in line with the limitations of current post-hoc testing: I have been running paired t-tests as my post-hoc tests. With ten pairwise tests, I have applied a Bonferroni correction. Initially, I had the alpha level set to 0.05, yet only accepted clusters in which the p value was less than 0.005 (admittedly, I did this without giving the issue much thought). However, I recently re-did this with the alpha level set to 0.005, and noticed the results were different from the original method. I'm just curious as to why these differences occur, and which would be more valid in the current context?

Cheers, Nirav

0todd0000 commented 8 years ago

Hi Nirav,

Field-level and cluster-level probabilities are different, so there can sometimes be discrepancies between field-level and cluster-level results. The two probabilities can be interpreted as follows:

- Field-level p value: the probability that smooth, random 1D continua would produce a test statistic continuum whose maximum exceeds a given height; it depends only on that height.
- Cluster-level p value: the probability that smooth, random 1D continua, thresholded at the critical height, would produce a suprathreshold cluster as broad as the observed one; it depends on both height and extent.

To understand how these can differ, consider a hypothetical case where you have a very large cluster that just barely exceeds the critical threshold at alpha=0.05, but which doesn't exceed the threshold at alpha=0.04. The cluster's p value will be very small (p<0.001) in the former case, but will be undefined in the latter case. Thus the relation between field-level and cluster-level probabilities is complex; since field-level probabilities depend only on height, but cluster-level probabilities depend on both height and extent, it's generally not possible to find a cluster-level threshold that is uniquely related to a field-level threshold.

When conducting post hoc analysis it's best to adjust the field-wide threshold and not the cluster-specific threshold because the null hypothesis pertains to the whole field and not to specific clusters. The clusters are unknown prior to conducting the experiment, and it's scientifically invalid to adjust one's hypothesis regarding a particular dataset after observing the contents of that dataset.
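In spm1d terms, here is a minimal sketch of what I mean (placeholder data; ten post hoc paired comparisons assumed, matching your example):

    import numpy as np
    import spm1d

    np.random.seed(0)
    yA = np.random.randn(10, 101)            #placeholder data: 10 subjects x 101 nodes
    yB = np.random.randn(10, 101) + 0.3

    n_comparisons   = 10
    alpha_corrected = 0.05 / n_comparisons   #Bonferroni-adjusted field-level alpha

    #adjust the field-wide threshold (not the cluster-specific p values):
    t  = spm1d.stats.ttest_paired(yA, yB)
    ti = t.inference(alpha=alpha_corrected, two_tailed=True)
    print(ti.h0reject, ti.zstar)             #null hypothesis rejection decision and critical threshold

The cluster-specific p values that spm1d reports then pertain to that adjusted threshold, so no additional filtering at 0.005 is needed.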

Alternatively, it is perfectly acceptable to focus the hypothesis on particular field regions (e.g. 0 to 20% time) in an a priori sense, but in that case only that particular time window should be analyzed and the rest of the field should be ignored.
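For example, a simple way to do that is to restrict the arrays to the window of interest before running the test (a hypothetical sketch; 101-node continua assumed, with 0-20% time corresponding to the first 21 nodes):

    import numpy as np
    import spm1d

    np.random.seed(1)
    yA = np.random.randn(10, 101)        #placeholder data: 10 subjects x 101 nodes
    yB = np.random.randn(10, 101) + 0.3

    #a priori hypothesis restricted to 0-20% time: analyze only that window
    yA_region = yA[:, :21]
    yB_region = yB[:, :21]

    ti = spm1d.stats.ttest_paired(yA_region, yB_region).inference(alpha=0.05, two_tailed=True)
    print(ti.h0reject)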

Cheers,

Todd

Kneerav commented 8 years ago

Ahh I see now, I'll go with the field-level adjusted method as suggested (it made more sense to begin with). You can probably close this one off.

Thanks again, Nirav