0todd0000 / spm1d

One-Dimensional Statistical Parametric Mapping in Python
GNU General Public License v3.0

Understanding dependence of critical threshold on DOF #123

Closed · schwa021 closed this 4 years ago

schwa021 commented 4 years ago

Hello,

Thanks for providing this nice software and the helpful manuals, background info, etc.

This is my first foray into using SPM to analyze continuous (1D) curves, and I have a question about the "general" behavior of the critical values for t-tests.

Below is an example:

[figure: pre/post EMG curves and the resulting SPM{t} plot]

The following MATLAB code shows the statistics I was running:

spm = spm1d.stats.ttest_paired(X1, X2);

spmi = spm.inference(.05, 'two_tailed', true, 'interp', true);

Which generated:

```
SPM{t} inference
   z:        [1×101 double]
   df:       [1 2]
   fwhm:     6.3796
   resels:   [1 15.6749]
   alpha:    0.0500
   zstar:    44.9278
   h0reject: 0
   p_set:    1
   p:        []
```

Looking at the pre/post EMG data, I would have "guessed" that there were significant differences near 75% of the gait cycle (maybe elsewhere too). However, the test statistic was not even close to the critical value.

If we re-scale the right-hand plot, we can see that the z statistic is "pretty big" (I have no frame of reference for gauging how big or small it is, but it seems to be a z-score, so 10 seems big to me).

[figure: re-scaled SPM{t} plot]

I'm wondering about how the critical value (zstar I think it's called) tends to scale with degrees of freedom. In this example, there are only three trials for each condition. I am guessing that, despite the wide gap (even as a z-score), there is something going on with having such a low dof that makes the apparent difference insignificant.

If this is the case - do you have any generic advice about the N required for convergence of this statistical approach?

I appreciate any thoughts you have or advice you can give.

-Mike-

0todd0000 commented 4 years ago

For all hypothesis tests, the critical threshold (z*) approaches infinity as the degrees of freedom approach zero (i.e., as the sample size approaches its minimum).

The attached figure shows z* as a function of the degrees of freedom (v). For 0D data, z* reaches only about 6 when v=1, but you can see that the trend is toward infinity as v decreases.

For 1D data, the random field theory critical thresholds exhibit the same trend, but the thresholds are higher to protect against false positives across the 1D domain. When v=2 and FWHM=6 (similar to the FWHM in your dataset), z*=121.8 (not shown in the figure).

The basic reason for these trends is that the sample mean and sample variance (i.e., estimates of the population mean and population variance) become increasingly poor as v decreases. The critical threshold rises to account for the instability of these estimates.

[figure: z* as a function of degrees of freedom (v) for 0D data]
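For a rough 0D-only sketch of this trend (using SciPy's Student's t quantiles; the alpha and df values below are just examples, not taken from your dataset):

```python
# Sketch: 0D critical thresholds as a function of the degrees of freedom (v).
# The alpha and df values are illustrative only.
from scipy import stats

alpha = 0.05
for v in [1, 2, 3, 5, 10, 30, 100]:
    zstar_1t = stats.t.ppf(1 - alpha, df=v)      # one-tailed threshold
    zstar_2t = stats.t.ppf(1 - alpha / 2, df=v)  # two-tailed threshold
    print(f"v = {v:3d}   z* (one-tailed) = {zstar_1t:6.2f}   z* (two-tailed) = {zstar_2t:6.2f}")
```

The "about 6" at v=1 mentioned above appears to correspond to the one-tailed alpha=0.05 case (roughly 6.31); the 1D RFT thresholds follow the same shape but sit higher.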

The easiest way to choose sample size (N) is to conduct a power analysis. Before the experiment, specify the effect that you wish to detect and the expected variance in the dataset (or the variance measured in a pilot study). N is then computed from the effect, the variance, and alpha (usually 0.05).

Power analysis is straightforward for 0D data, and a variety of software tools exist for calculating sample size. For 1D data things are a bit more complex because the effect one wishes to detect can have an arbitrary 1D structure. The only tool I know of for arbitrary 1D power analysis is power1d, for which an example sample size calculation is available.
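For the 0D case, such a sample-size calculation can be sketched with statsmodels (the effect size, alpha and power below are placeholder values, not taken from this thread; arbitrary 1D effects need power1d as noted above):

```python
# Sketch: 0D sample-size calculation for a paired (one-sample) t-test.
# The effect size, alpha and power values are illustrative placeholders.
from statsmodels.stats.power import TTestPower

d     = 0.8    # standardized effect size (mean difference / SD of the differences)
alpha = 0.05
power = 0.8

n = TTestPower().solve_power(effect_size=d, alpha=alpha, power=power,
                             alternative='two-sided')
print(f"required number of pairs: {n:.1f}")   # roughly 14 for these placeholder values
```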

schwa021 commented 4 years ago

Thanks, Todd. That makes perfect sense and is basically what I had figured.

Regarding "pre-hoc" (a priori) power analysis: I agree completely. This was simply a look at some retrospective data, so it wasn't an option.

I wonder if you feel there may be some alternative approach for small-sample experiments. I only say this because it seems odd that the data I showed, with z-scores > 10 and an extremely wide "visual" gap, are not even close to significant. Maybe the adjustment is just too conservative with small N. Maybe some sort of Bayesian adjustment allowing for a prior?

Anyway, thanks again for the software (it works like a charm) and for the quick answer to this question.

-Mike-

0todd0000 commented 4 years ago

Hi Mike, you're very welcome, I'm glad you like spm1d!

If you'd like to use hypothesis testing, the only ways around the small sample issue that I know of are: (1) region of interest (ROI) analysis, and (2) nonparametric inference.

ROI ANALYSIS
You could reduce the 1D domain into one or more ROIs. As the ROIs get smaller, the RFT thresholds converge to the 0D thresholds (e.g. Student's t thresholds). However, like power analysis, ROIs need to be specified in an a priori manner. Examples of ROI analysis are in: ./spm1d/examples/stats1d_roi/
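In the Python version an ROI can be passed as a boolean mask over the domain nodes (the roi keyword used in those examples). A minimal sketch, where the 60–80% window is hypothetical and would need to be chosen a priori, and X1/X2 are your (nTrials × 101) pre/post arrays:

```python
# Sketch: paired t-test restricted to an a priori ROI with spm1d (Python).
# X1, X2 are (nTrials x 101) pre/post arrays; the ROI bounds are hypothetical.
import numpy as np
import spm1d

roi = np.zeros(101, dtype=bool)
roi[60:81] = True    # restrict inference to 60-80% of the gait cycle

t  = spm1d.stats.ttest_paired(X1, X2, roi=roi)
ti = t.inference(0.05, two_tailed=True, interp=True)
print(ti.zstar)      # the ROI threshold is lower than the whole-domain threshold
```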

NONPARAMETRIC INFERENCE
spm1d's nonparametric procedures use observation permutation to infer distributions. For two-sample t tests with n observations in each of the two groups, the total number of permutations (m) is:

m = (2*n)! / (n! n!)

which means the following:

| n | m   | minimum p-value (1/m) |
|---|-----|-----------------------|
| 3 | 20  | 0.05                  |
| 4 | 70  | 0.014                 |
| 5 | 252 | 0.004                 |
| 6 | 924 | 0.001                 |

Thus, with n=3 observations, it is not possible to breach the conventional threshold of alpha=0.05. If you can increase n to 4, then nonparametric inference would work if all of the observations in one group are smaller than all of the observations in the other group, or vice versa.
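The counts in that table follow directly from the formula above; a quick sketch:

```python
# Sketch: number of two-sample permutations m = (2n)! / (n! n!) and the
# smallest attainable p-value (1/m) for small group sizes n.
from math import comb

for n in range(3, 7):
    m = comb(2 * n, n)    # (2n)! / (n! n!)
    print(f"n = {n}   m = {m:4d}   minimum p = {1 / m:.3f}")
```

(For reference, the nonparametric procedures in the Python version live under spm1d.stats.nonparam, e.g. spm1d.stats.nonparam.ttest_paired.)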

There are no other hypothesis testing methods that I'm aware of for dealing with small samples. The basic problem is that hypothesis testing methods attempt to infer the population distributions from just a few observations, and n = 3 observations is simply insufficient to confidently estimate those distributions.

Bayesian analysis would indeed work better for small samples, especially when the differences are consistently large for all observations, but there are unfortunately no easy-to-use software tools that I'm aware of for 1D data.

As an aside, I know what you mean by "too conservative", as the thresholds do indeed seem unnecessarily high. For completeness: the random field theory (RFT) thresholds are actually neither conservative nor non-conservative; they accurately control alpha when the 1D data are smooth and Gaussian. When N is small, sample means and test statistics fluctuate wildly, so the thresholds have to be high. Most importantly, hypothesis tests tell us only about the null hypothesis (which is usually: null effect). Even if a sample's effect size looks big, this is irrelevant, because hypothesis testing results tell us only what is probabilistically true when the null hypothesis is true. They do not tell us whether a particular observed effect is true.

schwa021 commented 4 years ago

Thanks again Todd. I agree completely with your analysis and explanation.

When I said the thresholds seemed "conservative", I did not mean to imply that the methods have an actual conservative bias; I was just speaking informally. As a Bayesian (thinker) with 30 years' experience in gait analysis, it is hard for me to look at that rectus femoris EMG pattern and not "believe" that the difference I was seeing was real.

Again, speaking somewhat metaphorically (and somewhat technically), the frequentist approach assumes a flat prior, which is what makes it so hard for the frequentist/NHST-style approach to deal with small samples. I would assume that a rather broad Gaussian prior, still centered at zero change, would probably make a big difference.

This is not meant to throw shade at the approach you have taken. It's excellent, easy to use/interpret, and a great contribution to the community.

0todd0000 commented 4 years ago

Agreed regarding the frequentist perspective and its limitations. I hope to incorporate Bayesian inference options in a future version of spm1d, though this may not be in a manner similar to NHST.

Thank you very much for the feedback!