0todd0000 / spm1d

One-Dimensional Statistical Parametric Mapping in Python
GNU General Public License v3.0

Number of permutations in non-parametric spm1d models #63

Closed zof1985 closed 7 years ago

zof1985 commented 7 years ago

Hi Todd,

I read the Nichols and Holmes (2002) paper as I am interested in applying a non-parametric anova3rm to acceleration data. However, I'm a little puzzled by how to select the number of permutations to perform. With my data the total number of possible permutations is very high (i.e. it would take ages to run the inference for even a single ANOVA). I would therefore like to know whether you are aware of rules of thumb or peer-reviewed references that suggest an appropriate number of permutations for a given data size.

Many thanks, Luca.

0todd0000 commented 7 years ago

Hi Luca,

As a rule-of-thumb I'd suggest a minimum of 1000 permutations, with 10,000 permutations being safer. 100,000 permutations is probably OK but seems a bit high, and 1,000,000 is almost certainly overkill.

Perhaps more important than the number of permutations is to run the analysis several times (e.g. 10 x 1000 permutations or 10 x 10,000 permutations) with different random number generator states, and then to check that the results are not qualitatively affected by the random state.

I'm not aware of any specific recommendations regarding the choice of the number of permutations. There may be some published suggestions out there, but as long as you use at least 1000 and also conduct a sensitivity analysis (like above, by re-running for a number of different random states), then I don't think any reviewer would complain.

Please let me know if you're able to find any published recommendations.

Cheers, Todd

zof1985 commented 7 years ago

Hi Todd,

Does the non-parametric .inference() method randomly generate the permutations? If not, is it possible to enter them manually? E.g. "spm1d_object.inference(alpha, nPermutations, permutationList)".

Cheers, Luca.

0todd0000 commented 7 years ago

Hi Luca,

Yes, spm1d's non-parametric inference procedure randomly generates the permutations, but only if the total number of possible permutations (N) is greater than 10,000. If N < 10,000 it iterates through all possible permutations exhaustively. When N > 10,000, it generates the permutations using np.random.permutation as follows:

np.random.permutation(J)

where J is the total number of observations.

To control the random state use np.random.seed, for example:

np.random.seed(0)

Changing the seed and repeating the analysis will yield a type of sensitivity analysis (the one discussed above).
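If it helps, here is a minimal sketch of that sensitivity check (the variable names Y, A, B, C are placeholders for your data and factor labels, and the iterations keyword follows spm1d's non-parametric examples):

import numpy as np
import spm1d

# repeat the full non-parametric analysis with 10 different random states
for seed in range(10):
    np.random.seed(seed)
    FF  = spm1d.stats.nonparam.anova3(Y, A, B, C)
    FFi = FF.inference(0.05, iterations=1000)
    print(FFi)    # check that thresholds / suprathreshold clusters are stable across seeds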

It's not possible to submit your own permutation list to the procedure, but that's a very good idea! You could achieve the same result by generating the list yourself, then cycling through it and computing the t (or F) statistic on each iteration. In fact, that's probably a good exercise: if your data are stored as a (J x 101) array Y, generate a random sequence of labels using np.random.permutation, then run the ANOVA as usual, one permutation after another, as follows:

import numpy as np
import spm1d

np.random.seed(0)

# first permutation:
i = np.random.permutation(J)    # J = total number of observations
Y_permuted = Y[i]
FF = spm1d.stats.anova3(Y_permuted, A, B, C)

# second permutation:
i = np.random.permutation(J)
Y_permuted = Y[i]
FF = spm1d.stats.anova3(Y_permuted, A, B, C)

Storing the maximum F value for each iteration (and each of the seven effects separately) should yield the same results as spm1d.stats.nonparam.anova3.
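A rough sketch of that complete procedure (hypothetical names J, Y, A, B, C; 1000 permutations; the 95th percentile of each Fmax distribution taken as the critical threshold for alpha = 0.05) might look like this:

import numpy as np
import spm1d

np.random.seed(0)
niter = 1000
Fmax  = []                                        # maximum F per iteration, per effect

for k in range(niter):
    i          = np.random.permutation(J)         # random relabelling of the J observations
    Y_permuted = Y[i]
    FF         = spm1d.stats.anova3(Y_permuted, A, B, C)
    Fmax.append([F.z.max() for F in FF])          # F.z is the F statistic field for one effect

Fmax  = np.array(Fmax)                            # shape: (niter, 7)
Fcrit = np.percentile(Fmax, 95, axis=0)           # one critical threshold per effect

The observed (unpermuted) F fields can then be compared to these thresholds.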

Todd

zof1985 commented 7 years ago

Many thanks Todd.

zof1985 commented 7 years ago

Sorry Todd,

I just reopened this issue to ask you about multiple-test adjustments for non-parametric ANOVA. Since each permutation requires a new test, would performing 10,000 permutations and taking the maximum F value also require correcting alpha for the number of permutations? Using the available Bonferroni correction this would lead to an alpha value very close to zero.

Thanks, Luca.

0todd0000 commented 7 years ago

Hi Luca,

No problem at all. There is actually just one test despite many iterations of test statistic (F) calculation.

The number of tests is equal to the number of times probabilistic inference is performed (i.e. the number of times p values are calculated). Calculating the F statistic value for each of the permutations does not involve any probability calculations. The purpose of the permutations is to iteratively generate a numerical approximation to the true, continuous probability distribution. In this case the probability distribution represents the behavior of the one-dimensional F statistic's maximum value (Fmax) under random permutation. If the data are normally distributed, that numerically estimated Fmax distribution will converge to the parametric Fmax distribution as the number of iterations approaches infinity.

Using 1000 permutations will generate a reasonably good approximation to the true Fmax distribution. 10,000 permutations will generate a slightly better approximation. 100,000 or 1,000,000 permutations will be marginally better, but numerically there will be very little difference in probability values, so 10,000 permutations is probably enough.

After the Fmax distribution is estimated, inference is conducted just once. That is, the critical F threshold is computed just once using the numerically-estimated Fmax distribution.
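Schematically (this is not spm1d's internal code; fmax and F_observed_max are just placeholder names for the permutation Fmax distribution and the observed field maximum), that single inference step amounts to:

import numpy as np

alpha = 0.05
Fcrit = np.percentile(fmax, 100 * (1 - alpha))   # critical threshold, computed just once

# a single probability is likewise computed for the observed maximum:
p = np.mean(fmax >= F_observed_max)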

Cheers, Todd

zof1985 commented 7 years ago

Thank you Todd,

I have one further question: how much time can the calculation of 10,000 permutations for a 2-way repeated-measures ANOVA take? I'm analysing a 3780 x 101 matrix containing 30 repeated measures of 6 testing conditions for each of my 21 participants. I have an i7 CPU with 16 GB of RAM, but after 1 hour the .inference() method is still running. I would therefore like to know whether this is normal or whether something is wrong.

Many thanks, Luca.

0todd0000 commented 7 years ago

Hi Luca,

It's difficult to say how long it would take, but I'd suggest running the analysis using just ten iterations, timing the duration, then multiplying by 1000 to give a rough idea of how long 10,000 iterations will take. spm1d's current non-parametric procedures are not optimized, so they unfortunately require a bit of time. In spm1d's example scripts you should see that smaller datasets (i.e. on the order of 100 observations) will likely finish running in 30 s or so.
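For example (hypothetical variable names Y, A, B, SUBJ for your two-way repeated-measures design; the iterations keyword as above):

import time
import spm1d

t0  = time.time()
FF  = spm1d.stats.nonparam.anova2rm(Y, A, B, SUBJ)
FFi = FF.inference(0.05, iterations=10)    # just 10 iterations, for timing only
dt  = time.time() - t0

print('10 iterations: %.1f s  -->  10,000 iterations: roughly %.1f min' % (dt, dt * 1000 / 60))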

You might want to try computing within-subject means to reduce the dataset by a factor of 30, and thereby reduce calculation times by a factor of probably 30x30=900 (or more). You may also want to try the analysis for a single time point, just to make sure everything is working, and to compare parametric and non-parametric results for a single point.
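A quick sketch of computing those within-subject means (assuming SUBJ and COND are 3780-element label vectors matching your (3780 x 101) array Y; the names are placeholders):

import numpy as np

Ym, Sm, Cm = [], [], []
for s in np.unique(SUBJ):
    for c in np.unique(COND):
        rows = (SUBJ == s) & (COND == c)    # the 30 trials for this subject and condition
        Ym.append(Y[rows].mean(axis=0))     # mean curve over the 30 trials
        Sm.append(s)
        Cm.append(c)

Ym = np.array(Ym)    # (126 x 101): 21 subjects x 6 conditions
Sm = np.array(Sm)    # reduced subject labels
Cm = np.array(Cm)    # reduced condition labels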

Todd

zof1985 commented 7 years ago

Many thanks Todd