LIMO-EEG-Toolbox / limo_tools

Hierarchical Linear Modelling for MEEG data
https://limo-eeg-toolbox.github.io/limo_meeg/

Bug treating two-sided tests (or maybe other non F stats?) #75

Open NirOfir opened 3 years ago

NirOfir commented 3 years ago

Hi, thanks in advance for your precious time:

I'm testing an interaction of a categorical variable and a continuous variable by using "split regression" plus a contrast (something like [0 0 0 1 -1], I hope this makes sense) and a one-sample t-test. The mean beta timecourses were pretty striking (and made sense given the literature), so I was surprised that nothing survived the cluster bootstrap correction. Here are the kernel density estimates of the H0 bootstrap (red) and the observed contrast (blue); they look fine, and there is a prominent negative lobe for the observed contrast:

(figure: kernel density estimates of the H0 bootstrap and the observed contrast)

This convinced me further that there is a true effect, so I looked into the code more closely, and I saw that in limo_cluster_test() the observed cluster sums are only saved if they are larger than the largest H0 cluster. This makes perfect sense for F stats, but when looking at t values it means that significant negative t's are disregarded. I patched in two abs() there as a quick local fix, but as I remember it, for two-sided tests negative and positive clusters need to be treated separately.
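
To illustrate what I mean with toy numbers (this is not LIMO code, just the logic of the check):

% toy example: a strong negative cluster has a large-magnitude but negative
% cluster mass, so a "larger than the H0 maximum" check can never keep it
obs_t = [-4 -5 -4.5];              % observed t values inside one cluster
obs_mass = sum(obs_t);             % = -13.5, the cluster mass is negative
h0_max_mass = 9.2;                 % say, the largest H0 cluster mass

kept_F_style = obs_mass > h0_max_mass;         % false -> the effect is lost
kept_two_sided = abs(obs_mass) > h0_max_mass;  % true  -> the effect is kept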

I think this issue is actually more widespread. For instance, the plotting functions also rely on the max of the (masked) statistics, which will end up choosing the wrong timepoints/channels when we are looking at negative effects.

What do you think?

NirOfir commented 3 years ago

And another question (out of curiosity) about limo_cluster_test(): line 67 converts the mask from double to logical, but when it is used in line 72, it is used with element-wise multiplication rather than indexing. Is multiplication faster than indexing? Also, if the mask is logical, I think using any(mask2(:)) in the condition on line 71 will be faster than sum(mask2(:)) > 0, though I don't believe the difference will be more than a few ms at best :)

It does seem to be the case in this test:

% benchmark: any() vs sum()>0 on a logical mask, and logical indexing vs
% element-wise multiplication
mat  = randn(1000) > 0;        % logical mask
mat2 = randn(1000);            % data to be masked

log_any = @() any(mat(:));
log_sum = @() sum(mat(:)) > 0;

t_any = timeit(log_any);
t_sum = timeit(log_sum);

log_ind = @() mat2(mat);
log_mul = @() mat2 .* mat;

t_ind = timeit(log_ind);
t_mul = timeit(log_mul);

CPernet commented 3 years ago

we pass along t^2 (=F) values so we do not have the issue you mentioned -- note all of those have been tested and validated under H0, giving an exact type 1 error rate (update pushed now: https://github.com/LIMO-EEG-Toolbox/limo_tools/commit/8d9c2b989578871f9896fe1a307edab1b2b790d6#diff-c304b60c18e91918c9a9be4e80394e42bfcc8b40a03fb3c99bd32364fcc61daf)
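
For reference, a quick check (assuming the Statistics Toolbox is available) that testing t^2 against the one-sided F(1,df) threshold is the same as a two-sided t test:

% t^2 ~ F(1,df), so the one-sided F critical value equals the squared
% two-sided t critical value (requires the Statistics Toolbox)
df = 18;                            % example degrees of freedom
alpha = 0.05;
t_crit = tinv(1 - alpha/2, df);     % two-sided t critical value
f_crit = finv(1 - alpha, 1, df);    % one-sided F(1,df) critical value
fprintf('t_crit^2 = %.4f, f_crit = %.4f\n', t_crit^2, f_crit);   % identical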

CPernet commented 3 years ago

any might indeed be faster than sum - any speed improvement is good

NirOfir commented 3 years ago

Thanks, I'll make a PR for the small speed changes. About the t^2: the cluster-finding function does work on the parametric p-values (at least that's what I understand from line 33 in limo_cluster_test.m), so that quantity is non-negative. But the rest of limo_cluster_test relies on the summed statistic (line 54), and when I put a debug point there, ori_f contains negative values too... I think I will need to dig deeper to see where the transformation to t^2 happens. Do you have a pointer for me? Also, if we use t^2, there is theoretically an unlikely possibility that two adjacent observations, one with a significant (in terms of cluster-forming threshold) negative t and one with a significant positive t, get lumped into the same cluster, right?
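
To make that last point concrete (toy numbers, not LIMO code):

% toy example: with t^2, two adjacent significant t values of opposite sign
% end up in the same cluster
t = [0.5 -3.2 3.1 0.4];            % adjacent significant negative and positive t
th = 2.1;                          % toy cluster-forming threshold on |t|
above = (t.^2) > th^2;             % thresholding the squared statistic
d = diff([0 above 0]);             % contiguous runs of above-threshold points
starts = find(d == 1);
stops = find(d == -1) - 1;
fprintf('cluster 1 spans samples %d to %d\n', starts(1), stops(1));  % 2 to 3: mixed signs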

CPernet commented 3 years ago

oh crap - it is all called in limo_stat_values, which was recently updated as it was growing out of hand

CPernet commented 3 years ago

here is the ref to take it from https://github.com/LIMO-EEG-Toolbox/limo_tools/blob/v2.0/limo_stat_values.m#L410 and fixed here https://github.com/LIMO-EEG-Toolbox/limo_tools/blob/HotFixes/limo_stat_values.m#L245

CPernet commented 3 years ago

@NirOfir please confirm this - I actually already merged to master as this is a big bug. Do put your name down in the contributors list, and I'll make a new release making sure users get notified! thx

NirOfir commented 3 years ago

Thanks a lot! I hope I'll manage to test it today. If not, the next time I'll be next to a computer will be in a week and a half.

NirOfir commented 3 years ago

OK, that helps a lot. However, I think I might be seeing an "artifact" of the squaring:

(figure: durXresp, cluster-corrected)

See the three positive observations for O1 around 130 ms. I think we wouldn't want a cluster to cover both negative and positive effects. What do you think?

Unrelated question: To which branch should I make the small speed PR? Hot fixes?

CPernet commented 3 years ago

yes, PR to hotfixes, thx. As for +/-: because we use t^2 it cannot tell the difference, so I think this is ok (it means you could have e.g. a big difference over both the N1 and the P2 with no breaking point - but usually clusters break because of the 0 crossing)

I do have some half-baked code for something related, to break down huge clusters (not public) - email me if you want to collaborate on that (and have some time; it is not massive to do though)

NirOfir commented 3 years ago

I made a PR for the speed improvement.

As for the +/-, I understand why clusters will usually break (both over time and over space, changes should be smooth relative to our temporal/spatial sampling frequency), but I still think I get such a combined cluster in my case. It is represented by the three red ticks in this figure (highlighted in the red circle), correct?

(figure: durXresp cluster-corrected results, with the combined cluster circled in red)

CPernet commented 3 years ago

this is an edge case, but yes, you are right (though as mentioned I do have some half-baked code for something related, to break down huge clusters - not public)

CPernet commented 3 years ago

and the fix I pushed works, right?

NirOfir commented 3 years ago

Oh, now I understand, thanks. The fix solves the specific issue of the significance and masking. It does not solve another issue: the topography and timecourse are chosen using the maximum of the original, not the squared, statistic. You can see in the figure I posted that they are clearly not representative of the actual cluster.

CPernet commented 3 years ago

the fix I posted uses the square --> https://github.com/LIMO-EEG-Toolbox/limo_tools/blob/HotFixes/limo_stat_values.m#L245 ; of course if you call the function directly, it's up to you to square the values

NirOfir commented 2 years ago

Hi, sorry for taking such a long time to answer. The "bug" is still there: limo_display_results still chooses based on the original t stat. I think the reason is that the squaring happens only directly before the clustering function inside limo_stat_values(), so when limo_display_results() is called it uses the original stats, not the squared ones that were used for clustering. Actually, I would prefer the plotting function to still display the original stats, but to choose based on the p-values. That way you would see both the magnitude and the sign of the largest effect. What do you think?
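
Something along these lines is what I have in mind (a rough sketch with made-up variable names, not the actual limo_display_results code):

% pick the channel/time point from the corrected p values, but display the
% original signed t statistic at that point
t_stat = randn(64, 200);                   % toy signed t values (channels x frames)
p_corr = rand(64, 200);                    % toy cluster-corrected p values

[~, idx] = min(p_corr(:));                 % most significant point, sign-blind
[ch, frame] = ind2sub(size(p_corr), idx);
fprintf('peak effect: channel %d, frame %d, t = %.2f\n', ch, frame, t_stat(ch, frame));
plot(t_stat(ch, :));                       % the displayed timecourse keeps its sign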

CPernet commented 2 years ago

ok, that's not a bug -- the clustering does what it is supposed to; the display puts positive and negative together in the same cluster

I would much rather leave that as is so users are aware of this -- maybe make a wiki page to explain that behaviour

NirOfir commented 2 years ago

I see, thanks. I would like to write some code to do this the negative/positive clusters way (so that you can also test one-tailed), and for that I want to create a separate branch (I will do it in the most encapsulated way I can, so I don't mess up other things). Because hotfixes includes a couple of commits that are relevant, I would like to ask you to merge it into master. Once you merge it, I'll open a new branch on my fork and get going. Many thanks for your patience.

CPernet commented 2 years ago

one-tailed what? Your zero crossing is a function of your reference, so to me this has no meaning - you would test only positive, then only negative, thus increasing your type 1 error; the only way to control the type 1 error is over the whole space simultaneously

fieldtrip, for instance, does clusters of abs(positive values) + clusters of abs(negative values); this means you still end up summing positive and negative clusters (like in https://github.com/LIMO-EEG-Toolbox/limo_tools/blob/master/limo_tfce.m#L391), which we showed to be similar to using squares

NirOfir commented 2 years ago

Thanks for the quick reply! I meant a one-tailed hypothesis test. Looking at the FT code, it looks to me like the negative and positive clusters are treated separately consistently along the way, and then either the p values or alpha are multiplied so that the error rate is appropriately controlled (this is explicitly stated in https://github.com/fieldtrip/fieldtrip/blob/c25f1ed286bcd2e7e4472bb3848cead383c06a02/ft_statistics_montecarlo.m#L390). I think what I mean is exactly what you described: separate the data into negative and positive, compare each to the right threshold while making sure to correct the p values or alpha, and then the clusters aren't mixed. Is that correct, or am I missing anything else?

CPernet commented 2 years ago

so really there is just the need of a wrapper - take all negative values (observed and null) and apply the limo code at alpha/2, do the same for positive values, and put it together?

NirOfir commented 2 years ago

Yep, pretty much. I think I can do it all inside limo_clustering(), without changing limo_findcluster() or limo_cluster_test(). Actually, because I don't need those, I think I can open a branch and start working now, and there shouldn't be conflicts with merging hotfixes later... I'll start with it now and let you know!

CPernet commented 2 years ago

maybe; I wonder whether starting from limo_stat_values would not make more sense (as another/alternative way to test, besides t^2)

NirOfir commented 2 years ago

Hmmmm... I think not, because the bootstrapping happens inside limo_clustering(), and I would prefer to use the same bootstraps for the negative and positive H0 clusters. But I'll think of a way to also allow the t^2 solution in limo_stat_values().

EDIT: Sorry, you're right, I should probably start from limo_stat_values :)

EDIT 2: OK, since there wasn't a separation between the cluster-forming threshold and the critical value for the corrected p values (I need this separation to implement the separate negative/positive clusters code), and because this separation is also relevant for limo_cluster_test(), which was changed in hotfixes, I need to ask you to merge it into the master branch, otherwise we'll get conflicts...

CPernet commented 2 years ago

the bootstrap gives you all the t and p values - the clustering uses a threshold to create clusters, so you can start from the bootstrapped null data in limo_stat_values; no need to merge to master - branch from hotfixes

NirOfir commented 2 years ago

That is true, but I don't think it's enough. Right now limo_cluster_test() uses the same threshold to find which of the clusters are significant (after the clusters were formed). When I separate into negative and positive clusters, I want to form the clusters using a threshold of 0.05 (both in the observed and the null data), and then check whether the most extreme observed cluster has probability less than 0.025 in the null cluster distribution. I don't see how I can do that without editing limo_cluster_test() a bit, because it uses the same threshold to form the clusters in the observed data and to check whether the observed clusters are significant.

CPernet commented 2 years ago

seems to me that from limo_stat_values you call limo_clustering twice:

posM = M .* (M > 0);
posbootM = bootM .* (bootM > 0);
[maskp, cluster_pp, max_thp] = limo_clustering(posM, P .* (M > 0), posbootM, bootP .* (bootM > 0), LIMO, MCC, p/2, fig);

negM = M .* (M <= 0);
negbootM = bootM .* (bootM <= 0);
[maskn, cluster_pn, max_thn] = limo_clustering(negM, P .* (M <= 0), negbootM, bootP .* (bootM <= 0), LIMO, MCC, p/2, fig);

although you are right that it would be more efficient to 1) put an argument/option in limo_stat_values to do that, and 2) run the same as proposed above within the same bootstrap loop (avoiding calling twice) - i.e. in limo_clustering, pass an argument saying you want to split rather than use it all, and call limo_cluster on the positive values only and then on the negative values only

NirOfir commented 2 years ago

Thanks for the help! This is basically what I was trying to do, but I noticed that the "p" argument passed to limo_clustering() (and on to limo_cluster_test() inside) controls both thresholds: the one forming the clusters themselves from the uncorrected p values, and the one thresholding the cluster statistic based on the permutation distribution. I thought we should maybe separate them into two arguments, "cluster_th" and "alpha_v" or something of the sort. To make sure it doesn't change how the rest of the code behaves, I thought of having alpha_v default to cluster_th, so it can be left empty without an issue.
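
In code, the defaulting I have in mind is just something like this (a made-up helper name, not actual LIMO code):

function alpha_v = default_alpha(cluster_th, alpha_v)
% hypothetical helper: if no separate critical alpha is given, fall back to
% the cluster-forming threshold so existing single-threshold calls behave
% exactly as before
if nargin < 2 || isempty(alpha_v)
    alpha_v = cluster_th;
end
end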

I also worked on the second solution, but the limo_stat_values() solution seemed somewhat easier for now.

CPernet commented 2 years ago

nope, you need to use the same thresholds when using cluster mass or it doesn't work -- we tested this when preparing the cluster paper (e.g. form at 0.01 and test at 0.05 -- the type 1 error is then totally wrong)

note that for non-cluster-mass approaches it (kind of) works because you use different properties - see the Eklund paper for fMRI

NirOfir commented 2 years ago

Huh, wow, thanks, that's really interesting to know. I remembered from Maris & Oostenveld (2007) that the two are practically independent, but they didn't really have simulations, and I don't remember them saying much about how it relates to different cluster statistics... Thanks for filling a major gap in my knowledge! That makes it simpler then :) By the way, the paper you mean is the 2015 one?

CPernet commented 2 years ago

yep, but I don't think we put that in the paper (you know, negative results - even though it is indeed a massive one; nowadays I would put it in an appendix for sure - maybe it's in the figshare https://figshare.com/account/projects/1277/articles/1008311 but it doesn't look like it)

NirOfir commented 2 years ago

I see... the figshare link gives me an empty page. Anyway, I'm almost done. I did find one more thing though: we need to set the uncorrected p values to 1 for the time/space points we don't want to cluster. Otherwise, if we multiply by M > 0, all the p values where M < 0 are switched to 0 and will necessarily be included in the clusters... I think that might be one of the reasons FT uses the critical value of the statistic when forming clusters instead of the uncorrected p values.
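
Concretely, something like this (toy data, variable names following the snippet above):

% keep only the positive tail, and neutralise the excluded points by setting
% their uncorrected p values to 1 rather than 0, so they can never pass the
% cluster-forming threshold
M = randn(64, 200);                % toy signed statistics (channels x frames)
P = rand(64, 200);                 % toy uncorrected p values

posM = M .* (M > 0);
posP = P;
posP(M <= 0) = 1;                  % NOT posP = P .* (M > 0), which would give p = 0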

NirOfir commented 2 years ago

OK, I'm done! Here are the results of the new code. Basically the same, except that now no positive samples survive the correction:

(figure: two-tailed cluster-corrected results)

I also found another small issue, but I feel solving it is much too complicated for me right now: limo_color_images() uses the plotted data to define the colormap. I kind of like the approach, but I think in this case it's not great, because it seems more reasonable to me to choose the colormap based on what the possible values are, rather than what they happened to be. For t stats, I would always use a divergent scheme, even if the actual data contained only positive (or only negative) values.
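
For example (a generic MATLAB sketch, not limo_color_images itself):

% choose a symmetric, zero-centred colour scale for a t map, so a divergent
% colormap stays anchored at zero even if the observed values are one-signed
t_map = abs(randn(64, 200));       % toy map that happens to be all positive
imagesc(t_map);
cmax = max(abs(t_map(:)));
caxis([-cmax cmax]);               % symmetric limits around zero
colorbar;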

Anyway, I just made the PR. I hope that the code I tried to clean isn't going to mess up other things.

CPernet commented 2 years ago

you should be able to call it to change the colour scale - or maybe we can make a new function to call from the command line, limo_color_rescale(), so as to have the main figure display updated? what do you think of that

NirOfir commented 2 years ago

I see limo_color_images() only uses the stat values map to choose colors, so I don't think I can use it to update the colormap... Adding another function is possible, but it was easier for me to just write a couple of lines into the script as an ugly patch.

I was thinking of something more ambitious, but I'm not sure how practical it is: right now, LIMO uses all sorts of bits of info collected from different places in order to run (e.g. deciding how to load the data based on the file name in limo_stat_values()), and then the different choices are never really collected anywhere, so it's not trivial to reconstruct the pipeline after running it. For example, I don't think you can know for certain what the threshold was after analysing, because as far as I can tell it only exists internally in limo_random_select(). If we made the LIMO structure into a more general structure, all of those things could be collected there (maybe with a "test" field, a struct array with one entry per test?). This might also make some of the functions easier to handle, because they would just rely on the properties of the LIMO object. Actually, I really think of it as a kind of OOP issue. But that is a ton of work and would take a lot of time to implement...
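
For example, something like this (purely hypothetical field names, just to show the shape of the idea):

% record every test that was run as one entry of a struct array inside LIMO
LIMO = struct();                   % stand-in for the real LIMO structure
LIMO.test(1) = struct('name', 'one_sample_contrast', 'MCC', 'cluster', ...
                      'cluster_forming_p', 0.05, 'alpha', 0.05, 'date', datestr(now));
LIMO.test(2) = struct('name', 'regression', 'MCC', 'tfce', ...
                      'cluster_forming_p', NaN, 'alpha', 0.05, 'date', datestr(now));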

CPernet commented 2 years ago

validation needed for merging the branch? meeting needed to discuss how to do it?

NirOfir commented 2 years ago

Yes, that sounds right. Sorry for disappearing, I'm busier than usual now, but I hope I can get back to you at some point to wrap this up. Thanks for all the help!

CPernet commented 2 years ago

I prepared some test cases to simulate (on paper) for a single channel -- and will implement that for testing.

Your case was initially driven by spatially different effects all disappearing because of using T^2, right? Can you explain that again so I can also simulate it - and then we can compare T^2 vs the assembled +/- testing and make recommendations.

NirOfir commented 2 years ago

Actually, the t^2 worked just fine in terms of finding effects (it found clusters that were practically identical). However, some of the clusters contained both positive and negative t values, which I didn't "like", because they seemed to me to be qualitatively different things that should not contribute to the same cluster.