Centre-IRM-INT / GT-MVPA-nilearn

GT MVPA nilearn from Marseille

More numerous data or more reliable data for MVPA? #23

Open · JeanneCaronGuyon opened this issue 2 years ago

JeanneCaronGuyon commented 2 years ago

Hi all,

It's me again! So, big question on an old topic, I guess... Should we prefer more data or more reliable data for MVPA?

Let me explain: we have chosen to go with 1 beta per trial to train and test our MVPA classifiers. A little reminder: for the VisuoTact project we get 7 trials / condition / run, we have 6 runs in total (so 42 trials / condition overall), and we use a leave-two-runs-out procedure, so each split has 4 training runs and 2 testing runs. That's great because it gives us "many" trials to train on. However, 1 beta per trial is probably super noisy.
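For concreteness, here is a minimal sketch of that leave-two-runs-out scheme using scikit-learn, on purely hypothetical data (the `betas`, `labels` and `runs` arrays below are placeholders standing in for the real VisuoTact single-trial beta maps):

```python
import numpy as np
from sklearn.model_selection import LeavePGroupsOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Hypothetical single-trial data: 2 conditions x 7 trials x 6 runs = 84 samples
rng = np.random.default_rng(0)
n_voxels = 500
betas = rng.normal(size=(84, n_voxels))    # one (flattened) beta map per trial
labels = np.tile(np.repeat([0, 1], 7), 6)  # 7 trials per condition in each run
runs = np.repeat(np.arange(6), 14)         # run index of each trial

# Leave-two-runs-out: 4 training runs and 2 testing runs in every split
cv = LeavePGroupsOut(n_groups=2)
clf = make_pipeline(StandardScaler(), LinearSVC())

scores = cross_val_score(clf, betas, labels, groups=runs, cv=cv)
print(f"trial-level betas: {len(scores)} splits, mean accuracy = {scores.mean():.3f}")
```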

Now, if we take 1 beta / run (per condition), we end up with 7 times less data, but each of these "averaged" betas should be more reliable, more robust, less noisy. But would that lead to a less accurate estimate of the hyperplane that separates our conditions?
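As a rough sketch of that alternative (continuing the hypothetical example above, and approximating the run-wise GLM by simply averaging the single-trial betas within each run and condition, whereas a proper run-wise GLM would estimate those betas directly):

```python
# Average the 7 single-trial betas within each (run, condition) pair:
# 84 trial-wise samples -> 12 run-wise samples (6 runs x 2 conditions)
run_betas, run_labels, run_groups = [], [], []
for run in np.unique(runs):
    for cond in np.unique(labels):
        mask = (runs == run) & (labels == cond)
        run_betas.append(betas[mask].mean(axis=0))
        run_labels.append(cond)
        run_groups.append(run)

scores_run = cross_val_score(clf, np.vstack(run_betas), run_labels,
                             groups=run_groups, cv=cv)
print(f"run-level betas: mean accuracy = {np.mean(scores_run):.3f}")
```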

Below is an image taken from Martin Hebart's class.

I can see the potential differences, but what would be your wild guess on which to choose? Why did we historically decide to go for one beta per trial instead of keeping our good old one-beta-per-run GLM? And maybe more specifically in the context of our study, with 6 runs and 7 trials / condition / run?

Thanks for your input! Jeanne

[Screenshot attached: Capture d’écran 2022-03-08 à 14 46 23]
SylvainTakerkart commented 2 years ago

Hi, the RMN reminds me that I've never replied... Sorry!

The title you chose for this issue perfectly summarizes the dilemma, for which, of course, there is no definitive answer; as usual, the best solution is a compromise between the number of data points and their quality! Your post forgets only one thing: early on, lots of MVPA studies actually used single-TR BOLD images as inputs... So the fact that lots of people now use single-trial beta maps already represents a good solution to this sought-after compromise between, 1. at one extreme, single-TR BOLD images, and 2. at the other extreme, the "one beta per run" solution.

Another element: don't forget that, in principle, a classifier (or more generally a machine learning model) will be good when we have a good estimate of the distribution of the underlying data (the separator used to classify is only a characteristic driven by the distributions of the two classes)... So even if the algorithm used to estimate the separator / classifier does not explicitly try to estimate the underlying distributions of the two classes, a sound intuition is that "the more data points the better" and that "capturing the characteristics of the noise, i.e. the distribution around the mean, is probably good for you"... [but, as you say, if the data is too noisy, it might be a mess ;) ]

Finally, the transition from "single TR images" to "single-trial beta maps" has another advantage: it gets rid of the strong temporal correlation that affects the input data when you use "single TR images"! Since having independent data points is of great importance when training ML models, "single-trial beta maps" are THE solution that gives you the most numerous data points while ensuring (approximately) the independence of your observations (and, on top of this, you get rid of a good amount of the noise that is present in "single TR images").

SylvainTakerkart commented 2 years ago

Pragmatically, if you want to experiment with this sought-after compromise, it could be done this way:
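One possible way to set this up (just a sketch, not necessarily the exact procedure meant here, and reusing the hypothetical `betas`, `labels`, `runs`, `clf` and `cv` objects from the sketches above) is to average the single-trial betas in chunks of increasing size within each run and condition, and to compare decoding accuracies across chunk sizes, from 1 (single trials) to 7 (one beta per run and condition):

```python
def chunked_betas(betas, labels, runs, chunk_size):
    """Average single-trial betas in chunks of `chunk_size` within each run/condition."""
    X, y, g = [], [], []
    for run in np.unique(runs):
        for cond in np.unique(labels):
            trials = betas[(runs == run) & (labels == cond)]
            n_chunks = len(trials) // chunk_size  # leftover trials are dropped
            for i in range(n_chunks):
                X.append(trials[i * chunk_size:(i + 1) * chunk_size].mean(axis=0))
                y.append(cond)
                g.append(run)
    return np.vstack(X), np.array(y), np.array(g)

for chunk_size in (1, 2, 3, 7):  # 1 = trial-level betas, 7 = run-level betas
    X, y, g = chunked_betas(betas, labels, runs, chunk_size)
    acc = cross_val_score(clf, X, y, groups=g, cv=cv).mean()
    print(f"chunk_size={chunk_size}: {X.shape[0]} samples, mean accuracy = {acc:.3f}")
```

On real data (rather than the random placeholders used in the sketches), such a curve would show where the trade-off between the number of samples and their reliability lands for this particular dataset.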