Reproducing results for PhysionetMI? #640

trialan commented 2 months ago

I'm switching to MOABB for better comparing my results with other papers, so thanks for this project!

Problem: Here we have the paperswithcode leaderboard for PhysionetMI left vs right hand classification. Is the code for these numbers public? I know the numbers are given in the moabb paper but that's not super useful for me.

It's easy to get numbers for, e.g., BNCI_2014_001, as in this moabb docs tutorial. But swapping out BNCI_2014_001 for PhysionetMI breaks this code:

paradigm = LeftRightImagery()
# Because this is being auto-generated we only use 2 subjects
dataset = BNCI2014_001()  # swap this for PhysionetMI
dataset.subject_list = dataset.subject_list[:2]
datasets = [dataset]
overwrite = True  # set to True if we want to overwrite cached results
evaluation = CrossSessionEvaluation(
    paradigm=paradigm, datasets=datasets, suffix="examples", overwrite=overwrite
)

results = evaluation.process(pipelines)


I get this error:

Exception: No datasets left after paradigm
            and evaluation checks
> /Users/thomasrialan/Documents/code/venvs/eegenv/lib/python3.12/site-packages/moabb/evaluations/
    132             self.datasets = datasets
    133         else:
--> 134             raise Exception(
    135                 """No datasets left after paradigm
    136             and evaluation checks"""

I think this may have something to do with the fact that in PhysionetMI one subject was sampled at a different frequency (I think it's 160 Hz for every subject except one), as noted in this issue by @sylvchev.

So is the current status of PhysionetMI broken, or is there something obvious I'm missing? Somehow the numbers for the leaderboard were generated so I imagine it's the latter.

I have downloaded all the data in the relevant mne folder so I don't think that's the issue.
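
Note: the exception text says every dataset was dropped by either the paradigm check or the evaluation check. As far as I know, PhysionetMI has a single session per subject while BNCI2014_001 has two, so a cross-session evaluation would have nothing to split on, regardless of what has been downloaded. The toy model below is a simplified stand-in for that filtering logic, not MOABB's actual code: `ToyDataset`, `filter_datasets`, and the three check functions are hypothetical names, and the session counts are my assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyDataset:
    """Hypothetical stand-in for a dataset's metadata (not a MOABB class)."""
    code: str
    n_sessions: int
    events: tuple


def left_right_ok(ds):
    # Paradigm check: both hand classes must be available.
    return {"left_hand", "right_hand"} <= set(ds.events)


def cross_session_ok(ds):
    # Evaluation check: training on one session and testing on another
    # needs at least two sessions.
    return ds.n_sessions > 1


def within_session_ok(ds):
    # Evaluation check: a single session is enough.
    return True


def filter_datasets(datasets, paradigm_check, evaluation_check):
    """Keep datasets that pass both checks; fail if none remain."""
    kept = [d for d in datasets if paradigm_check(d) and evaluation_check(d)]
    if not kept:
        raise Exception("No datasets left after paradigm and evaluation checks")
    return kept


# Assumed session counts: 1 for PhysionetMI, 2 for BNCI2014_001.
physionet = ToyDataset("PhysionetMI", 1, ("left_hand", "right_hand", "feet", "hands", "rest"))
bnci = ToyDataset("BNCI2014_001", 2, ("left_hand", "right_hand", "feet", "tongue"))

print(filter_datasets([bnci], left_right_ok, cross_session_ok))
print(filter_datasets([physionet], left_right_ok, within_session_ok))
try:
    filter_datasets([physionet], left_right_ok, cross_session_ok)
except Exception as err:
    print("PhysionetMI + cross-session:", err)
```

If this is what happens, swapping CrossSessionEvaluation for WithinSessionEvaluation should let the dataset through.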

PierreGtch commented 2 months ago

Hi @trialan, could you send the complete script you tried to execute which resulted in this error?

trialan commented 2 months ago

Thanks Pierre, I've made a bit of progress on this (it actually was that I was missing some data). Now the results of this script:

from mne.decoding import CSP
from pyriemann.estimation import Covariances
from pyriemann.tangentspace import TangentSpace
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

import moabb
from moabb.datasets import BNCI2014_001, PhysionetMI
from moabb.evaluations import CrossSessionEvaluation, WithinSessionEvaluation
from moabb.paradigms import LeftRightImagery, MotorImagery

pipelines = {}
pipelines["CSP+LDA"] = make_pipeline(CSP(n_components=30), LDA())
dataset = PhysionetMI()
paradigm = MotorImagery()
datasets = [dataset]
overwrite = True  # overwrite cached results
evaluation = WithinSessionEvaluation(
    paradigm=paradigm, datasets=datasets, suffix="examples", overwrite=overwrite
)
results = evaluation.process(pipelines)

are consistent with the reported results in Table D1 of the moabb paper.

However I wonder how to get a score for only some categories, as table D1 is for all categories, but the paperswithcode leaderboard is only left vs right hand. So I try to run this script instead:

# every other line stays the same
paradigm = MotorImagery(events=["left_hand", "right_hand"])

And see this error:

TypeError: '<' not supported between instances of 'int' and 'NoneType'

So in short my question is: how can I get left vs right hand accuracy for the CSP+LDA pipeline on the PhysionetMI dataset?

PierreGtch commented 2 months ago

For the left vs. right hand results, you should use:

from moabb.paradigms import LeftRightImagery
paradigm = LeftRightImagery()

trialan commented 2 months ago

Right, I wondered if that may be it, but when I run this script

from mne.decoding import CSP
from pyriemann.estimation import Covariances
from pyriemann.tangentspace import TangentSpace
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

import moabb
from moabb.datasets import BNCI2014_001, PhysionetMI
from moabb.evaluations import CrossSessionEvaluation, WithinSessionEvaluation
from moabb.paradigms import LeftRightImagery, MotorImagery


pipelines = {}

pipelines["CSP+LDA"] = make_pipeline(CSP(n_components=30), LDA())

dataset = PhysionetMI()
paradigm = LeftRightImagery()
datasets = [dataset]
overwrite = True  # overwrite cached results
evaluation = WithinSessionEvaluation(
    paradigm=paradigm, datasets=datasets, suffix="examples", overwrite=overwrite
)

results = evaluation.process(pipelines)

The output is

CSP+LDA    0.566848
Name: score, dtype: float32

This does not match the 65.74% reported on paperswithcode (that figure is exactly the second row of table D2). To be fair, 56.6% is only ~0.5 standard errors away from the number quoted in the paper (SE=17.37 in table D2 for CSP+LDA). So is this expected behaviour?

I ran it a few times and got these scores: 56.67, 57.09, 57.49. Seems pretty clear that the standard error is not 17.37% so I wonder what is going on here.
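
To put numbers on the comparison above, here is a quick sketch using only the figures quoted in this comment. Reading the 17.37 as a spread across subjects rather than across repeated runs is an assumption; the thread does not say which it is.

```python
import statistics

reported_mean = 65.74    # table D2 value quoted above, in percent
reported_spread = 17.37  # spread quoted above for CSP+LDA, in percent
observed = 56.68         # 0.566848 from the run above, in percent
reruns = [56.67, 57.09, 57.49]  # scores from the repeated runs, in percent

# Gap between the paper's number and this run, in units of the reported spread.
gap_in_spreads = (reported_mean - observed) / reported_spread

# Run-to-run variability of the repeated evaluations.
rerun_mean = statistics.mean(reruns)
rerun_std = statistics.stdev(reruns)

print(f"gap: {gap_in_spreads:.2f} reported spreads")
print(f"re-run mean: {rerun_mean:.2f}, re-run std: {rerun_std:.2f}")
```

The gap comes out at roughly 0.5 of the reported spread, while the re-runs differ from each other by only about 0.4 percentage points. If the 17.37 measures between-subject variation, it would not be expected to show up as run-to-run noise, so a stable 9-point gap more likely points to a difference in the pipeline configuration.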

I also don't think it's about the number of CSP components as suggested by this paper and my experiments (I'm double checking this is true in moabb too, but I think it will be).

EDIT: Running with n=5 CSP components actually gave closer numbers to that quoted in table D2. I got: 64.74, 64.71, 65.70. Is it fair to assume that the results in D2 were obtained by doing some sort of parameter search and keeping the best? Sort of like what's described in appendix A (table A1)?

Maybe I need to run evaluation.process instead or something?

PierreGtch commented 2 months ago

You can find the implementation details of the pipelines we used for the benchmark in the pipelines folder. The CSP + LDA one is here: