TNTLFreiburg / braindecode

Outdated, see new https://github.com/braindecode/braindecode
BSD 3-Clause "New" or "Revised" License

Experimental results for dataset BCI IV 2b #45

Closed · erap129 closed this issue 5 years ago

erap129 commented 5 years ago

Hello, according to Table VI (decoding results for additional datasets) in your paper, the deep ConvNet should reach a kappa of 0.598 and the shallow ConvNet 0.899. I tried replicating this result, both with my own implementation and by adapting your example files "bci_iv_2a.py" and "bci_iv_2a_cropped.py". To do so I created my own Monitor class which measures the validation kappa, and, analogous to your NoDecrease class, a NoIncrease one (all of this just to use the validation kappa for early stopping). With the deep architecture and cropped training I get an average test kappa of 0.566, and with the shallow architecture 0.545 (nowhere near the reported 0.899).

I also found I had to change the function 'setup_after_stop_training' in experiment.py. There is a reassignment of self.stop_criterion there, and if I don't change it, training runs for only 1-3 epochs in the second phase and then stops, giving me a test kappa of 0. I made self.stop_criterion the same as the one defined in the main example file, only with higher max_epochs and max_increase_epochs counts.

On top of that, the results for the shallow network are suspicious: all subjects except subject 1 receive exactly the same maximum kappa. For the deep network it is more diverse, but values still tend to repeat.
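For reference, a NoIncrease criterion of the kind described above can be sketched framework-independently like this. This is a hypothetical standalone version (class and method names are my own), not braindecode's actual stop-criterion API:

```python
class NoIncrease:
    """Stop when a monitored metric (e.g. validation kappa) has not
    increased for `max_epochs_no_increase` consecutive epochs.
    Mirror image of a NoDecrease criterion used for losses."""

    def __init__(self, max_epochs_no_increase):
        self.max_epochs_no_increase = max_epochs_no_increase
        self.best = float("-inf")
        self.epochs_since_best = 0

    def should_stop(self, metric_value):
        # Reset the patience counter on any new best value.
        if metric_value > self.best:
            self.best = metric_value
            self.epochs_since_best = 0
        else:
            self.epochs_since_best += 1
        return self.epochs_since_best >= self.max_epochs_no_increase
```

Called once per epoch with the validation kappa, it returns True as soon as the metric has stagnated for the configured number of epochs.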

Am I missing anything? I would like to replicate the results for this dataset but so far haven't succeeded.

Additionally: are the reported results the maximum value of the respective column of the 'epochs_df' variable, or is the last value taken? From my runs, the last value usually isn't the highest measured.

Thank you very much, Elad

robintibor commented 5 years ago

First, just to make sure, you are now working on BCI Comp IV 2b not 2a, right?

1) The shallow results always being identical is very strange to me; it really should not be like that. 2) The stop criterion is admittedly a bit overcustomized at the moment. We have mostly switched to single-phase training with cosine annealing nowadays.

About the kappa computation: I don't know how you compute kappa, but it is not simply kappa over trials(!) for this competition. My understanding of how kappa was computed (independently for each subject) is:

1) From the per-timestep predictions for each trial (over the entire 4 s), compute kappa over trials independently for each timestep(!). This gives you a kappa "timecourse": a kappa value for the first timestep, for the second timestep, etc. 2) Take the maximum(!) kappa value from this timecourse.

I gathered that this is the way it was done from:

1) http://bbci.de/competition/iv/desc_2b.pdf, see p. 5, Evaluation (the same holds for dataset 2a!). Although it remains a bit unclear there what "with the largest kappa value" means. 2) https://www.researchgate.net/profile/Cuntai_Guan/publication/40450440_Multi-class_Filter_Bank_Common_Spatial_Pattern_for_Four-Class_Motor_Imagery_BCI/links/561cd9b908ae6d17308d3a9d/Multi-class-Filter-Bank-Common-Spatial-Pattern-for-Four-Class-Motor-Imagery-BCI.pdf (competition winners), see p. 573, Figure 3. Here it is quite clear in my view. This is on 2a, but as written above, 2a and 2b follow the same evaluation guidelines. 3) The BioSig source code. As far as I recall, I verified my understanding by looking directly at the source where the evaluation code was still present. Feel free to recheck.
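The timestep-wise kappa evaluation described above can be sketched as follows. This is my own minimal illustration (function names and array shapes are assumptions), with Cohen's kappa implemented inline so the snippet is self-contained:

```python
import numpy as np

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    labels = np.union1d(y_true, y_pred)
    po = np.mean(y_true == y_pred)  # observed agreement
    pe = sum(np.mean(y_true == c) * np.mean(y_pred == c)
             for c in labels)       # chance agreement
    return (po - pe) / (1 - pe)

def max_kappa_over_timesteps(y_true, preds):
    """Competition-style evaluation sketch.

    y_true: (n_trials,) true label per trial.
    preds:  (n_trials, n_timesteps) predicted label at each timestep
            of each trial.
    Computes kappa over trials independently at every timestep,
    then returns the maximum of that kappa timecourse.
    """
    timecourse = np.array([cohen_kappa(y_true, preds[:, t])
                           for t in range(preds.shape[1])])
    return timecourse.max(), timecourse
```

Note the contrast with a single kappa over 320 trial-level labels: here one kappa value is produced per timestep, and only the best one enters the reported score.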

erap129 commented 5 years ago

Thanks for the quick reply. So yes, I basically calculated kappa over trials (320 test trials, so kappa evaluated from 320 "true" labels and 320 "predicted" labels). I'm having trouble understanding the per-timestep prediction. The input to my CNN is the size of one training sample (batch size × 3 channels × 1126 samples). Am I supposed to somehow split this into windows and calculate the kappa for each window separately? Or for each timestep? (How is that possible?)

Also, I used the MoABB Python package to load the data: http://moabb.neurotechx.com/docs/generated/moabb.datasets.BNCI2014004.html#id2. I'm guessing you used the raw GDF files from the official competition site; maybe this also affects my inability to achieve the "right" performance? And if so, is there any chance that you wrote preprocessing code for this dataset that is shareable? (Fully understandable if not, but in that case, is the preprocessing stage similar to dataset 2a, so that I can implement it myself?)

Thanks again

robintibor commented 5 years ago

First, I noticed another thing: the shallow values are wrong in Table 6 of the paper. They are correct in the text: "For BCI competition IV dataset 2b, deep ConvNets reached a mean kappa value of 0.598, almost identical to the FBCSP competition results (0.599), and shallow ConvNets reached a slightly better kappa value of 0.629." (So shallow should be +0.03, not +0.3, in the table.)

I had sent a mail to Human Brain Mapping at the end of 2017, but something seems to have gotten lost in the communication; I have now asked them again.

I had a look at my ancient code to see how I did the preprocessing. I don't recommend spending too much time on it since it is very messy code. Still, for future reference (also for me):

https://github.com/robintibor/braindevel/blob/21f58aa74fdd2a3b03830c950b7ab14d44979045/braindecode/configs/sacred/bcic_iv_2b.py#L200-L238 (preprocessing for 2b)
https://github.com/robintibor/braindevel/blob/21f58aa74fdd2a3b03830c950b7ab14d44979045/braindecode/mywyrm/clean.py#L225-L244 (artefact mask usage for 2b)
https://github.com/robintibor/braindevel/blob/21f58aa74fdd2a3b03830c950b7ab14d44979045/braindecode/datasets/loaders.py#L473-L495 (2b loading code)

So, the preprocessing seems identical to 2a, except that I additionally remove trials marked as artefacts in the GDF file, to stay consistent with the competition and the kappa values I compared to.
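The artefact-trial removal mentioned above amounts to boolean masking once you have the per-trial artefact flags from the GDF annotations. A minimal numpy sketch (array names and shapes are my assumptions, not the actual loader code):

```python
import numpy as np

def remove_artifact_trials(X, y, artifact_mask):
    """Drop trials flagged as artefacts, keeping labels aligned.

    X:             (n_trials, n_channels, n_samples) trial data.
    y:             (n_trials,) labels.
    artifact_mask: (n_trials,) bool, True = trial marked as artefact.
    """
    keep = ~np.asarray(artifact_mask, dtype=bool)
    return X[keep], y[keep]
```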

For how to get per-timestep predictions: with my cropped decoding you get them quite naturally. For other methods/ConvNets you would have to give your ConvNet inputs that end at timestep 0, end at timestep 1, etc., compute all the predictions like that, and then compute the kappa for each timestep. Keep in mind that the prediction for timestep k may only use data from timesteps <= k:

All algorithms must be causal, meaning that the classification output at time k may only depend on the current and past samples xk, xk−1, . . . , x0. (http://bbci.de/competition/iv/desc_2b.pdf)
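The causal inputs described above can be sketched as a sliding window whose right edge is the prediction timestep. This is an illustrative generator (window length and names are assumptions); braindecode's cropped decoding computes the equivalent predictions far more efficiently by reusing convolutions:

```python
import numpy as np

def causal_windows(trial, win_len):
    """Yield (end_timestep, window) pairs for one trial.

    Each window ends at timestep k and contains only samples <= k,
    satisfying the competition's causality requirement.
    trial: (n_channels, n_samples).
    """
    n_samples = trial.shape[1]
    for k in range(win_len - 1, n_samples):
        yield k, trial[:, k - win_len + 1 : k + 1]
```

Feeding each window to the network gives one prediction per timestep k >= win_len - 1, from which the kappa timecourse can then be computed over trials.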

erap129 commented 5 years ago

Thank you very much! Very helpful.

robintibor commented 5 years ago

Great! Can you let me know when you have results @erap129 ? I actually never ran on this dataset again since the reimplementation of braindecode in pytorch :)

erap129 commented 5 years ago

My method currently yields 0.52 kappa, but: 1) I'm still pretty sure I'm not doing the evaluation 100% correctly and will look into it further. 2) I'm still debugging an improvement (I hope) on the way. :)

robintibor commented 5 years ago

Your method meaning what exactly? :)