eeyhsong / EEG-Conformer

EEG Transformer 2.0. i. Convolutional Transformer for EEG Decoding. ii. Novel visualization - Class Activation Topography.
GNU General Public License v3.0

Result About Dataset I's Accuracy #2

Closed wangbxj1234 closed 1 year ago

wangbxj1234 commented 1 year ago

I have run this code with your EEG-Transformer preprocessing code (getData.m). However, each subject's result is not as good as in your paper. If you are using a different way of preprocessing the data this time, could you please tell me the difference? Thank you very much.

eeyhsong commented 1 year ago

Do you use standardization in preprocessing? You can refer to the original paper. It is not hard to get good results on BCI Competition IV Dataset 2a with this method.
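
For reference, a minimal sketch of what per-trial z-score standardization could look like, assuming the trials are loaded as a numpy array. The exact scheme lives in getData.m and the paper, so treat the shapes and axis choice here as assumptions rather than the repo's implementation:

```python
import numpy as np

def standardize_trials(X, eps=1e-8):
    """Per-trial, per-channel z-score standardization.

    X: array of shape (n_trials, n_channels, n_samples).
    This is one common convention; getData.m / the original paper may
    instead compute statistics over the training session, so this is
    illustrative only.
    """
    mean = X.mean(axis=-1, keepdims=True)
    std = X.std(axis=-1, keepdims=True)
    return (X - mean) / (std + eps)  # eps avoids division by zero on flat channels
```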

wangbxj1234 commented 1 year ago

I see. Is the accuracy in Table 2 the 'best accuracy' from your code rather than the 'average accuracy'? If so, my reproduced results are almost the same as Table 2.

martinwimpff commented 1 year ago

The major issue here is that the end result relies heavily on the exact checkpoint. I investigated the test accuracy of the model at every epoch and observed substantial oscillations (>±10%). The paper says training runs for 2k epochs and the model is then evaluated, but what you really do is pick the best accuracy along the way (i.e. pick the best checkpoint).

The problem is that you pick the epoch/checkpoint based on the test accuracy, so your test accuracy is no longer independent. If you picked the checkpoint based on training or validation metrics, this procedure of picking the "best" checkpoint would be fine.
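
For illustration, a minimal PyTorch-style sketch of the independent procedure (the model, data loaders, optimizer, and loss function are assumed to be defined elsewhere; this is not the repo's training script):

```python
import copy
import torch

def train_with_val_selection(model, train_loader, val_loader, test_loader,
                             optimizer, loss_fn, n_epochs, device="cpu"):
    """Select the checkpoint on *validation* accuracy; touch the test set once."""
    def accuracy(loader):
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for x, y in loader:
                pred = model(x.to(device)).argmax(dim=1)
                correct += (pred == y.to(device)).sum().item()
                total += y.numel()
        return correct / total

    best_val_acc = 0.0
    best_state = copy.deepcopy(model.state_dict())
    for _ in range(n_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()
            optimizer.step()
        val_acc = accuracy(val_loader)
        if val_acc > best_val_acc:  # model selection uses validation data only
            best_val_acc = val_acc
            best_state = copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)
    return accuracy(test_loader)  # session_E is evaluated exactly once, at the end
```

Usage would look like `test_acc = train_with_val_selection(model, train_loader, val_loader, test_loader, optimizer, torch.nn.CrossEntropyLoss(), n_epochs=2000)`.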

eeyhsong commented 1 year ago

Hello @martinwimpff, sincerely thank you for your suggestion. You are right: we should obtain the test results by saving the checkpoint with the best validation loss/accuracy. It is definitely not rigorous to test in the current manner; it effectively reports the 'optimal' results on the test data, which is why I prefer to call it hold-out validation, as in the README. Besides, I think it is better to keep the train and test sessions separate for this kind of dataset than to shuffle the sessions before testing. (Not an excuse.)

Thanks again, and I will correct it in future works. 🤝

martinwimpff commented 1 year ago

@eeyhsong thanks for the quick response.

Just to make my point clear:

1. You should always split the sessions into two separate sets (as you did).
2. To produce reliable and independent test results you can go one of two ways: 2.1) Specify the training routine and always pick the last checkpoint (e.g. train for 2k epochs, lr=..., weight_decay=...), then evaluate on the test data (session_E). 2.2) Split your training data (session_T) into train and validation data, pick the checkpoint based on some validation metric, then evaluate on the test data (session_E). In the case of BCIC IV 2a (only 288 train samples) it would probably make sense to do a k=4/5 cross-validation and then average over the k folds, as the number of samples in the validation data is quite small (58-72 samples); see the sketch below.

Your method obviously gives "optimal" results for the test data as you pick the model based on the test accuracy. This is not what "hold-out" means.
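
To make option 2.2 concrete, here is a minimal sketch of the fold structure, assuming session_T has been loaded into numpy arrays `X_T`, `y_T` of 288 trials and that `train_and_score` is a hypothetical helper (not part of this repo) that trains a model and returns its validation accuracy:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def kfold_validation_score(X_T, y_T, train_and_score, n_splits=5, seed=0):
    """Sketch of option 2.2: split session_T into k stratified folds,
    train on k-1 folds, score on the held-out fold, average over folds.

    `train_and_score(X_tr, y_tr, X_val, y_val)` is a hypothetical callable
    that trains a model (e.g. with validation-based checkpoint selection)
    and returns its validation accuracy. session_E is only evaluated once,
    after this selection step is finished.
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = [
        train_and_score(X_T[tr], y_T[tr], X_T[val], y_T[val])
        for tr, val in skf.split(X_T, y_T)
    ]
    return float(np.mean(scores))  # average validation accuracy over the k folds
```

StratifiedKFold is used here so that the four motor-imagery classes stay balanced across the small validation folds.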

eeyhsong commented 1 year ago

@martinwimpff Thank you for helping me figure out the validation and test setup. I have switched to the approach you describe in 2) for my next work. Sorry for the confusion caused by my code.

rmib200 commented 1 year ago

Regarding this section of your comment:

2. Split your training data (session_T) into train and validation data and pick the checkpoint based on some validation metric. Then evaluate on the test data (session_E). In the case of BCIC IV 2a (only 288 train samples) it would probably make sense to do a k=4/5-cross-validation and then average over the k folds as the number of samples in the validation data is quite small (58-72 samples).

How do you suggest doing the cross-validation in the case of dataset 1? The sessions for each subject are pre-segmented into training and evaluation.

edw4rdyao commented 9 months ago

@eeyhsong thanks for the quick response.

Just to make my point clear:

  1. You should always split the sessions in two separate sets (as you did)
  2. To produce reliable and independent test results you can go one of two ways: 2.1) Specify the training routine and always pick the last checkpoint (e.g. train for 2k epochs, lr=..., weight_decay=...). Then evaluate on the test data (session_E). 2.2) Split your training data (session_T) into train and validation data and pick the checkpoint based on some validation metric. Then evaluate on the test data (session_E). In the case of BCIC IV 2a (only 288 train samples) it would probably make sense to do a k=4/5 cross-validation and then average over the k folds as the number of samples in the validation data is quite small (58-72 samples).

Your method obviously gives "optimal" results for the test data as you pick the model based on the test accuracy. This is not what "hold-out" means.

@martinwimpff Hi Martin :) I think you are right. Have you tried to reproduce the experiment using the method you mentioned? What was the result?

martinwimpff commented 9 months ago

@edw4rdyao You can check out this preprint. The code will be available as soon as the paper gets accepted.

edw4rdyao commented 9 months ago

@edw4rdyao You can check out this preprint. The code will be available as soon as the paper gets accepted.

@martinwimpff Nice work! I hope your paper gets published. Even if you can only open-source the code that reproduces the other models, that would be great. Thanks again for the work you guys did!

martinwimpff commented 9 months ago

@edw4rdyao the code is now online at https://github.com/martinwimpff/channel-attention

edw4rdyao commented 9 months ago

@martinwimpff Thank you for your work, I will study it carefully!