Closed: martinwimpff closed this issue 6 months ago.
Thank you @martinwimpff, I value the time you took to investigate the code and appreciate your feedback and advice.
I agree with most of your comments, but a few are not entirely clear to me. First, I would like to ask about the code you used to reproduce the results. You said, "After implementing everything": did you use the same code as in this repository, or a different implementation (e.g., your own)? The reported results are comparative results, in the sense that the reproduced models were trained and tested with the same settings as the proposed model. I agree that choosing the best results on the test data is not best practice, because it introduces bias toward the test data; however, we reported it this way to align with most papers in this field, which report their best results.
For best practices, we have two options:
The first option is to divide the dataset into three parts (train, validation, and test), as you mentioned in your first point, and use the test set to evaluate the model only once, after training, validation, and optimization are complete. This evaluation method is particularly important in production settings, where the reported accuracy must be close to the accuracy in real-world scenarios.
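A minimal sketch of option 1, using hypothetical array shapes (roughly BCI Competition IV-2a dimensions) and scikit-learn's `train_test_split`; this is illustrative only and not the repository's actual data pipeline:

```python
# Illustrative only: a 60/20/20 train/val/test split where the test set is
# touched exactly once, after all training and model selection are finished.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(576, 22, 1125)   # hypothetical: trials x channels x samples
y = np.random.randint(0, 4, 576)     # hypothetical: 4 motor-imagery classes

# First carve off 40%, then split that part into validation and test halves.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)

# Train and tune on (X_train, y_train) / (X_val, y_val);
# evaluate (X_test, y_test) once at the very end.
```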
The second option is to compute the average performance over several random runs, as you mentioned in your second point, where each run is an independent training and evaluation procedure. In this case there is no harm in dividing the data into only two parts (training and testing), because no bias toward the test data is introduced: all runs are random, and the average performance across them is a good indicator of model performance. In the code, we compute both the best-run performance and the average performance over all runs, although we did not report the average in the published paper. In our new related paper, however, we report the average performance over 10 random runs.
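A minimal sketch of option 2, where `run_training` is a hypothetical placeholder for one independent train-and-evaluate cycle (it is not a function from this repository):

```python
# Illustrative only: report the mean and standard deviation over several
# independent runs rather than the single best run.
import numpy as np

def run_training(seed: int) -> float:
    """Placeholder for one full training/evaluation run; returns test accuracy."""
    rng = np.random.default_rng(seed)
    return 0.78 + 0.02 * rng.standard_normal()   # dummy value for illustration

accs = np.array([run_training(seed) for seed in range(10)])
print(f"mean ± std over {len(accs)} runs: {accs.mean():.3f} ± {accs.std():.3f}")
print(f"best single run: {accs.max():.3f}")      # should not be reported on its own
```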
The following points are not clear to me:
Since you are evaluating this model with a different evaluation procedure, I would appreciate it if you could share the results you obtained with your training/evaluation setup. Also, how does this model perform relative to the other models you have tested? If other models perform better than this one, could you please share them with me?
Thank you again for your time and feedback
Hi @Altaheri,
Yes, I implemented it on my own in PyTorch/PyTorch Lightning.
Regarding best practices: see this blog post. The reported accuracy should always be realistic and should never be a fantasy number (even when other publications might do that). The second option is okay, as long as the test accuracy does not influence the specific checkpoint. So in your case: instead of using EarlyStopping, train for a fixed number of epochs.
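A minimal, purely illustrative sketch of this advice (a toy Keras model and dummy data, not the repository's ATCNet or training script): train for a fixed epoch budget with no EarlyStopping, and evaluate the test set exactly once at the end.

```python
import numpy as np
import tensorflow as tf

# Dummy data standing in for one subject's trials (hypothetical shapes).
x_train, y_train = np.random.randn(200, 22, 1125, 1), np.random.randint(0, 4, 200)
x_test, y_test = np.random.randn(50, 22, 1125, 1), np.random.randint(0, 4, 50)

# Toy classifier, only to make the example runnable.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(22, 1125, 1)),
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Fixed epoch budget; no EarlyStopping and no checkpoint chosen via test accuracy.
model.fit(x_train, y_train, epochs=100, validation_split=0.2, verbose=0)

# The test set is evaluated once, after training has finished.
_, test_acc = model.evaluate(x_test, y_test, verbose=0)
print(f"test accuracy: {test_acc:.3f}")
```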
So far, I have only run your model twice (for all subjects, subject-specific, without EarlyStopping). The accuracy I got was 78-79%, which is ~7% lower than what you reported. I believe this is still a good result, but not an "outstanding" one, especially considering the inefficiency of the architecture.
Thank you @martinwimpff for all the info. Your point is now clear, and I will keep it in mind for future research and updates.
Good luck with your research
Hello, may I ask if you could share the PyTorch code you used for your reproduction? My email is hancan@sjtu.edu.cn. The issue I am currently facing is that the results obtained by running the provided TensorFlow code match the paper, despite the reproducibility issues, while the performance of my own PyTorch implementation is quite poor. In fact, even with exactly the same data preprocessing as the author's, my EEGNet implementation only reaches around 73% accuracy, far from the ~80% reported in the author's paper for their reproduced EEGNet with adjusted random seeds. I would like to understand why the author's accuracy for the reproduced EEGNet is so high compared to the results reported in other papers that reproduce EEGNet. I am not sure whether there is an error in my PyTorch code, and I hope you can share your code with me for reference. Thank you very much! @martinwimpff
@hancan16 be happy with the 73%; don't chase results that were obtained with an invalid training procedure. If you want to compare your architecture against other attention-based models, check this out. The code is available at https://github.com/martinwimpff/channel-attention/
@hancan16 you can compare your PyTorch implementation with this one: https://github.com/braindecode/braindecode/blob/master/braindecode/models/atcnet.py
In the file main_TrainValTest.py, we have adopted the guidelines detailed in this post (Option 2). The results based on this methodology are:
Seed 1: accuracy = 81.41
Seed 2: accuracy = 80.28
Seed 3: accuracy = 81.37
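For reference, a hedged sketch of how a fixed seed per run could be set in a TensorFlow/Keras script such as main_TrainValTest.py (the exact calls in the repository may differ):

```python
import random
import numpy as np
import tensorflow as tf

def set_seed(seed: int) -> None:
    # Fix the Python, NumPy, and TensorFlow RNGs so one run is reproducible.
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

for seed in (1, 2, 3):
    set_seed(seed)
    # ... run one full training/evaluation cycle here and record its accuracy ...
```

Averaging the three reported seeds gives roughly 81.02% accuracy.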
Hi @Altaheri,
I am trying to reproduce your results, which honestly looked too good to be true to me. After implementing everything, I was not even close to your results. I then examined your training routine in detail and found three major flaws that explain why your model appears to perform "so well" (it doesn't). After I changed my pipeline to match yours, I got similar results. The problem with those results, however, is that they rely heavily on the randomness of the training routine and on the missing independence of your test set.
To back up my findings, I ran a few (subject-specific) experiments for subject 2 (a bad subject) and subject 3 (a good subject). I used EarlyStopping, chose either the last or the best checkpoint, and ran each experiment with 10 different random seeds. Results:
Subject 2:
Average accuracy (last ckpt): 63.3 ± 3.0
Optimal seed accuracy (last ckpt): 67.4
Average accuracy (best ckpt): 67.0 ± 3.8
Optimal seed accuracy (best ckpt): 71.9
Subject 3:
Average accuracy (last ckpt): 90.6 ± 2.7
Optimal seed accuracy (last ckpt): 94.8
Average accuracy (best ckpt): 94.6 ± 0.8
Optimal seed accuracy (best ckpt): 95.8
If you have any further questions, feel free to ask!