angerhang / thesis


Subject count still inconsistent #12

Closed liutianlin0121 closed 7 years ago

liutianlin0121 commented 7 years ago

I just read your updated report, but the "subject counts" in the report are still inconsistent with those in the code.

[Screenshot: subject-count table from the report]

In the above table, you state that there are 28 YH. In the code, however, if you load thirdGroup.mat (the OL vs YH group), you'll see there are 29 YH (from cell 25 to cell 53; the young subjects' codes are shorter).

What is more, in the table you state that there are 23 OL and 24 OH. If you load seconddata.mat (the OL vs OH group) in MATLAB, you see there are 48 cells, so "23 OL and 24 OH" is clearly not true. I assume there are 24 OL and 24 OH in the code.
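
For reference, this is roughly how the counts can be checked (a minimal sketch; it grabs whatever variable each .mat file contains, so it does not depend on the exact variable names inside the files):

```matlab
% Rough sketch of checking the subject counts in MATLAB.
% Loads each .mat file and reads out its (single) cell array,
% whatever the stored variable happens to be called.
s  = load('thirdGroup.mat');            % OL vs YH group
fn = fieldnames(s);
grp = s.(fn{1});
fprintf('thirdGroup.mat: %d cells in total\n', numel(grp));
fprintf('cells 25..53 give %d YH subjects\n', numel(25:53));   % 29, not 28

s  = load('seconddata.mat');            % OL vs OH group
fn = fieldnames(s);
grp = s.(fn{1});
fprintf('seconddata.mat: %d cells in total\n', numel(grp));    % 48 = 24 OL + 24 OH
```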

In sum, we have one more YH and one more OL in the code than stated in the report. I wonder if we should kick one YH sample and one OL sample out of the training set in the code. This inconsistency may be more serious than it first seems: we have an extremely small training set (even smaller once I separate out a genuine testing set -- not just a validation set), so if the training set contains a "bad sample" that should have been excluded, this could seriously undermine the final results.

angerhang commented 7 years ago

You are absolutely right on this. Thank you for reading both the report and the code so carefully. It's true that we have 29 YH and 24 each for OL and OH, as I just confirmed with the subject labels and the original subject group sheet.

That being said, I don't think we should kick out anything, because the code was executed first and the report was written later, so updating the report is enough. I will do another proofread this weekend. Two other co-authors have not yet signed the declaration form, so there is time for last-minute changes :D

liutianlin0121 commented 7 years ago

Thanks for your response. I still think we should either redo the experiments with a genuine testing set, or ask Prof. Jaeger about the test/validation-set question we have in mind. We have to be 100% sure that the "test error" in the report is NOT a result of artificial overfitting --- from my own trials, this actually seems very likely. Moreover, sooner or later Prof. Jaeger will become aware of the issue -- if it is indeed an issue -- either by himself or from me. My feeling is that there will be extra inconvenience and embarrassment if it is only noticed after the technical report has been released and endorsed by the professors.

So my suggestion is to use the training set {12 OL, 12 OH, 12 YH} and the testing set {remaining 12 OL, remaining 12 OH, remaining 17 YH}, and redo the two experiments (OL vs YH, YH vs OH) to check whether the claimed results hold up. If they are consistent, everything is fine and it makes your results much more solid; if not, it is probably best to point out our concern before the report is released. That being said, I am not an author of this report, so it is merely a suggestion. But since a technical report is a citable publication, I suggest being more cautious.
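
To make the split concrete, something like the following is what I have in mind (a minimal sketch, assuming the OL vs YH cell array keeps OL in cells 1-24 and YH in cells 25-53, as noted above; the index variables are just placeholders):

```matlab
% Minimal sketch of the proposed training/testing split for the OL vs YH case.
% Assumes the cell array is ordered OL = cells 1:24, YH = cells 25:53.
rng(1);                                 % fix the seed so the split is reproducible
olPerm = 0  + randperm(24);             % shuffled OL indices 1..24
yhPerm = 24 + randperm(29);             % shuffled YH indices 25..53

trainIdx = [olPerm(1:12),   yhPerm(1:12)];    % 12 OL + 12 YH for training
testIdx  = [olPerm(13:end), yhPerm(13:end)];  % remaining 12 OL + 17 YH for testing
% trainIdx / testIdx can then be used to index the loaded cell array,
% and the same idea applies to the other experiment.
```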

angerhang commented 7 years ago

That's worth pondering.

Why do you think the result is very prone to overfitting, based on your own experiments? What I do know is that the classification result indeed had high variance with respect to the parameters. This is not surprising, because we are doing 3-fold cross-validation.

A 50-50 split wouldn't work in this case, because EEG is really high-dimensional and with such a split it's unlikely we would get anything at all. But what I have seen in recent studies is that people report a leave-one-subject-out validation: the averaged result of using every single subject in turn as the only validation point and training on all the other subjects.
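
Roughly, the idea is something like this (just a sketch; `trainESN` and `classifyESN` stand in for whatever training and classification routine is actually used, they are not functions from our code):

```matlab
% Sketch of the leave-one-subject-out scheme, assuming 'data' is a cell array
% of subjects and 'labels' a vector of class labels of the same length.
% trainESN / classifyESN are hypothetical placeholders, not existing functions.
nSubjects = numel(data);
correct   = false(nSubjects, 1);
for i = 1:nSubjects
    trainSet    = data([1:i-1, i+1:nSubjects]);     % all subjects except i
    trainLabels = labels([1:i-1, i+1:nSubjects]);
    model       = trainESN(trainSet, trainLabels);
    correct(i)  = classifyESN(model, data{i}) == labels(i);   % validate on subject i
end
fprintf('Leave-one-subject-out accuracy: %.2f%%\n', 100 * mean(correct));
```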

Also, I don't think we are doing something really wrong, because we reported what we obtained using 3-fold cross-validation, and Jaeger is aware that the test error was based on 3-fold cross-validation.

But you do raise a good point, I will send an email to them to ask their opinions on this.

liutianlin0121 commented 7 years ago

Thanks a lot!

I think it is prone to overfitting based on my experiment with a 50-50 split: training set = {12 OL, 12 OH, 12 YH} and testing set = {remaining 12 OL, remaining 12 OH, remaining 17 YH}. With modifications of your scheme, it is actually not hard to achieve 100% correctness on all training samples with an exceptionally small reservoir of 4 neurons -- this is achieved by bookkeeping the so-called "segment-end states", as described by Prof. Jaeger in his paper "Optimization and applications of echo state networks with leaky-integrator neurons". However, even with this small net, the ESN still overfits -- training results are 100% correct, but testing results are complete rubbish. I haven't tried the 50-50 split with your exact scheme yet, as it requires a large reservoir (1000 neurons or more) and takes longer. I might do it tomorrow night.
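
For concreteness, the segment-end-state bookkeeping I mean is roughly the following (a minimal sketch, not my exact code; the input scaling, leaking rate, and ridge parameter are arbitrary, and `trainCells` / `trainLabels` stand for the training trials as channel-by-time matrices and their +1/-1 labels):

```matlab
% Minimal sketch: 4-neuron leaky-integrator ESN whose state at the end of each
% trial ("segment-end state") is kept as that trial's feature vector, followed
% by a ridge-regression readout. All parameters here are illustrative only.
nRes = 4; nChan = 12; leak = 0.3; ridgeParam = 1e-2;
rng(1);
Win = 0.5 * (rand(nRes, nChan) - 0.5);
W   = 0.5 * (rand(nRes, nRes)  - 0.5);
W   = W / max(abs(eig(W)));                     % roughly unit spectral radius

nTrain    = numel(trainCells);
endStates = zeros(nTrain, nRes);
for k = 1:nTrain
    u = trainCells{k};                          % nChan x T time series
    x = zeros(nRes, 1);
    for t = 1:size(u, 2)
        x = (1 - leak) * x + leak * tanh(Win * u(:, t) + W * x);
    end
    endStates(k, :) = x';                       % bookkeep the segment-end state
end

% Ridge-regression readout on the collected end states.
Wout = (endStates' * endStates + ridgeParam * eye(nRes)) \ (endStates' * trainLabels(:));
% A new trial is classified by running it through the same reservoir and
% taking sign(endState * Wout).
```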

I am not sure why we can't do a 50-50 split for dimensionality reasons. Our EEG has 12 channels, which is actually a fairly common dimensionality among classification tasks. Take the Japanese Vowel classification task reported in the above-mentioned paper by Prof. Jaeger, "Optimization and applications of echo state networks with leaky-integrator neurons", as an example: each vowel has 14 channels (more than our 12), and the task is to assign 9 classes (more than our 2) using a single ESN (versus our 3 ESNs). Yet in that example, 270 training samples and 370 testing samples are used. There might be other reasons why a 50-50 split is not appropriate in the EEG case, but I assume a dimensionality of 12 is not one of them.

I still think that, whatever the task is, there should be a clear cut (which is lacking in the current report) between training and testing sets -- a validation set is a subset of the training set and should be distinguished from the testing set. But I'm not sure whether it should be 50-50, 60-40, or 99-1. That's why I think it would be great if we could bring this up with Prof. Jaeger and Prof. Godde explicitly and ask their opinion -- this is also important for my future work. Thanks a lot for doing it! It would be nice if you could cc me.

angerhang commented 7 years ago

One thing about the Japanese Vowel classification task you mentioned is that the number of features each data point has is significantly smaller than the number of time steps we have for each channel. That's why the example is not comparable.