Questions on Pre-processing

liutianlin0121 commented 7 years ago

Thanks for your answers yesterday; I think the training machinery for ESNs is quite transparent to me now. I have some questions about preprocessing today:

In names.mat, there are 77 sample names. In the "raw_eeg" folder, there are 78 samples (after deleting some code files which somehow got mixed into the raw_eeg folder). Comparing the two, the subject Y3C_3 is included in the "raw_eeg" folder but not in names.mat. What is wrong with Y3C_3? It is not documented as a dropout in dropout.docx.
In dropout.docx, what do you mean by "Spiro", "RQ < 1", and "Spirofehler"? (The names under the "Spiro" title do not seem to have been dropped.)
Would you describe the purposes of these .csv and .xlsx files (figure below)? My guess is that these files are directly inherited from Prof. Godde's group. It seems that the only information relevant to the classification task is the "learn" column, which tells us which group a subject belongs to. But since you've already summarized this info in labels.csv and names.mat, perhaps we don't need to keep these files anymore?

What is the file outlier.R?
In the report you state that the raw EEG was recorded with 32 channels. However, if I read any raw EEG file with pop_biosig and execute EEG in the MATLAB command window, I see that the raw data have 47 channels instead of 32: nbchan: 47. data: [47x161792 single].
Inspecting the data with pop_eegplot( EEG, 1, 1, 1), I am confused that most of these 47 raw-data channels look "empty" for some reason. For instance, if I follow preprocess.m and pre-process the EEG of subject 2a8i7h_3,

dataPath = strcat(RawDataPath, '2a8i7h_3', '.bdf');
EEG = pop_biosig(dataPath);
[ALLEEG, EEG, CURRENTSET] = pop_newset(ALLEEG, EEG, 0,'gui','off'); 
EEG = eeg_checkset( EEG );
pop_eegplot( EEG, 1, 1, 1);

I see this:

Note that all 47 channels are displayed here, with many of them appearing empty. How can I plot the raw data as in Figure 5 (see below) of your report?

My guess is that these channels look "empty" because they are corrupted by very low or very high frequency components. After band-pass filtering, all channels show up non-empty. So I assume that Figure 5 above is perhaps NOT the "raw data" as claimed in the figure title, but the data after some band-pass filtering (or other treatment).

In the report, you state that you use a 0.5 - 79 Hz band-pass filter to clean the raw data. In the code, it seems that you use a 1 - 40 Hz passband for all raw data:

EEG = pop_eegfiltnew(EEG, 1,40,1690,0,[],1);

Thanks!

angerhang commented 7 years ago

Ohlalala, lots of questions:


Q1. In names.mat, there are 77 sample names. In the "raw_eeg" folder, there are 78 samples (after deleting some code files which somehow got mixed into the raw_eeg folder). Comparing the two, the subject Y3C_3 is included in the "raw_eeg" folder but not in names.mat. What is wrong with Y3C_3? It is not documented as a dropout in dropout.docx.

A1. Yeah, Y3C was a left-handed subject, so we removed it. This was missing from the recording procedure, so the dropout document didn't include it.
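
For reference, a mismatch like this between names.mat and the raw_eeg folder can be found with a set difference. This is only a minimal sketch: the two lists below are illustrative stand-ins, and the name of the variable stored inside names.mat is not given in this thread, so it is left as an assumption in the comments.

```matlab
% Illustrative stand-ins. In practice one would presumably use
%   d = dir(fullfile(RawDataPath, '*.bdf')); fileNames = {d.name};
% and take listedNames from the variable stored in names.mat
% (its name is an assumption, it is not stated in this thread).
fileNames   = {'2a8i7h_3.bdf', 'Y3C_3.bdf'};  % hypothetical folder listing
listedNames = {'2a8i7h_3'};                   % hypothetical content of names.mat

% Strip the .bdf extension so both lists use the same naming scheme.
rawNames = regexprep(fileNames, '\.bdf$', '');

% Recordings that exist on disk but are not listed in names.mat.
missingFromNames = setdiff(rawNames, listedNames);
disp(missingFromNames);
```

With the full folder listing and the real names.mat, the same call should report Y3C_3 as the one extra recording.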

Q2. In dropout.docx, what do you mean by "Spiro", "RQ < 1", and "Spirofehler"? (The names under the "Spiro" title do not seem to have been dropped.)

A2. I don't know either, since I didn't do the experimental part. But if the document says that subjects whose RQ < 1 are not dropped, then we don't drop them.

Q3. Would you describe the purposes of these .csv and .xlsx files (figure below)? My guess is that these files are directly inherited from Prof. Godde's group. It seems that the only information relevant to the classification task is the "learn" column, which tells us which group a subject belongs to. But since you've already summarized this info in labels.csv and names.mat, perhaps we don't need to keep these files anymore?

A3. Yes, that's right. Most of the Excel files are intermediate files for generating the relevant subject groups. But subject_meta.xlsx should be kept no matter what.

Q4. What is the file outlier.R?

A4. This was used to get rid of the ALS outliers and split the data into different groups.

Q5. In the report you state that the raw EEG was recorded with 32 channels. However, if I read any raw EEG file with pop_biosig and execute EEG in the MATLAB command window, I see that the raw data have 47 channels instead of 32: nbchan: 47. data: [47x161792 single].

A5. Because there are many irrelevant channels.
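
A rough sketch of how one might inspect the 47 channels and keep only the scalp electrodes is given below. It assumes EEGLAB is on the MATLAB path and that EEG was loaded with pop_biosig as in the snippet in this thread. Which channels are the 32 scalp electrodes is not stated here, so the index range is an assumption to be checked against the printed labels.

```matlab
% Assumes EEG was loaded with pop_biosig and EEGLAB is on the path.

% Print the index and label of every channel to see which are scalp electrodes.
labels = {EEG.chanlocs.labels};
for k = 1:numel(labels)
    fprintf('%2d  %s\n', k, labels{k});
end

% ASSUMPTION (not confirmed in this thread): the 32 scalp electrodes are
% the first 32 channels. Adjust scalpIdx after checking the labels above.
scalpIdx = 1:32;
EEG_scalp = pop_select(EEG, 'channel', scalpIdx);
```

Writing the result to a new variable keeps the original 47-channel dataset available for comparison.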

Q6. Inspecting the data with pop_eegplot( EEG, 1, 1, 1), I am confused that most of these 47 raw-data channels look "empty" for some reason. For instance, if I follow preprocess.m and pre-process the EEG of subject 2a8i7h_3,

dataPath = strcat(RawDataPath, '2a8i7h_3', '.bdf');
EEG = pop_biosig(dataPath);
[ALLEEG, EEG, CURRENTSET] = pop_newset(ALLEEG, EEG, 0,'gui','off'); 
EEG = eeg_checkset( EEG );
pop_eegplot( EEG, 1, 1, 1);

I see this:

Note that all 47 channels are displayed here, with many of them appearing empty. How can I plot the raw data as in Figure 5 (see below) of your report?

A6. To see what Figure 5 shows, you need to remove the DC offset in the EEGLAB plotting routine.
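
To illustrate what removing the DC offset does, here is a minimal sketch on synthetic data. The two-channel matrix and the offset values are made up for illustration and only stand in for EEG.data (channels x samples). A large constant offset per channel is presumably what pushes the traces out of the visible range in the plot.

```matlab
% Synthetic stand-in for EEG.data: two channels carrying the same 10 Hz
% oscillation on top of large, different constant offsets (made-up values).
fs = 256;                          % hypothetical sampling rate in Hz
t  = (0:fs-1) / fs;                % one second of samples
data = [sin(2*pi*10*t) + 5000; ...
        sin(2*pi*10*t) - 3000];

% Remove the DC offset: subtract each channel's mean over time.
dataNoDC = bsxfun(@minus, data, mean(data, 2));

% After the subtraction every channel is centred on zero.
disp(mean(dataNoDC, 2));
```

On real data the same subtraction could be applied to a copy of EEG.data before plotting, which leaves the stored dataset itself unchanged.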

Q7. In the report, you state that you use a 0.5 - 79 Hz band-pass filter to clean the raw data. In the code, it seems that you use a 1 - 40 Hz passband for all raw data:

EEG = pop_eegfiltnew(EEG, 1,40,1690,0,[],1);

A7. Oh, this is a typo again; 1-40 Hz is the right one.
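
For readers unfamiliar with the positional form of pop_eegfiltnew, the call above can be spelled out as follows. The variable names follow the argument names in the EEGLAB help text as I understand it, so treat the comments as a reading of that documentation rather than a statement about the thesis code. EEG is assumed to be loaded already.

```matlab
% Positional arguments of the pop_eegfiltnew call, spelled out one by one.
locutoff  = 1;     % lower passband edge in Hz
hicutoff  = 40;    % upper passband edge in Hz
filtorder = 1690;  % FIR filter order (filter length minus 1)
revfilt   = 0;     % 0 = band-pass, 1 = inverted (band-stop) filter
usefft    = [];    % kept only for backward compatibility
plotfreqz = 1;     % 1 = plot the filter's frequency response

EEG = pop_eegfiltnew(EEG, locutoff, hicutoff, filtorder, ...
                     revfilt, usefft, plotfreqz);
```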

Thanks for all the questions :D

liutianlin0121 commented 7 years ago

Thanks :D

angerhang commented 7 years ago

I think before you dive into the details, it's important for you to understand the higher-level methodology: what the motivation is, what we don't have at the moment, and how we approach the problem.

The thesis is interesting from two perspectives. One is the engineering side: trying to use ESNs to build a good classifier. The other is the neuroscience side: how do we use neural networks to prove a hypothesis we have in mind? The original idea is that, among the older subjects, there are good learners and bad learners. We believe that some good learners perform well because their learning skills are decent, but we also suspect there are good learners whose learning skills aren't as good and who achieve fine results through some other compensating effects in the brain. We would like to know how to prove this hypothesis and what the influencing factors are.

Some similar ideas appear in this paper: https://arxiv.org/pdf/1705.08498.pdf, except that we are also trying to build some ML-assisted visualizations for hypothesis searching, which is mentioned at the end of the report.

liutianlin0121 commented 7 years ago

thx for pointing this out.

angerhang / thesis

Questions on Pre-processing #10