brainhack-school2020 / MarkNelson86_EEGRecogBIDS


modeling EEG data using ML #3

Open MarkNelson86 opened 4 years ago

MarkNelson86 commented 4 years ago

@harveyaa @surchs @k-jerbi @illdopejake I would appreciate your advice.

I want to design a machine learning algorithm to predict trial-wise behavioral measures (object recognition) from single-trial EEG recorded at stimulus encoding, but I have very little practical ML experience.

I know that EEG amplitude is significantly more positive in a specific time window (~400ms) over a specific cluster of electrodes (~22 electrodes, central scalp) during stimulus encoding for objects that are later recognized.

I have: (1) EEG data from 64 electrodes for all 504 trials (trials of interest & non-interest) stored as 1000 individual files (by subj) OR (2) a single file which contains data from only 5 electrodes (central midline) for only 50 trials (only trials of interest) for all subs.

My question: The more data I have for an ML model, the better, right? Is it worth the effort & time to go back to the individual subj files so that I can include the entire cluster of 22 electrodes? Should I actually include all 64? Or is there a trade-off in feeding your model signal vs. noise? E.g., those 22 electrodes show the greatest SNR in my time window of interest, while others (e.g. EOGs or electrodes over lateral scalp) have lower SNR, i.e. the signal is much lower there.

THANKS!

k-jerbi commented 4 years ago

Hi Mark - These are good questions, and your project sounds exciting. We have done related work in my lab, i.e. predicting correct stimulus recognition from electrophysiological features recorded during encoding.

So your plan is to predict correct vs incorrect recognition based on EEG data acquired during stimulus encoding. That makes this a binary classification problem, and your question concerns the feature space you should ideally use (how many electrodes).

By the way: I am not sure what the trials of interest vs non-interest refer to in this data set (?). Also, you mention a window of interest around 400 ms; was this identified based on time-averaging all trials (ERPs)?

My recommendation would be to start with the 5-electrode data you have in order to establish a functional script that runs classification (with appropriate cross-validation) on the desired EEG features (are you planning on computing features from the raw signals, e.g. spectral power, or were you hoping to feed the raw signal amplitudes to the classifier?). Once you get this to "work", I would then recommend expanding to more data (more electrodes and more trials).

You will also have to decide whether you want to do multifeature classification, where data from all the electrodes are provided to the classifier, or whether you want to attempt single-feature classification (one electrode at a time). One advantage of the latter is that you get a DA (decoding accuracy, or AUC) for each electrode, and you can then plot classifier performance across the scalp using topographic maps. In terms of interpretation, this can be very helpful.
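A minimal sketch of what such a cross-validated classification script could look like in scikit-learn; the array shapes, labels, and random data below are placeholders standing in for the real 5-electrode amplitudes:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Placeholder data: 50 trials x (5 electrodes * 200 timepoints) of amplitudes
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5 * 200))   # stand-in for real single-trial EEG
y = rng.integers(0, 2, size=50)          # 1 = later recognized, 0 = not

# Standardize features, then fit a simple linear classifier
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified k-fold keeps the class ratio the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"AUC: {scores.mean():.2f} +/- {scores.std():.2f}")
```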

If you run multifeature classification then, depending on the classifier you use, you could extract feature importances (e.g. with a random forest), and this is again useful for interpretation.
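A sketch of extracting those importances with a random forest (again with placeholder data; the electrode labels are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 22))   # stand-in: 50 trials x 22 electrode features
y = rng.integers(0, 2, size=50)     # 1 = later recognized, 0 = not

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X, y)

# Impurity-based importance per feature; map each back to its electrode label
electrode_names = [f"E{i+1}" for i in range(X.shape[1])]  # placeholder labels
ranked = sorted(zip(electrode_names, rf.feature_importances_),
                key=lambda p: p[1], reverse=True)
for name, imp in ranked[:5]:
    print(f"{name}: {imp:.3f}")
```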

I hope this helps you get started. Happy brain-hacking.

illdopejake commented 4 years ago

Hi Mark,

Sounds like a great project! Karim had a nice answer, so I'll just piggy-back a bit off of what he said.

To be clear, the idea that "more data is better" really pertains to more observations (e.g. subjects/trials). Adding more features (i.e. electrodes/locations) can be fine if you have enough samples to support it, and/or you are doing some kind of data reduction, feature selection, or regularization (the latter is usually implicit in your ML algorithm).

In my personal experience, using fewer features that are a priori known to be more relevant to the question usually produces better models than throwing the whole kitchen sink at the estimator. However, the beauty of the train/validation/test framework is that you can actually test this empirically: experiment with cross-validation in your training set using all features, or only selected features, and see whether adding more features gives a substantial improvement.
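For example, a sketch of that empirical comparison on the training set (the 22-column subset indices are hypothetical placeholders for the a priori electrodes):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train = rng.standard_normal((40, 64))  # stand-in: 40 trials x 64 electrodes
y_train = rng.integers(0, 2, size=40)

clf = LogisticRegression(max_iter=1000)
apriori_cols = list(range(22))  # hypothetical indices of the 22 central electrodes

auc_all = cross_val_score(clf, X_train, y_train,
                          cv=5, scoring="roc_auc").mean()
auc_sel = cross_val_score(clf, X_train[:, apriori_cols], y_train,
                          cv=5, scoring="roc_auc").mean()
print(f"All 64 electrodes: {auc_all:.2f} | 22 a priori electrodes: {auc_sel:.2f}")
```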

However, I agree with Karim that, for now, it might make sense to get your code "working" on the set you have available to you now.

I think it's also very important to think about your study design here. Are you interested in predicting within subject or across subjects? This is an important question that will define a lot of what else you will do. Final note: once you decide on your design, don't forget to set aside some data as your eventual test set before you start playing around! ;-)
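A minimal sketch of setting that test set aside up front with scikit-learn (placeholder data; holding out 20% is an illustrative choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 64))   # stand-in for the real trial data
y = rng.integers(0, 2, size=50)

# Hold out 20% as a final test set; stratify to preserve the class ratio.
# Do all cross-validation and model tweaking on X_train only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```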

I'd be happy to provide any other help going forward! Best of luck.

MarkNelson86 commented 4 years ago

@k-jerbi @illdopejake Thanks for the very helpful and detailed insight! In response to Karim's questions:

(1) Trials of interest vs non-interest: The task was a 3-stimulus oddball detection task, followed by a recognition test for the oddballs. Subs responded to targets, and the oddballs were deviants. So trials of interest = oddball trials.

(2) Does the time window come from ERPs?: Yes, the time window is taken from grand-average ERPs of trials sorted by behavioral outcome (recognized vs not recognized).

(3) What am I planning to feed the classifier?: I'm planning to feed single-trial EEG amplitude to the classifier. Then, if there's time, to experiment with derived features like spectral power.

My goal is to predict single trial recognition within subject. I don't know if that's even feasible, considering how noisy single trial EEG data is, but it should be fun nonetheless.

Again, great help from you both. I am excited to put these concepts into action next week.

harveyaa commented 4 years ago

Hi Mark,

I'm not familiar with the kind of data you'll be using for prediction, and the answers above are great, so I'll just tag on a bit. It's definitely a good strategy to get your code "working" before making anything more complex. Once that's solid, I would go back to your EEG data from the 504 trials, filter it to the trials of interest, and then narrow from 64 electrodes to the ~22 that you know are most relevant (basically doing feature selection) to see if you can get better classification from the larger dataset. If interpretability of specific electrodes isn't important, you could do PCA on all the electrodes before classification for data reduction, and it might help with noise.
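A sketch of that PCA-then-classify idea as a scikit-learn pipeline, so the PCA is refit within each cross-validation fold and the held-out fold never leaks into the components (placeholder data; the component count is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 64))   # stand-in: 50 trials x 64 electrodes
y = rng.integers(0, 2, size=50)

# PCA lives inside the pipeline, so it is refit on each CV training fold
clf = make_pipeline(StandardScaler(), PCA(n_components=10),
                    LogisticRegression(max_iter=1000))
print(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```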

Looking forward to seeing your results!

MarkNelson86 commented 4 years ago

@k-jerbi @illdopejake @harveyaa

Hey all, sorry to bother you with this again, but I'm stuck and could use some advice. My issue is that my model is predicting a single class 100% of the time (binary classification), even though my test set is actually split 57% / 43% between the two classes. Here are the details:

MarkNelson86 commented 4 years ago

UPDATE: I tried another model, sklearn.svm.LinearSVC(), and got 49% prediction accuracy. When I shortened the data vectors to the 360-500 ms time interval, accuracy rose to 54%. So I guess I got it working, though I still don't understand why SVC & NuSVC failed.
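For reference, a sketch of that LinearSVC-plus-time-window approach, assuming epochs stored as a trials x electrodes x samples array with a known sampling rate (all shapes, rates, and data below are placeholders, not the actual recordings):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
sfreq, t0 = 250.0, -0.2                      # Hz, epoch start relative to stimulus
epochs = rng.standard_normal((50, 5, 250))   # stand-in: trials x electrodes x samples
y = rng.integers(0, 2, size=50)              # 1 = recognized, 0 = not

# Keep only the 360-500 ms window, then flatten to trials x features
lo = int((0.360 - t0) * sfreq)
hi = int((0.500 - t0) * sfreq)
X = epochs[:, :, lo:hi].reshape(len(epochs), -1)

clf = make_pipeline(StandardScaler(), LinearSVC(dual=False, max_iter=5000))
print(cross_val_score(clf, X, y, cv=5).mean())   # mean accuracy across folds
```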

Now I'm trying to figure out better features. I am reading Holdgraf et al. 2017 for this.