NDCLab / pepper-pipeline

tool | Python Easy Pre-Processing EEG Reproducible Pipeline
GNU Affero General Public License v3.0

Determine metrics to use for validating/comparing PEPPER-pipeline #298

Open georgebuzzell opened 3 years ago

georgebuzzell commented 3 years ago

Background: We have an existing test suite that checks the syntax/runtime of the pipeline. However, we need an expanded suite that computes a series of metrics for comparing the PEPPER pipeline first to MADE, and then to other pipelines. These metrics should indicate whether the pipeline is properly cleaning the data. However, some metrics require particular kinds of data (e.g., task vs. rest data, repeated visits, etc.), so identifying the data that will be used to test the pipeline is also integral to this issue.

Step 1a: Propose/discuss in the comments what the most appropriate metrics are to use. We should identify a list of metrics that can be computed, agree on what they reflect and whether they are useful, and agree on which ones are the "ideal" metrics to use vs. ones we will use only if we cannot access the data required for the ideal metrics.

Step 1b: Identify which combinations of samples (i.e., age, demographics), tasks (specific tasks and/or rest), and EEG systems (amp/net) the metrics should be computed for.

Step 2: After agreeing on the ideal list of metrics, we then need to identify the datasets that can be used to run the pipeline on and compute the metrics. It is possible that the required data will not be available for all metrics and/or all samples. In these cases, we need to agree on what metrics and/or samples are "good enough" for the initial validation (and paper) for PEPPER.

Defined as Done

  1. Consensus on the ideal metrics to use, and what they reflect
  2. Consensus on the ideal combination of samples/systems/tasks and metrics to compute
  3. Consensus on what constitutes a "good enough" combination of identified datasets and metrics
georgebuzzell commented 3 years ago

Initial suggestions:

Real-data metrics

General
- Percent of segments/trials retained
- Split-half reliability (after norming for average trial distance, to correct for inflated reliability when analyzing a "clump" of trials)
- Test/re-test reliability (after controlling for days in between)
- Correlation with template blink, saccade, EMG
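
A minimal sketch of the split-half (odd/even) reliability computation, assuming single-trial data as a NumPy array of shape (n_trials, n_channels, n_times); the Spearman-Brown correction step and all names are illustrative, not part of the pipeline.

```python
import numpy as np

def split_half_reliability(trials):
    """Odd/even split-half reliability of the trial-averaged ERP.

    trials: array of shape (n_trials, n_channels, n_times)
    Returns the Spearman-Brown corrected correlation between the ERPs
    computed from the odd- and even-numbered trials.
    """
    erp_odd = trials[1::2].mean(axis=0)
    erp_even = trials[0::2].mean(axis=0)
    # Correlate the two half-ERPs over all channels/time points
    r = np.corrcoef(erp_odd.ravel(), erp_even.ravel())[0, 1]
    # Spearman-Brown correction for halving the number of trials
    return 2 * r / (1 + r)
```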

Rest
- dB power of eyes-open/closed (or lights-on/off) alpha
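
A rough sketch of the eyes-open vs. eyes-closed alpha contrast, assuming two continuous rest segments as NumPy arrays sampled at `sfreq`; the 8-12 Hz band and the Welch parameters are assumptions.

```python
import numpy as np
from scipy.signal import welch

def alpha_db_contrast(rest_open, rest_closed, sfreq, band=(8.0, 12.0)):
    """dB difference in alpha power between eyes-closed and eyes-open rest.

    rest_open, rest_closed: arrays of shape (n_channels, n_samples)
    """
    def band_power(data):
        freqs, psd = welch(data, fs=sfreq, nperseg=int(2 * sfreq), axis=-1)
        mask = (freqs >= band[0]) & (freqs <= band[1])
        return psd[:, mask].mean()  # mean alpha power across channels

    return 10 * np.log10(band_power(rest_closed) / band_power(rest_open))
```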

Task
- SNR of a sensory, auditory, and cognitive ERP
- dB power of sensory/auditory alpha suppression
- dB power of theta/delta response to a control event
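
One common way to operationalize ERP SNR (a simple sketch, not the specific metric the thread later settles on) is the peak amplitude in a component window divided by the noise in the pre-stimulus baseline of the average; the window choices below are placeholders.

```python
import numpy as np

def erp_snr(trials, times, baseline=(-0.2, 0.0), window=(0.08, 0.15)):
    """Simple ERP SNR: peak amplitude in a window over baseline noise SD.

    trials: (n_trials, n_times) single-channel epochs
    times: (n_times,) time vector in seconds
    """
    erp = trials.mean(axis=0)
    base = (times >= baseline[0]) & (times < baseline[1])
    win = (times >= window[0]) & (times <= window[1])
    signal = np.abs(erp[win]).max()  # peak amplitude in the component window
    noise = erp[base].std()          # residual noise in the baseline of the average
    return signal / noise
```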

We could also use simulated-EEG metrics: simulate clean EEG data, simulate various kinds of noise and add it to the clean data, then test the correlation between the simulated clean data and the pipeline output after processing the noisy data.
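
A minimal sketch of that simulation check, assuming a hypothetical `run_pipeline` callable that takes and returns an (n_channels, n_samples) array; the noise model here (additive Gaussian plus a sinusoidal line-noise component) is purely illustrative.

```python
import numpy as np

def recovery_score(clean, run_pipeline, sfreq, noise_sd=5.0):
    """Correlation between simulated clean EEG and the pipeline's output
    after the clean data has been corrupted with synthetic noise.

    clean: (n_channels, n_samples) simulated clean EEG
    run_pipeline: callable taking/returning an (n_channels, n_samples) array
    """
    rng = np.random.default_rng(0)
    t = np.arange(clean.shape[1]) / sfreq
    line_noise = 2.0 * np.sin(2 * np.pi * 60 * t)  # simulated 60 Hz line noise
    noisy = clean + rng.normal(0, noise_sd, clean.shape) + line_noise
    cleaned = run_pipeline(noisy)
    return np.corrcoef(clean.ravel(), cleaned.ravel())[0, 1]
```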

F-said commented 3 years ago

Very interested in collaborating on this as well, @georgebuzzell. Is there any prerequisite knowledge I can review to contribute?

SMoralesPhD commented 3 years ago

I think the metrics proposed make sense. However, I have a couple of suggestions. In general, I would lean toward starting with something for which SNR and reliability are relatively easy to measure, like ERPs, and then build up from there. In my experience, using reliability on things like resting-state power is more difficult because high reliability is achieved very quickly, producing ceiling effects. Similarly, dB power is a bit of a headache because examining things like SNR or reliability is trickier when the baseline procedure is not linear, so it is not as straightforward as it is for ERPs. If we wanted to start with task ERPs in children, it would not be difficult to get datasets with sensory or response-locked ERPs. However, that gets harder for infants. We could do auditory ERPs with OIT, but I am not sure we have readily available datasets with visual or response-locked ERPs. We could also just start the process with children or even adults and then work our way to infant data. Finally, I think we could start with just internal-consistency reliability rather than test-retest; I am not sure we have datasets for test-retest. I am not opposed to doing it another way, but those would be my suggestions.

georgebuzzell commented 3 years ago

Good points, Santi. I'd like to first identify both what we ideally want to test and then what we can test. My thinking is that we can put out a call on Twitter for the public data we want to use and might get lucky; if not, we can always scale back.

I would think that, ideally, we want to test the pipeline on either 64- or 128-channel nets/caps for:
- an infant group (3-12 mos)
- a young child group (4-6 years)
- an adolescent group (10-18)
- an adult group (18-35)

Also, we ideally want resting data and event-related data for each. Moreover, it would be even better to have both auditory and visual ERP tasks for each, but I think having just one is fine. Similarly, I think stimulus- and response-locked data would be great but not crucial.

Also, while I proposed 4 age groups as ideal, I think even just having 2 to start with is ok, but 3 would be much better.

I'd like to discuss the non-linearity of dB power further, as it may or may not be an issue for our purposes. Regardless, I do agree that starting with easier metrics is the way to go, but I would still like us to nail down the metrics we want to have for the paper, even if we start with the easy ones first.

In my mind, although it is easier to identify measures of signal quality in ERP data, we really can't neglect the rest data and metrics, and should start there as well. I do agree that test-retest measures are likely not feasible, given the lack of datasets.

georgebuzzell commented 3 years ago

What are people's thoughts on measures of residual noise? I.e., creating group-average templates for blinks, saccades, and EMG, and then testing their correlation with the cleaned data?
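
A rough sketch of that residual-noise check, assuming a precomputed group-average blink template (a 1-D waveform) and one cleaned continuous channel; the sliding-window correlation approach and all names are illustrative.

```python
import numpy as np

def max_template_correlation(cleaned, template):
    """Maximum sliding-window correlation between a cleaned EEG channel
    and an artifact template (e.g., a group-average blink waveform).

    cleaned: (n_samples,) one cleaned channel (e.g., a frontal site for blinks)
    template: (n_template_samples,) group-average artifact waveform
    Lower values suggest less residual artifact in the cleaned data.
    """
    n = len(template)
    best = 0.0
    for start in range(0, len(cleaned) - n, n // 2):  # 50%-overlapping windows
        segment = cleaned[start:start + n]
        r = np.corrcoef(segment, template)[0, 1]
        best = max(best, abs(r))
    return best
```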

georgebuzzell commented 3 years ago

What about using FOOOF metrics? E.g., presence of a 1/f component, ability to detect an alpha peak, etc.?
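
A minimal sketch of pulling those two metrics from a precomputed power spectrum with the `fooof` package; the fit range, peak settings, and alpha band are assumptions.

```python
import numpy as np
from fooof import FOOOF

def fooof_metrics(freqs, spectrum, alpha_band=(8.0, 12.0)):
    """Fit a FOOOF model and extract simple quality metrics:
    the aperiodic (1/f) exponent and whether an alpha peak was detected.

    freqs: (n_freqs,) frequency vector
    spectrum: (n_freqs,) power spectrum in linear units
    """
    fm = FOOOF(max_n_peaks=6, verbose=False)
    fm.fit(freqs, spectrum, freq_range=[1, 40])
    exponent = fm.get_params('aperiodic_params', 'exponent')
    peaks = fm.peak_params_  # rows of [center frequency, power, bandwidth]
    has_alpha = any(alpha_band[0] <= cf <= alpha_band[1] for cf in peaks[:, 0])
    return {'aperiodic_exponent': exponent,
            'alpha_peak_detected': bool(has_alpha),
            'fit_r_squared': fm.r_squared_}
```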

georgebuzzell commented 3 years ago

I also think that while we want to compute reliability metrics, they have lower ground-truth validity in terms of detecting clean data. That is, if a pipeline arbitrarily removes 99% of the variance, then any reliability metrics computed on what is left will be very high, even if nearly all the good and bad data have been removed.

I do think reliability is a useful metric we want to compute, but, similar to the percentage of trials retained, it has lower ground-truth validity, imo.

georgebuzzell commented 3 years ago

Does anyone know if we can get access to a dataset, for each age, that has rest during either eyes open/eyes closed or lights on/lights off?

F-said commented 3 years ago

Dataset | Metadata | Workflow | Metrics | Challenges

georgebuzzell commented 2 years ago

The CMI dataset will give us up to 128 channels (we can of course downsample to 64) for the adolescent and adult groups, for ERP and rest. However, we would need a similar infant dataset...

georgebuzzell commented 2 years ago

Other possible datasets for infants/toddlers:

- https://figshare.com/articles/dataset/infant_EEG_data/5598814
- https://doi.org/10.5281/zenodo.998965
- https://www.kaggle.com/vicolab/eeg-looming
- https://nyu.databrary.org/volume/1006 (but have to email for raw data)

Also, I emailed a couple of researchers with high-density infant data to see if they might have the kind of data we are looking for and might be willing to share

georgebuzzell commented 2 years ago

@SMoralesPhD @trollerenfr

georgebuzzell commented 2 years ago

@trollerrenfr

trollerrenfr commented 2 years ago

I would recommend adding condition differences to the list.

I think internal validity is covered well above, but we need external validity too.

SMoralesPhD commented 2 years ago

Here is Luck's paper: https://doi.org/10.1111/psyp.13793
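
For reference, the paper above introduces the standardized measurement error (SME). For mean-amplitude scores, the analytic SME is the standard deviation of the single-trial mean amplitudes divided by the square root of the trial count; a minimal sketch, with the measurement window as a placeholder:

```python
import numpy as np

def analytic_sme(trials, times, window=(0.3, 0.5)):
    """Analytic standardized measurement error (SME) for a mean-amplitude score,
    following Luck et al. (2021): SD of single-trial mean amplitudes / sqrt(n).

    trials: (n_trials, n_times) single-channel epochs
    times: (n_times,) time vector in seconds
    """
    win = (times >= window[0]) & (times <= window[1])
    single_trial_means = trials[:, win].mean(axis=1)  # mean amplitude per trial
    return single_trial_means.std(ddof=1) / np.sqrt(len(single_trial_means))
```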

georgebuzzell commented 2 years ago

Per @trollerrenfr's comment, we should also be sure to include ERP condition differences in the initial tests.

Revised list to move forward with:

  1. Condition difference (difference score) in P1 or N1 and P3 for the task data.

  2. Signal-to-noise ratio ("Luck" method that includes ERP & baseline noise); specifically, the "new" metric that accounts for the ERP amplitude, not just the baseline. We will compute this for conditions and condition differences.

  3. Eyes open/closed condition difference (difference score) for alpha power, for ages that have it.

  4. Luck measure applied to frequency bins across the 1-40 Hz spectrum (~2 Hz bins) for all ages.

Starting with the CMI data first; don't worry about infant data for now.