NDCLab / brainBox

a suggestion box for brainy ideas
GNU Affero General Public License v3.0

tool-EEG4p #7

Open F-said opened 3 years ago

F-said commented 3 years ago

Problem: EEG preprocessing tools vary in the source data they use for their example analyses. Because each tool is demonstrated in a different context, it is hard to compare the effectiveness of one tool against its competitors.

Alternatives:

Design: Inspired by the defects4j database/framework, I propose an open-source, standardized set of diverse, unprocessed EEG data, together with a Python framework for generating data, plugging in pipelines, and testing the cleaned output.
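To make the plug-in idea concrete, here is a minimal sketch of what a pipeline plug-in interface could look like. All names here are illustrative placeholders, not a committed API:

```python
from abc import ABC, abstractmethod


class Pipeline(ABC):
    """Hypothetical plug-in contract each wrapped EEG tool would satisfy."""

    @abstractmethod
    def run(self, raw_path: str, out_path: str) -> None:
        """Preprocess the raw EEG file at raw_path, writing results to out_path."""


class ExamplePipeline(Pipeline):
    """Trivial stand-in: a real adapter would invoke the tool's own routines."""

    def run(self, raw_path: str, out_path: str) -> None:
        # Placeholder "preprocessing": copy input to output unchanged.
        with open(raw_path, "rb") as src, open(out_path, "wb") as dst:
            dst.write(src.read())
```

The framework would then only need a list of `Pipeline` objects; each tool's internals stay hidden behind `run`.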

It would ship with a container for running, testing, and development. Some creative engineering here could make it easy to containerize varying EEG tools across languages, e.g., a script that parses a tool's import statements and appends those libraries and their dependencies to the container. The container would also include the scripts necessary for parallel processing on HPC clusters.
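For the Python case, the import-parsing step could be sketched as below. This is a toy version: it only collects top-level module names and emits hypothetical container-build lines, whereas a real implementation would also map module names to installable packages and resolve transitive dependencies:

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Collect top-level module names imported by a Python script."""
    mods: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            mods.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            mods.add(node.module.split(".")[0])
    return mods


def dockerfile_lines(mods: set[str]) -> list[str]:
    """Emit illustrative container-build lines installing each dependency."""
    return [f"RUN pip install {mod}" for mod in sorted(mods)]
```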

This framework would directly inherit the testing metrics developed for the PEPPER-pipeline.

Informally, the design would look like this:

[Figure: EEG4p informal diagram]

Each tool is manually pulled into a cluster (preferably) by the user. The common test suite is run on the output of each pipeline (all of which are run in parallel by the cluster). A multidimensional dataset recording each metric for each pipeline is then created and saved upon completion:

[Figure: metrics dataset]
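The fan-out described above, running the common test suite over every pipeline's output and tabulating the results, could be sketched as follows. Names and the metric functions are hypothetical, and a real run would dispatch jobs through an HPC scheduler rather than a local thread pool:

```python
from concurrent.futures import ThreadPoolExecutor


def run_suite(output_path: str, metrics: dict) -> dict:
    """Apply every metric function to one pipeline's output file."""
    return {name: fn(output_path) for name, fn in metrics.items()}


def benchmark(pipeline_outputs: list[str], metrics: dict) -> dict:
    """Score all pipeline outputs in parallel, yielding a pipeline x metric table."""
    with ThreadPoolExecutor() as pool:
        scores = pool.map(lambda path: run_suite(path, metrics), pipeline_outputs)
        return dict(zip(pipeline_outputs, scores))
```

The returned nested mapping is the "multidimensional dataset" in miniature; it could equally be saved as a labeled array or a tidy table.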

Lastly, EEG4p stands for EEG-4-Processing.

Funding:

Authors: If a paper were to be written, and considering the prerequisite EEG knowledge required to compose such a dataset, I propose that someone knowledgeable about EEG data take on the first-author role, while I focus on the work of implementing the testing framework, makefiles, and data hosting.

But on the software side of things, anyone who makes a valid contribution is a contributor.

Milestones: Assume planning for this project begins October 1st. Perhaps the following schedule could work:

  1. Lab presentation. [November 1st, 2021]
  2. 1st pre-release announced: stable for contributors, with tests for the framework in place. [November 22nd, 2021]
  3. 1st release announced, stable for researchers [December 1st, 2021]
  4. Perhaps a paper is written on this approach? [January 2022?]
georgebuzzell commented 2 years ago

@F-said I like this idea a lot! Honestly, I think the idea is well-thought-out, clearly articulated, and would be a much-needed addition to the field. I have no "major" suggestions for improving/refining the idea. Thus, I feel a PR is imminent... ;) I think the main question is simply a matter of time/resources that we need to think through: WHEN to develop this, and where it sits in relation to other ongoing priorities. I am not implying that this should not be bumped high up the list, just that we will need to discuss...

My main questions are as follows:

  1. I think it is an excellent idea to create the standard benchmark database and the infrastructure to run tests against it. The main thing to work out is what "kinds" of datasets to include. This should likely be a continually evolving project, where more benchmark datasets are added over time. So, I think two things are needed: 1) A semi-exhaustive list of all the kinds of datasets we would ultimately want included, and how many participants each dataset should include. Kinds of data can be defined on several dimensions, including (but not limited to): EEG system, sampling rate, number of electrodes, kinds of electrodes, age of participants, standard demographic variables, length of recordings, tasks used for data collection, and individual differences among the participants (demographic variables, age, psychiatric diagnoses). 2) Which kinds of datasets to include in the initial release, and what to add in the planned 2nd/3rd releases. Most likely it makes sense to start with only medium-to-high-density datasets, from only 1-2 systems, at maybe 2-3 age points, and only "resting state" plus 1-2 tasks.
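The dataset dimensions listed above could be captured in a small metadata schema so that benchmark datasets stay queryable as the collection grows. A sketch, with purely illustrative field names (not a finalized schema):

```python
from dataclasses import dataclass


@dataclass
class DatasetDescriptor:
    """Hypothetical metadata for one benchmark EEG dataset."""

    eeg_system: str          # e.g., manufacturer/model of the recording system
    sampling_rate_hz: int    # samples per second
    n_electrodes: int        # channel count (density)
    electrode_type: str      # e.g., "wet" or "dry"
    age_group: str           # e.g., "infant", "child", "adult"
    n_participants: int      # participants included in the dataset
    recording_minutes: float # length of each recording
    task: str                # e.g., "resting state" or a named task
```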
  2. The other thing to think about is whether simulated data should also be included in the initial release. I lean towards yes.
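As a note on what "simulated data" could mean concretely, here is a toy sketch: a single 10 Hz alpha oscillation plus Gaussian noise. This is purely illustrative; real benchmark simulations would use a dedicated forward model, not one sinusoid:

```python
import math
import random


def simulate_eeg(n_samples: int, fs: float, alpha_hz: float = 10.0,
                 noise_sd: float = 0.5, seed: int = 0) -> list[float]:
    """Toy single-channel EEG: alpha-band sinusoid plus Gaussian noise."""
    rng = random.Random(seed)  # seeded so the benchmark signal is reproducible
    return [
        math.sin(2 * math.pi * alpha_hz * (i / fs)) + rng.gauss(0, noise_sd)
        for i in range(n_samples)
    ]
```

The appeal of simulated data is that the "ground truth" signal is known exactly, so a pipeline's output can be scored against it directly.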

Again, this is a FANTASTIC idea @F-said!

georgebuzzell commented 2 years ago

@DMRoberts @stevenwtolbert @SDOsmany @yanbin-niu Would love to hear any input from you all on this idea from @F-said. Any suggestions for improvement? Key things that @F-said and I might not be thinking about?

PS: For those new to "BrainBox": note that this is a place to propose and discuss POSSIBLE new projects for the NDCLab. Ideas start here, and anyone can propose them. We give each other feedback, determine whether the idea is viable and worth pursuing, and, if so, decide when it makes sense to pursue it. For now, the main focus is on feedback, so any and all thoughts are welcome. Of course, you should also feel free to propose an idea of your own; if viable, it can become a project that we move ahead with in the lab.