NDCLab / pepper-pipeline

tool | Python Easy Pre-Processing EEG Reproducible Pipeline
GNU Affero General Public License v3.0

modify dev-feature-io to split single eeg recording into separate tasks and write as separate bids files #16

Closed: georgebuzzell closed this issue 3 years ago

georgebuzzell commented 3 years ago

Currently, dev-feature-io does this:

Reads in the mff file, writes to fif, loads it back in, extracts markers/events, and then writes in bids format, albeit as one huge file.

What we need dev-feature-io to do is this:

dev-feature-io reads in the mff file, writes to fif, loads it back in, and extracts markers/events. It then identifies the locations where the recording was paused, which separate the long recording into 11 distinct task segments, and identifies the marker code near the beginning of each segment to determine which task that segment corresponds to. Finally, it iteratively writes each task segment as a separate (bids-compatible) file, all within the same folder.

@DMRoberts @Jonhas @DavyNeat @yanbin-niu @SDOsmany @Pranjali051 @stevenwtolbert @CamachoBry

georgebuzzell commented 3 years ago

This feature does not include adding the ability to also write bids-formatted channel and coordinate system files; that will be dealt with as a separate feature, which can be developed in parallel.

georgebuzzell commented 3 years ago

The current plan is to identify the latency of the task markers and to cut the data into separate files, based on the latency of those task markers. These files will each end up including some amount of the "pause" segments that just contain zeros, but that is fine.

For example, the first file would run from the start of the recording to the latency of the 2nd task marker minus 1 (note that the first file does not start with the first task marker but with the start of the recording, and ends at the second marker minus 1 sample point). The second task file would start at the latency of the second task marker and end at the latency of the third marker minus 1. Repeat for all tasks. Each task is written with all bids files. For each task file written, the event file should only include the events that occur DURING THAT FILE; moreover, we need to ensure that the event latencies in the event file are correct for that file.
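A minimal sketch of that boundary logic (hypothetical variable names and values; assumes the task-marker onsets, in samples, are already sorted by latency):

# hypothetical: sorted task-marker onsets (in samples) and total recording length
marker_onsets = [51474, 817000, 1450000]
n_samples = 2891296

# the first segment starts at sample 0, not at the first marker;
# each segment ends one sample before the next marker, the last at the end of the recording
starts = [0] + marker_onsets[1:]
ends = [onset - 1 for onset in marker_onsets[1:]] + [n_samples - 1]
segments = list(zip(starts, ends))  # segment i spans samples starts[i]..ends[i], inclusive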

To identify which task name corresponds to which event marker, please refer to the meta data here: https://drive.google.com/file/d/1Q6_Y6qb1VJVhLdV6CQkOKaD0mrN2Rtoc/view?usp=sharing

Filename pattern:

xx_vis_learn_xx = Sequence Learning Paradigm
xx_SurrSupp_Block_xx = Surround Suppression Paradigm
xx_SAIIT_2AFC_Block_xx = Contrast Change Detection Paradigm
xx_WISC_ProcSpeed_xx = Symbol Search Paradigm
xx_Video_xx = Naturalistic Viewing Paradigm
RestingState_xx = Resting State Paradigm (no behavioral data)

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Triggers for Start of the Paradigm:

90 = Resting State EEG; 91 = Sequence Learning Paradigm; 92 = Symbol Search Paradigm; 93 = Surround Suppression Paradigm Block 1; 94 = Contrast Change Paradigm Block 1;
95 = Contrast Change Paradigm Block 2; 96 = Contrast Change Paradigm Block 3; 97 = Surround Suppression Paradigm Block 2; 81 = Video1; 82 = Video2; 83 = Video3; 84 = Video4;

Ultimately, we may refine our approach for separating the task files, so that we further remove the periods of time where the recording was paused (and where only zeros are recorded for data). However, for now, we are just allowing these segments to remain, as they will be removed at later preprocessing stages anyway.

yanbin-niu commented 3 years ago

@georgebuzzell A quick question: do you know where we can find information about "task markers"? I just cannot find such information. Sorry if I am missing something here.

My preliminary understanding is: we can find events based on their individual channels, for example 90, and then we have an onset for 90 and an onset for 91. Then we crop that time period. We cannot use event codes here, since channel names and event codes do not match. See below: there is a pulse in channel 91, for example:

[screenshot: raw stim channels, showing a pulse in channel 91]

Based on this idea, the code would be:

import mne

# raw: the Raw object loaded from the FIF file
# note: the trailing spaces are part of the EGI channel names
events_90 = mne.find_events(raw, stim_channel='90  ', shortest_event=1, output='onset')
events_91 = mne.find_events(raw, stim_channel='91  ', shortest_event=1, output='onset')
# convert the sample points into seconds
onset_90 = events_90[0][0] / raw.info['sfreq']
onset_91 = events_91[0][0] / raw.info['sfreq']

# crop from the onset of 90 to the onset of 91
raw_90 = raw.copy().crop(tmin=onset_90, tmax=onset_91)

This gives us the raw data for seg01; the problem is that we cannot write this cropped raw data into BIDS directly. See below:

[screenshot: error raised when writing the cropped raw data directly to BIDS]

I tried saving raw_90 to .fif, reading it back, and then saving to BIDS. It works, and the results are below:

[screenshot: BIDS output for the cropped segment]

We can handle the other paradigms the same way. Let me know if I am missing something about "task markers", whether I am on the right track, or if there is a better way to approach this.

I have not seen a similar example, so I cannot give a link to verify the idea. I got this from the mne.find_events function (https://mne.tools/stable/generated/mne.find_events.html?highlight=find_events#mne.find_events); for the parameter stim_channel: "Name of the stim channel or all the stim channels affected by triggers. ......". And I double-checked the duration of seg 1 (765.412 seconds): it matches the onset time of 91 from the initial raw data (see figure 1).

yanbin-niu commented 3 years ago

I just realized that even if the approach I mentioned is possible, there is an issue: it is really limited to a specific experiment (it depends on knowing the sequence of tasks - 90->91->...). Even if we could get that information from user input, the users would have to know the exact names of the channels (notice there are even some trailing spaces after 91 in that exact channel name), which I do not think is feasible. Maybe we can figure out a way to read the list of STIM channels, but the task channels and event channels are all mixed together (unsure how to distinguish them), so you still cannot get the exact channel name. Following this idea, we might still find a way to get the exact channel name if that channel has just one event (recall that when you call find_events, you have to set shortest_event=1, meaning the channel has only one event). However, I really doubt that current MNE can do such a thing, and we may end up having to dig into the MNE code (which is possible, but I am unsure how much time we would spend on this issue).

Sorry, I just realized I may have gone in the wrong direction. I think I understand what you mean by "task markers". I think we should work from the events in sti 14.

georgebuzzell commented 3 years ago

@yanbin-niu Thank you for working on this!

From the end of your second comment, it seems that you understand what I was referring to as "task markers"; that is, I am simply referring to using the following "task markers" (also called event markers, or stimulus markers, or triggers, etc.):

Triggers for Start of the Paradigm:

90 = Resting State EEG; 91 = Sequence Learning Paradigm; 92 = Symbol Search Paradigm; 93 = Surround Suppression Paradigm Block 1; 94 = Contrast Change Paradigm Block 1; 95 = Contrast Change Paradigm Block 2; 96 = Contrast Change Paradigm Block 3; 97 = Surround Suppression Paradigm Block 2; 81 = Video1; 82 = Video2; 83 = Video3; 84 = Video4;

I was just saying in a general sense that we need to reference those in some way, identifying their latency and then extracting the relevant data. I apologize if my lack of specificity here led to confusion! Of course, there are different ways to implement what I said (I think) and the way you mentioned is one approach (referencing the individual stim channels). However, similar to your most recent comment, I was originally thinking of referencing the events through "sti 14", which would seem more straightforward to me. Though, I do not have a strong preference, as long as the implementation "works" :)

Clarification: Unless I am thinking about it wrong, I actually don't think we need to know the sequence of tasks per se, or at least, we want to implement this such that that would not be necessary. Instead, we just need to know what the task numbers are (which all users always will). Using this information, the code should first find the latency of all of those task markers. After that, simply use markers 1 and 2 to figure out what task 1 was and when it ended (latency of marker 2 minus 1 sample). Next, use markers 2 and 3 to find what the second task was (identity of marker 2) and when it ended (latency of marker 3 minus 1 sample), etc. Does that make sense?

Minor correction: It looks like you are cropping so that the recording for task 1 ends at the latency of marker 2. However, you want to be sure to have it end at the latency of marker 2 minus 1 sample point.
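In MNE terms, that could look like this (a sketch continuing from the raw object loaded above; the onset values here are hypothetical and in samples, unlike the seconds used in the earlier snippet):

# hypothetical: marker onsets in samples
onset_samp_1, onset_samp_2 = 51474, 817000
sfreq = raw.info['sfreq']

# end the crop one sample before the next marker (crop times are in seconds)
raw_task1 = raw.copy().crop(tmin=onset_samp_1 / sfreq, tmax=(onset_samp_2 - 1) / sfreq)
# equivalently, include_tmax=False excludes the sample at tmax
raw_task1 = raw.copy().crop(tmin=onset_samp_1 / sfreq, tmax=onset_samp_2 / sfreq, include_tmax=False)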

Finally, I did not realize that you cannot write directly to bids after using "crop". However, the workaround of writing to fif first and then bids seems like a good idea! Good catch!

@yanbin-niu please do let me know if I misunderstood any of your comments. Also, please do not hesitate to let me know if you want to jump on a short call sometime this week to discuss. Moreover, thank you so much for working on this!

yanbin-niu commented 3 years ago

@georgebuzzell I have tried to use sti 14 to figure out a way to approach this, but I cannot find one, since: 1) there is no unique and unambiguous way to find "task markers" from the event codes; 2) there is no latency for the "task markers" (e.g., 90, 91, etc.); I mean, their duration is 0. However, I could be totally wrong here.

Rather, how I would approach this is to work with the individual stim channels. As you said, assuming we have a list of the task numbers, i.e., events_list = np.array(['90','91','92','93','94','95','96','97','81','82','83','84']), we do not have to specify the sequence. Then the code is:

import mne
import numpy as np

# raw: the Raw object loaded from the FIF file
events_list = np.array(['90', '91', '92', '93', '94', '95', '96', '97', '81', '82', '83', '84'])

# build an array of task markers and onset sample points
events_onset = np.empty((0, 2))
for event in events_list:
    # EGI channel names are padded with trailing spaces to four characters
    events_name = event.ljust(4)

    # get the onsets for this channel
    onset = mne.find_events(raw, stim_channel=events_name, shortest_event=1, output='onset')
    # the channel may contain more than one event
    for j in range(len(onset)):
        arr = np.array([[event, onset[j][0]]])
        events_onset = np.append(events_onset, arr, axis=0)

Then we get an array of task markers and onset sample points:

[screenshot: array of task marker codes and onset sample points]

Then, we write BIDS:

# convert type to int
events_onset = np.asarray(events_onset, dtype=np.int64)
# sort by sample point
index = np.lexsort([events_onset[:, 1]])
events_onset = events_onset[index]
# file name dict
filename_dic = {90: 'RestingState_xx',
                91: 'xx_vis_learn_xx',
                92: 'xx_WISC_ProcSpeed_xx',
                93: 'xx_SurrSupp_Block_xx',
                94: 'xx_SAIIT_2AFC_Block_xx',
                95: 'xx_SAIIT_2AFC_Block_xx',
                96: 'xx_SAIIT_2AFC_Block_xx',
                97: 'xx_SurrSupp_Block_xx',
                81: 'xx_Video_xx',
                82: 'xx_Video_xx',
                83: 'xx_Video_xx',
                84: 'xx_Video_xx'}
# write each task segment
for i in range(len(events_onset)):
    # the last marker has no following marker to define its end, so skip it
    if i == (len(events_onset) - 1):
        break

    # crop & convert sample point into seconds & save as .fif
    raw_cropped = raw.copy().crop(tmin=events_onset[i][1]/raw.info['sfreq'], 
                           tmax=(events_onset[i+1][1]-1)/raw.info['sfreq'])
    raw_cropped.load_data()
    filename_tmp = filename_dic[events_onset[i][0]] + str(i) + '_raw.fif'
    raw_cropped.save(pathlib.Path('out_data') / filename_tmp, overwrite=True)

    # read back
    raw_path_tmp = pathlib.Path('out_data') / filename_tmp
    raw_temp = mne.io.read_raw_fif(raw_path_tmp)

    # Dan's code 
    events_temp = mne.find_events(raw_temp, stim_channel='STI 014', shortest_event=1)
    event_annotations_temp = mne.annotations_from_events(events=events_temp, 
                                                     sfreq=raw.info['sfreq'], 
                                                     orig_time=raw.info['meas_date'])
    raw_temp.set_annotations(raw_temp.annotations + event_annotations_temp)
    raw_temp.pick_types(eeg=True)

    # write BIDS
    raw_temp.info['line_freq'] = 60
    out_path = pathlib.Path('out_data') / (filename_dic[events_onset[i][0]] + str(i))
    bids_path = mne_bids.BIDSPath(subject='01',
                                  session='01',
                                  task='test',
                                  run='01',
                                  root=out_path)
    mne_bids.write_raw_bids(raw_temp, bids_path, format='auto', overwrite=True)

The results are:

[screenshot: BIDS output folders for each task segment]

Note: for 82, the data has two events (my understanding is that task 82 was run twice), so we have 12 files rather than 11.

Correct me if I misunderstood something.

georgebuzzell commented 3 years ago

@yanbin-niu Ah, yes, I see your point! My apologies, this was based on a misunderstanding on my end. I did not realize that "sti 14" did not contain latency information. That is, I believe that you are correct that "Sti 14" only contains the task codes. A combination of the event codes and the event latency is only created later using "mne.annotations_from_events" as @DMRoberts previously pointed out, right?

Taking the above into consideration, I think the way that you implemented the splitting is great! It appears to me that it works well. However, I will double-check on the number of tasks.

@yanbin-niu would you be able to push your code to a branch with your name so that we can test it out? Again, thank you for working on this, and my apologies for the confusion over "Sti 14"!

DMRoberts commented 3 years ago

@georgebuzzell @yanbin-niu I believe the event structure derived from 'STI 014' does have latency information: it returns a 2D numpy array with three columns, where the first column is the sample latency and the third column is the value of the event. The second column just represents the value in the event channel prior to the event onset, so it's not really of interest to us:

array([[   2486,       0,       1],
       [  51474,       0,       2],
       [  61473,       0,       3],
       ...,
       [2787602,       0,      34],
       [2789760,       0,      44],
       [2891295,       0,      45]])

But @yanbin-niu I think what you may have noticed is that the events we're interested in here (like 90, 91, etc.) seem to be missing from the list of events derived from 'STI 014', even though they are present in their own individual channels, as you showed above. Instead of values like 90, 91, etc., the numeric codes only go from 1 to 45.

I looked into this a bit more, and it seems that a mapping from channel name string to numeric event code is produced when we originally load the MFF file, but that mapping isn't preserved when we save the MFF data to FIF:

From the page https://mne.tools/stable/generated/mne.io.read_raw_egi.html :

"The event_id assignment equals np.arange(n_events) + 1. The resulting event_id mapping is stored as attribute to the resulting raw object but will be ignored when saving to a fiff. Note. The trigger channel is artificially constructed based on timestamps received by the Netstation. As a consequence, triggers have only short durations."

So we have to go back to the MFF to FIF stage, but we can then use that info to label the annotations with the event names rather than just integers from 1 to 45:

import mne

input_path = '/Users/danielroberts/Downloads/NDARAB793GL3.mff'
output_path = '/Users/danielroberts/Downloads/NDARAB793GL3_eeg_raw.fif'

# here, if I didn't explicitly tell MNE not to exclude any channels, it was deciding to exclude channel 90
data = mne.io.read_raw_egi(input_path, preload=True, verbose=True, exclude=[])
# get the mapping from channel name to integer event ID
event_keys = data.event_id
# extract the events from the 'STI 014' composite channel
events = mne.find_events(data, stim_channel='STI 014', shortest_event=1)
# reverse the dict so that the integer IDs are keys and the names are values, stripping any trailing whitespace from the event name strings
event_map = dict((v,k.rstrip()) for k,v in event_keys.items())
# use the event numpy array with integers, and the event_map going from integer to string names, to create annotations with the correct labels
event_annotations = mne.annotations_from_events(events=events, sfreq=data.info['sfreq'],
                                                orig_time=data.info['meas_date'], event_desc=event_map)
# add the event annotations to the existing annotations
data.set_annotations(data.annotations + event_annotations)
# keep only the EEG channels, removing the stimulation channels
data.pick_types(eeg=True)
data.save(output_path, overwrite=True)

DMRoberts commented 3 years ago

This is just an adapted version of @yanbin-niu 's code to crop the labelled FIF data and write to BIDS.

I just made a few changes:

  1. Rearranged things a little to pull events from FIF annotations
  2. I put the task name into the 'Task' field of the BIDS output
  3. After putting task names in the BIDS field, I realized that dashes aren't allowed in BIDS task names, so I removed them
  4. All tasks are written to the same 'BIDS Path' which puts them into the same folder per participant
  5. Removed the FIF files after the BIDS file is written

I think if people have multiple blocks of the same task (for example, the contrast change task), those should probably be separate 'runs' of the same 'task' in BIDS conventions?

import mne
import mne_bids
import numpy as np
import os
import pathlib

root_dir = '/Users/danielroberts/BIDS_TEST/'
fif_path = '/Users/danielroberts/Downloads/NDARAB793GL3_eeg_raw.fif'

# mapping of event codes to block starts
filename_dict = {90: 'RestingState',
                91: 'SequenceLearning',
                92: 'SymbolSearch',
                93: 'SurrSuppBlock1',
                94: 'ContrastChangeBlock1',
                95: 'ContrastChangeBlock2',
                96: 'ContrastChangeBlock3',
                97: 'SurrSuppBlock2',
                81: 'Video1',
                82: 'Video2',
                83: 'Video3',
                84: 'Video4'}

# read the FIF
data_fif = mne.io.read_raw(fif_path)
data_fif.info['line_freq'] = 60

# get events from the annotations and convert to integers on load
events = mne.events_from_annotations(data_fif, int)
events_onset = events[0]

# reduce to only the events that indicate block starts
block_events = events_onset[np.isin(events_onset[:,2], list(filename_dict)),:]

# each block end is the time of the next block start
# the first block starts at 0, the final block ends at the maximum of the recording
block_starts = block_events[:,[0]] / data_fif.info['sfreq']
block_starts[0] = 0
block_ends = np.roll(block_starts, -1)
block_ends[-1] = (data_fif.n_times-1) / data_fif.info['sfreq']

# assemble an array with an event code, start time (in seconds) and end time (in seconds)
block_data = np.hstack([block_events[:,[2]], block_starts, block_ends])

# write out each block as a separate file
for b in block_data:

    task_name = filename_dict[b[0]]

    # crop & convert sample point into seconds & save as .fif
    raw_cropped = data_fif.copy().crop(tmin=b[1], tmax=b[2], include_tmax=False)
    temp_path = pathlib.Path('out_data') / (task_name + '_raw.fif')
    raw_cropped.save(temp_path, overwrite=True)

    # read back
    raw_temp = mne.io.read_raw_fif(temp_path)

    # write BIDS
    bids_path = mne_bids.BIDSPath(subject='01',
                                  session='01',
                                  task=task_name,
                                  run='01',
                                  root=root_dir)
    mne_bids.write_raw_bids(raw_temp, bids_path, format='BrainVision', overwrite=True)

    # remove the temporary FIF
    os.remove(temp_path)

yanbin-niu commented 3 years ago

@DMRoberts I just tried what you mentioned about mapping the task IDs (90, 91, ...) to event codes (37, 38, ...), and it actually works. So we have a more elegant way to get the onset sample points:

import mne
import pathlib
import numpy as np

raw_path = pathlib.Path('out_data') / 'eeg_test_raw_event_corrected.fif'
raw = mne.io.read_raw_fif(raw_path)

# get events from annotations
events, event_id = mne.events_from_annotations(raw)

# tasks list
task_list = np.array(['90','91','92','93','94','95','96','97','81','82','83','84'])

# corresponding event list
events_list = [event_id[i] for i in task_list]

# get the array of task ids and onset sample points, change type, and sort
events_onset = events[np.isin(events[:, 2], events_list)]
events_onset = np.asarray(events_onset, dtype=np.int64)
index = np.lexsort([events_onset[:, 0]])
events_onset = events_onset[index]
events_onset  # display the result

Here I get an array of task markers and onset sample points:

[screenshot: array of event codes and onset sample points derived from annotations]

This array is similar to the one I got from working with the individual stim channels. But I do like this way better, since it seems more robust and looks nicer (looking nice is really IMPORTANT).

Btw, when I said latency, I meant duration. I just got confused by some of the terms we used...

@georgebuzzell "sti 14" has onset sample point, duration, and event codes.

[screenshot: events array from STI 014 with onsets, durations, and event codes]

I pushed my first version of the code (integrating @DMRoberts's way of mapping task codes to event codes) so you can test it. Start from 01-mff_to_fif.ipynb and put NDARAB793GL3.mff in the folder containing the .ipynb code, then run 02-fif_to_BIDS. You should then get the results.

Or you can test with @DMRoberts's code -- I just had some problems with his most recent code; I think I may be missing some definitions, since @DMRoberts only posted parts of the code. Or @DMRoberts can push the complete code to a branch.

yanbin-niu commented 3 years ago

@georgebuzzell I retried @DMRoberts's code and realized that 1) we want to put all the task files under one BIDS folder (Dan's way) rather than separate BIDS folders (my way); 2) I should not skip the last event as I did in my code. So, Dan's code is definitely the way to go :).

I pushed @DMRoberts's code. Start from 01-mff_to_fif.ipynb and put NDARAB793GL3.mff in the folder containing the .ipynb code, then run 02-fif_to_BIDS_dan.ipynb. You should then get the results.

@DMRoberts one quick question: for task 82, my understanding is that task 82 was run twice (the two markers have different onset sample points). So I guess we should think of a way to deal with this situation (run01 & run02)? Since with your way, the second run of task 82 will overwrite the first run. Correct me if I understand it wrong.

georgebuzzell commented 3 years ago

@DMRoberts and @yanbin-niu THANK YOU, BOTH, for working so hard on this. Awesome progress here.

@DMRoberts in response to your question: "I think if people have multiple blocks of the same task (for example, the contrast change task), those should probably be separate 'runs' of the same 'task' in BIDS conventions?" - Yes, that would be my understanding, so let's go with that!

@yanbin-niu just to clarify: You are suggesting to go with the method that @DMRoberts posted above, as opposed to the method that you posted above, correct? Or are you suggesting a hybrid of the two? Sorry, I just want to be 100% sure.

@yanbin-niu in response to your question about task 82 being run twice and needing a way to deal with this situation (run01 & run02): Yes, we do want to deal with this, and I think the method all three of us are agreeing on is to denote repeated runs of the same task as just that, "runs", which is defined in bids as well. For all tasks that have only one "run" I don't think we need to specify run; however, for any that have repeats, we should define the first as "run-01", the second as "run-02", etc. It might be best, though (and easier?), to just default to having all tasks defined as "run-01" and then any with repeats as "run-02"; thoughts? @yanbin-niu and @DMRoberts

georgebuzzell commented 3 years ago

Just to clarify, I am suggesting the following naming convention for the files:

sub-[subNum]_task-[taskName]_run-01.[fileExt]
sub-[subNum]_task-[taskName]_run-02.[fileExt]

I am suggesting that ALL files have "run" appended, letting it default to "01", but that any repeats would be "02"

@DMRoberts and @yanbin-niu, agree? Other thoughts? Thanks again for your work here!!

georgebuzzell commented 3 years ago

By the way, I was reading more on BIDS, and there is actually a policy to deal with the redundancy of meta-data files that I mentioned at the last meeting. For example, we don't actually need to write a coordsystem.json file for each task, as long as all tasks use the same coordinate system. Instead, the rule in BIDS is that if you define a given metadata file at a higher-level folder (the dataset level, or subject level), that metadata is "inherited" by all lower folders. So, instead of creating a coordsystem.json file for every task, it can simply be placed in the "sub-[subNum]" folder and named without a task name, such as "sub-01_coordsystem.json", and then a user knows that this applies to all files inside that subject's folder. Similarly, if the same coordsystem.json file applies not only to all tasks for a given subject but also to all subjects, it can be added at the dataset level, without any "sub-[subNum]" key code (i.e., "coordsystem.json"), which means it will be inherited by all subject and all task files (unless otherwise overridden).

Relatedly, you can still include a coordsystem.json file at the subject and/or task level, but you only need to if you want to override something defined at a higher level. For example, you could have a dataset-wide "coordsystem.json" at the dataset level, but then, if the coordinate system differs for subject-05, you would simply include their unique coordsystem file in their "sub-05" folder and be sure to name it specific to that subject (i.e., "sub-05_coordsystem.json").

When we (or other researchers) are reading a bids-formatted dataset, we will write our code such that it loads metadata at the highest level at which it exists and applies that to all analyses, unless a more specific metadata file is identified at a lower level. I think the reasoning behind this philosophy is to help with the organization of the folder structure, reduce redundancy (and the number of files), and reduce the amount of reading/writing of metadata files. This all refers to the "inheritance principle" of BIDS; you can read more in the section entitled "The inheritance principle" here: https://bids-specification.readthedocs.io/en/latest/02-common-principles.html
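To illustrate (a sketch with hypothetical subject numbers), a layout under the inheritance principle might look like:

coordsystem.json                        <- dataset level, inherited by all subjects
sub-01/
    sub-01_ses-01_task-RestingState_run-01_eeg.eeg
sub-05/
    sub-05_coordsystem.json             <- overrides the dataset-level file for sub-05 only
    sub-05_ses-01_task-RestingState_run-01_eeg.eeg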

Based on what I describe above, I think we should take this into account when finalizing the bids conversion script. With that said, if others think it is too much to incorporate this right now, then I am also fine with continuing in the way we have been, as the only downside to our approach is that it is overly redundant, but crucially, it is not wrong. That is, our approach is still bids-compliant and is basically just using bids in a way where we "over-ride" any default metadata for every single file. Please let me know if this all makes sense, and what your preferences are @DMRoberts @Jonhas @yanbin-niu

yanbin-niu commented 3 years ago

@georgebuzzell Do use @DMRoberts's code (02-fif_to_BIDS_dan.ipynb). Before running this file, run 01-mff_to_fif.ipynb and put NDARAB793GL3.mff in the folder containing the .ipynb code. The tiny glitch is that it does not take multiple runs into consideration (but really not a big deal; we can definitely tweak it later).

DMRoberts commented 3 years ago

@yanbin-niu, you are correct that I didn't properly deal with the duplicate of code 82 as you had; in my example I had just overwritten the file with the most recent 82 block, which will have to be changed.

@georgebuzzell I don't think the example here is BIDS compliant: sub-[subNum]_task-[taskName]_run-01.[fileExt]

It is missing the session number. I think session number, subject number, task, and run are always populated, even if they are all the same (i.e., session or run are always 1). We don't specify the filename; we just specify those values and the BIDS exporter generates the filename. Below is an example of what is produced by the export code from yesterday:

[screenshot: exported BIDS filenames including sub, ses, task, and run entities]

@georgebuzzell @yanbin-niu One option of dealing with the duplicate task events, like 82 in the example data, is to increment the run # as was mentioned. One potential issue is that some tasks already have multiple blocks. So for example in this file we have three blocks of the 'contrast change' task, which have three separate start codes. I might expect that those would be three 'runs' of the same contrast change 'task'. But using 'run' to indicate block would then interfere with using 'run' to indicate repeat.

georgebuzzell commented 3 years ago

@DMRoberts and @yanbin-niu Thank you so much for identifying these important edge cases and issues to discuss and resolve; very helpful, and I really appreciate your attention to detail here!!

@DMRoberts I am totally fine to go with the naming convention you suggest (where sub, ses, task, and run) are always specified. My reading of BIDS (which is far from exhaustive, so I could be wrong) is that you actually do not always have to write all of those keys into the file name (though perhaps mne-bids requires them)? Alternatively, perhaps they were not all required in an earlier iteration of bids, but are now required (i.e. see original bids fMRI paper that does not include ses)?

Regardless, I am 100% fine with moving forward with the naming convention as depicted in the image you have pasted (per my prior comment, my understanding is that this reflects a special case of BIDS with redundant naming).

Regarding how to deal with repeat events: I think that is a good point you make about "blocks" being equivalent to "runs". If I understand you correctly, you are suggesting that we should not write repeated blocks as separate tasks, but instead write them as separate runs of the same task. I do agree with doing this. But also, I don't actually think this poses an issue with also referring to a repeat of any task that was not "designed to repeat" (e.g., 82) as an additional run. I think that is exactly what it is, right? The only difference being that it was unplanned and likely occurred because of an issue in the first run of it. I would suggest that we can simply address this potential issue with a metadata file at the dataset level specifying the a priori task design (i.e., which tasks should or should not have repeats), which then allows the user to infer which repeated runs are intentional and which are not. @DMRoberts thoughts?
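For instance (purely illustrative; the file name and fields are hypothetical, not part of the BIDS spec), such a dataset-level file could record the planned number of runs per task, serialized to JSON:

# hypothetical contents of a dataset-level task_design.json
task_design = {
    "SurrSuppBlock1": {"planned_runs": 1},
    "ContrastChangeBlock1": {"planned_runs": 1},
    "Video2": {"planned_runs": 1},  # a run-02 here would indicate an unplanned repeat
}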

Relatedly, if we move forward with saving different blocks as different "runs" of the same "task" (which I think we should) then please note that we need to do this for BOTH the Surround Suppression Paradigm (93 and 97 corresponding to block/run 1&2) AND the Contrast Change Paradigm (94,95,96 corresponding to block/run 1,2,&3).

93 = Surround Suppression Paradigm Block 1; 94 = Contrast Change Paradigm Block 1; 95 = Contrast Change Paradigm Block 2; 96 = Contrast Change Paradigm Block 3; 97 = Surround Suppression Paradigm Block 2;

@DMRoberts and @yanbin-niu curious of your thoughts! More importantly, thanks again for identifying these important issues/edge cases to discuss and resolve; very helpful!!

DMRoberts commented 3 years ago

@georgebuzzell I think you're right that session isn't required. I looked in the spec, and only Subject and Task are actually required it seems: https://bids-specification.readthedocs.io/en/stable/99-appendices/04-entity-table.html

The MNE-BIDS exporter has a field for session, but it seems that if you enter it as None it won't include session, which also then doesn't create the session folder inside the subject folder.
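For example (a minimal sketch; the root path is hypothetical):

import mne_bids

# leaving session unset (it defaults to None) drops the ses entity from the
# filename and skips creating the ses-* folder inside the subject folder
bids_path = mne_bids.BIDSPath(subject='01', task='RestingState', run='01',
                              root='/tmp/bids_test')
print(bids_path.basename)  # sub-01_task-RestingState_run-01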

The issue I see with the repeats is this: say they have 3 blocks of a task, but the second block was started twice by accident, so there are two 'block 2 start' markers. Does the repeated second block marker become 'run 3'? And does the third block marker then become 'run 4'? If someone later wants to match behavioral data to those blocks, I could see it being confusing which 'run' actually corresponds to 'block 3' in the behavioral data.

I guess one other option might be to keep all blocks of a given task together in the same 'task' file.

georgebuzzell commented 3 years ago

@DMRoberts OK, cool. Thank you for confirming that session is not required.

I think it makes sense, given use of the MNE-BIDS exporter to continue specifying ses (session); I only wanted to clarify what was/was not bids-compliant. But again, I agree we should move forward with always defining ses.

Regarding the repeated tasks/blocks, I do not think there is a perfect solution here. However, given that the "blocks" of the same task do not always happen right after each other, and because they each have unique task marker numbers, I think it is better to go back to treating each unique task number (regardless of whether it has the same base task name but a different block number) as distinct tasks. Then, we only treat repeated task numbers as additional "runs". Does that all make sense? Do you agree @DMRoberts ?

Thanks again to you @DMRoberts and @yanbin-niu for your help/discussion here!

DMRoberts commented 3 years ago

@georgebuzzell I think that makes sense: to treat each unique task start event as its own 'task', and to increment the run number in the cases where a start event is duplicated, as was the case with event 82 here.

georgebuzzell commented 3 years ago

Thank you, @DMRoberts!

OK, all (@DMRoberts @Jonhas @SDOsmany @yanbin-niu @DavyNeat @stevenwtolbert @Pranjali051), I think this is our final plan then:

The naming convention for each file will be as follows: sub-[subNum]_ses-[sesNum]_task-[taskName]_run-[runNum].[fileExt]

In the above naming convention:

- sesNum will default to "01" for all data, as there is only one session
- taskName will be the task name corresponding to a unique task marker code, which includes the block #. That is, we will NOT be saving different "blocks" as different "runs" of the same task.
- runNum will default to "01" for the first instance of a unique task marker code. However, any repeats of the same marker code will iterate runNum.
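A minimal sketch of that run-numbering rule (hypothetical event values; assumes block start events are sorted by onset):

from collections import Counter

# hypothetical: (task_code, onset_sample) pairs in onset order
block_events = [(90, 2486), (91, 51474), (82, 61473), (82, 120000)]

run_counts = Counter()
for code, onset in block_events:
    run_counts[code] += 1
    run_num = f"{run_counts[code]:02d}"  # '01' for the first instance, '02' for a repeat
    print(f"task code {code} -> run-{run_num}")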

For each task file, there will be three brain vision formatted files (.eeg, .vhdr, .vmrk), one channels.tsv file, one eeg.json file, one events.tsv file, one electrodes.tsv file, and one coordsystem.json file. This will produce some redundancy in some of the metadata files across tasks, but this is still bids-compliant and totally fine for now.

We are most of the way there; however, I think the next steps are to modify the code in Yanbin's branch (dev-feature-io-yanbin), specifically this code: 02-fif_to_BIDS_dan.ipynb. This code needs to be modified so that repeated runs are not overwritten. Additionally, it needs to be modified to ensure that all metadata files are being written, in line with this other issue: https://github.com/NDCLab/baseEEG/issues/18

Minor comment: I think it would help us keep things straight if we shift towards pushing any code fragments to a branch with your own name and then linking that in your comments, as opposed to ONLY pasting code into the comments. It was just a bit difficult to wrap my head around where "Dan's code" was and where "Yanbin's code" was. Note that what we are calling "Dan's code" here is actually a file in Yanbin's branch! :) @yanbin-niu thank you for regularly pushing your code; very, very helpful! @DMRoberts very, very helpful insights and code here from you (obviously... thank you!); to make sure that your efforts are properly incorporated and not "lost", I think it would be helpful to push your code suggestions in the future. This will also ensure that credit is properly assigned at the end of the project, which I think is very important. However, please do not hesitate to let me know if you disagree and/or have alternative suggestions!

georgebuzzell commented 3 years ago

Final step for feature-io:

Make changes above, and also those listed here: https://github.com/NDCLab/baseEEG/issues/18

When these changes are made, please link both these issues in your pull request.

NOT INCLUDING:
- Additional metadata files at the subject level
- Additional edits to the eeg.json file
- Optional: revising the path structure; revising whether to write/rewrite fif

@DMRoberts

DMRoberts commented 3 years ago

I pushed my branch in case anyone wants to take a look. It assumes that HBN_R1_1_Pheno.csv and NDARAB793GL3.mff are in the same directory as the import code. HBN_R1_1_Pheno.csv contains participant metadata, and is located at: https://drive.google.com/drive/u/0/folders/1e0BrDX6GMlDhR2OiL_m-ogWMEiHKD6JE

I haven't done a pull request yet because the script still seems to be killed when running inside the Docker container, due to memory issues. I'll try to see if there is a way we can alleviate that.

georgebuzzell commented 3 years ago

@DMRoberts thanks so much!! I look forward to testing your code!! I assume I will have the same issue on my local computer, though, as I only have 16 GB RAM, and I think that is your limit too? Hopefully the XSEDE allocation will be approved soon; if so, I can try running there. Alternatively, hopefully the HPCS allows us to access the high-memory nodes soon.

To clarify, your code runs fine outside of the docker, correct?

DMRoberts commented 3 years ago

@georgebuzzell Yea, it does run outside the Docker container. I'm going to look into the Docker settings a bit more, I think it's being Killed due to lack of memory, but I don't get an explicit message about the reason.

DMRoberts commented 3 years ago

@georgebuzzell Ok, I was able to get the script to run in the Docker container.

By default, I believe the Docker host on Mac only allocates 2GB of memory to Docker, which is shared across running containers. I had previously increased this value in an attempt to get the script to run (before we split up writing of the files by task). There is also a --memory parameter when running the Docker container, which I thought did the same thing and which I had been using more recently. However, I now think that the --memory flag actually only places a max limit on the amount of memory; it doesn't increase it beyond what Docker as a whole has been allocated.

So in short, in Docker, in Preferences > Resources > Advanced, I increased the amount of memory Docker can use to 16GB. I could then run the script. 16 GB may not be necessary, maybe 12 GB would work.

The script also took much longer to complete in the Docker container, about 28-29 minutes, vs. about 2 minutes when running natively. I think some of this might be alleviated by putting the data in a Volume:

https://stackoverflow.com/questions/55951014/docker-in-macos-is-very-slow
https://docs.docker.com/storage/volumes/

georgebuzzell commented 3 years ago

@DMRoberts Thank you for noting this. One question though: you suggest using volumes for data access; is that not what you were already doing? I.e. were you not using -v when running the docker? Or are you saying that we need to be more specific in additional parameters for the volume?

DMRoberts commented 3 years ago

@georgebuzzell I ran the Docker container with the command in the readme:

docker run --rm -it -v $(PWD):/projects -w /projects dockerImage:Version bash

The Docker docs say -v means "Bind mount a volume".

The terminology is a bit confusing, because -v is described as 'bind mount a volume', but in other places, it seems that bind mounts are described in contrast to volumes, e.g.:

Here: https://docs.docker.com/storage/volumes/ :

Volumes are the preferred mechanism for persisting data generated by and used by Docker containers. While bind mounts are dependent on the directory structure and OS of the host machine, volumes are completely managed by Docker. Volumes have several advantages over bind mounts:

Or Here: https://docs.docker.com/storage/ :

Volumes are stored in a part of the host filesystem which is managed by Docker (/var/lib/docker/volumes/ on Linux). Non-Docker processes should not modify this part of the filesystem. Volumes are the best way to persist data in Docker.

Bind mounts may be stored anywhere on the host system. They may even be important system files or directories. Non-Docker processes on the Docker host or a Docker container can modify them at any time.

I think what we're currently doing is a bind-mount, which links the host file system to the file system inside the container. This allows someone to edit / save the code in their native environment and run it inside the container, but it also makes loading and saving the data very slow.
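For reference, the two invocations would look something like this (a sketch; the image name matches the readme example and the volume name is hypothetical):

# bind mount (current approach): the container path is backed directly by the host directory
docker run --rm -it -v "$(pwd)":/projects -w /projects dockerImage:Version bash

# named volume for the data directory: storage is managed by Docker, typically much faster on Mac
docker volume create eeg_data
docker run --rm -it -v "$(pwd)":/projects -v eeg_data:/projects/out_data -w /projects dockerImage:Version bash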

georgebuzzell commented 3 years ago

@DMRoberts this is really helpful. It sounds like this is essentially what I brought up at our last meeting on 4/9? Or perhaps I am misunderstanding here?

I see your point that the terminology gets quite confusing here! Whatever we want to call it (and I think we want to call it "volumes"), we want to use docker in such a way that our code running inside docker references directories inside docker, which are not dependent on the local machine's path structure. I think what we need to confirm is exactly what code will allow for that, as there seems to be some confusion as to whether the way we are currently running the docker image does this or not. @Jonhas any thoughts here?

I will look closer at the documentation you sent, @DMRoberts. Very helpful. Thanks!!

georgebuzzell commented 3 years ago

@DMRoberts I ran your code; looks great!! I did not time it exactly, but the difference in time between writing the readme and the last file was about 15 minutes. Definitely slow...

I still need to check on the pheno file.

georgebuzzell commented 3 years ago

@DMRoberts I uploaded a second pheno file, which contains the pheno data for this participant. Please see here: https://drive.google.com/file/d/1xdw4WkX2cjX7cKDNQemTb5TqtsLMYZrQ/view?usp=sharing

Basically, CMI releases these data in batches, and somehow I pulled example files for us to work with from one batch but uploaded the pheno file from another batch; my apologies!