NDCLab / pepper-pipeline

tool | Python Easy Pre-Processing EEG Reproducible Pipeline
GNU Affero General Public License v3.0

Feature-badchans #25

Closed georgebuzzell closed 3 years ago

georgebuzzell commented 3 years ago

Feature-badchans

- Auto-detect and remove bad channels (those that are "noisy" for a majority of the recording).
- Write to the annotations file to indicate which channels were detected as bad, what metrics indicated they were bad, and whether they were removed.

georgebuzzell commented 3 years ago

@yanbin-niu please do not hesitate to reach out to me about this. Happy to jump on a call or comment here. Thank you!!

georgebuzzell commented 3 years ago

@SDOsmany Please let me know if you need any help here!

SDOsmany commented 3 years ago

@georgebuzzell Hello! For auto-detecting bad channels, I was looking at autoreject, mentioned by MNE here (4th bullet point).

I'm not sure what the outcome of this feature is or what it entails; could you please help clarify that?

georgebuzzell commented 3 years ago

@SDOsmany I have read a bit about autoreject before but have not read the full paper yet. I will do that. Can you please read it as well so we can discuss?

Basically, the purpose of this feature is to detect bad channels, not interpolate them. Channels can be marked bad because they exceed a given voltage threshold, display a lack of correlation to nearby channels, or show abnormal spectral characteristics.

I think the easiest path forward is indeed to use autoreject, as you suggest, to detect the bad channels. The one catch, though, is that we DO NOT want to interpolate those bad channels as part of this feature. The reason is that this step will be run prior to ICA, and if you interpolate the rejected data, it will add non-unique data to the EEG, which will mess up ICA (there are ways around that, but the ideal, imho, is to not interpolate pre-ICA).

So, I would suggest using autoreject to detect bad channels, then removing those bad channels, and writing to the annotations file which channels were rejected, as well as, for each rejected channel, which criteria it exceeded threshold for (and the value for that channel).
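
For illustration, a hypothetical sketch of what such an annotations entry could look like (the file name, schema, and values here are assumptions, not a pipeline spec):

import json

# Hypothetical record of rejected channels: which criteria each exceeded, and whether removed
annotations = {
    "bad_channels": {
        "E62": {"criteria_exceeded": {"variance_zscore": 4.1}, "removed": True},
        "E117": {"criteria_exceeded": {"correlation_zscore": -3.6}, "removed": True},
    }
}
with open("annotations.json", "w", encoding="utf8") as f:
    json.dump(annotations, f, indent=4)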

Again, I think a good next step is to read the autoreject paper they mention in the documentation, and then we go from there. However, please let me know if you have further questions.

@F-said given that you are working on the "final-reject" feature, I think you should also read the autoreject paper and perhaps contribute to this reject-badChans feature as well?

F-said commented 3 years ago

@georgebuzzell @SDOsmany From a look through the autoreject module, it looks like it's possible to automatically detect bad channels without interpolation using:

from autoreject import AutoReject
ar = AutoReject()
ar.fit(epochs)
reject_log = ar.get_reject_log(epochs)

The "bad" channels that autoreject detects will be stored in the reject_log. Then we take over using mne and drop all the bad channels from raw by iterating over the channel names and removing:

for label in reject_log.ch_names:
    raw.drop_channels([label])

A couple of questions @georgebuzzell

Thanks!

yanbin-niu commented 3 years ago

@F-said For your montage file question, try raw.set_montage('standard_1020'). I think George mentioned somewhere that 'standard_1020' is the right montage for our sample data.

georgebuzzell commented 3 years ago

@F-said Thanks; great work here! Also, I see this comment is a few days old; please do remind me about comments I do not respond to within 24 hours (and especially if no response within 48 hours)! :)

For this feature, we are actually trying to detect bad channels based on continuous data (before segmenting/epoching). Please note that autoreject could be used on epoched data for later stages of bad channel detection based on segmented data, such as in prepping for ICA (@yanbin-niu ) or for the final reject stage (@F-said ).

However, we want to reject based on continuous data for this stage, not epoched data. I believe it should be possible to run autoreject on continuous data, but I have not checked. Can you please confirm that is the case, @F-said ?

Yes, it would be fine to just mark channels as bad but still retain the data. In fact, that would be preferred! As a general principle, it is ideal to process the data in such a way that we create a bunch of annotations for users to use as they see fit at a later stage (i.e., to actually remove vs. not remove bad chans). That said, we would want to be sure that bad channels are not included in the ICA, interpolation, or re-referencing analyses of later steps. Please check how other features would deal with the presence of channels that are just marked as bad. It might be easier to just remove them... but check and follow up here?
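
For reference, a minimal MNE sketch of the "mark but retain" approach (assuming bad_chans holds the detected channel names):

import mne

# Mark rather than drop: the data stay in the Raw object
raw.info['bads'].extend(bad_chans)

# Later steps can then exclude the marked channels explicitly,
# e.g. picking only good EEG channels before ICA
picks = mne.pick_types(raw.info, meg=False, eeg=True, exclude='bads')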

Sorry, I am not sure I totally understand the question about channel, channel label, and channel type? Not that it is a bad question; it's just that this topic gets tricky, so I want to make sure I am answering the right question! :) Basically, "electrode" refers to the physical recording location and its name (electrode/channel label); "channel" refers to the recorded signal from a given electrode, and is defined by the electrode location, the electrode/channel name, and any other aspects of the recorded signal, such as impedance. "Channel type" refers to whether a recorded channel (the data over time) comes from EEG, MEG, etc. Those are the formal definitions, as far as MNE uses them (or my understanding of how they use them). Fieldtrip uses a similar set of formal definitions. HOWEVER, in the EEG field and for other software, e.g. EEGLAB, "channel" and "electrode" are treated as the same thing. So, often when I say "channel" or "electrode" I am using these as interchangeable with each other, and interchangeable with "channel name". I will try to get better about this... but I "grew up" on EEGLAB, where the distinction was not as crucial, or at least not as crucial for our purposes. However, I will try to be more precise going forward! :)

For the montage file, @yanbin-niu is correct that standard_1020 should work; however, please correct me if you think easycap-m1 is preferred for some reason for these data (could be the case). Please see the full discussion on this issue https://github.com/NDCLab/baseEEG/issues/29 for how to identify the proper montage (and how we need code to pull the proper montage based on the BIDS data; @Jonhas may or may not have already written some/all of this, so check with him as well?).

SDOsmany commented 3 years ago

@georgebuzzell Sorry for not reminding you

I did some research and found this GitHub issue where the person has a similar problem to ours, I think: here

I was looking through the AutoReject documentation and I couldn't find a function that didn't use epochs. I assume what DFPV is suggesting on the GitHub issue here is to use make_fixed_length_events to turn the data into one really long epoch with all the data (please clarify this if you can) and pass it to the autoreject function that expects epochs? This doesn't seem to me like it fixes our problem, but since I don't understand it much I'm leaving it here if you guys wanna take a look.
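
For reference, a minimal sketch of that fixed-length-segment workaround (the 1 s segment duration is an assumption):

import mne

# Chop the continuous Raw into fixed-length pseudo-epochs
events = mne.make_fixed_length_events(raw, id=1, duration=1.0)
epochs = mne.Epochs(raw, events, tmin=0.0, tmax=1.0, baseline=None, preload=True)
# `epochs` can then be passed to autoreject, which expects epoched data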

@F-said thank you so much for the help with this issue if there is anything you need let me know. I will keep looking for a solution.

F-said commented 3 years ago

@yanbin-niu @georgebuzzell Thanks for catching me up on montage. @Jonhas if this is something you've been looking at, would a function to automatically select MNE-provided montage files from BIDS metadata be acceptable?

I can make a PR for proper code-review and further additions if the following example looks like something we could start at:

import json
import os
import sys

import mne


def get_montage(cap_json: dict) -> mne.channels.DigMontage:
    """Function to find the proper montage from a json file of metadata.

    @ throws:
        - "Exception" if "cap_json" is not an instance of "dict" that contains "CapManufacturersModelName"
    @ returns: montage whose channel-name set is a superset of the "raw" dataset's channel-name set
    """
    try:
        # Get cap manufacturer model name
        model_name = cap_json["CapManufacturersModelName"]
        # Standardize the model name for comparison
        model_name = model_name.lower().replace("-", "").replace("_", "").replace(" ", "")
    except Exception:
        print('"cap_json" data must be a dict that contains cap model details. Exiting')
        sys.exit(1)

    # Get the set of standard montage files provided by mne
    montage_dir = os.path.join(os.path.dirname(mne.__file__), 'channels', 'data', 'montages')
    montage_files = os.listdir(montage_dir)

    # Default to "standard_1020" if no exact match is found
    handle = "standard_1020"
    for file in montage_files:
        # Strip the extension and standardize the montage name
        montage_name = os.path.splitext(file)[0]
        montage_match = montage_name.lower().replace("-", "").replace("_", "").replace(" ", "")

        # If an exact match is found, select it and stop searching
        if montage_match == model_name:
            handle = montage_name
            break
    return mne.channels.make_standard_montage(handle)


with open("raw_data/task-matchingpennies_eeg.json", encoding="utf8") as file:
    cap_json = json.load(file)
montage = get_montage(cap_json)
print(montage)

F-said commented 3 years ago

Thanks for the link @SDOsmany. As the dev suggested, it looks like fitting entire channels into epochs would be a valid way to annotate continuous data if we want to use autoreject, especially considering that the rejection threshold is selected strictly using epochs:

[screenshot: autoreject documentation on threshold estimation]

We could then use the resultant boolean list that you mentioned previously (the one that had 900 elements in total and 58 marked "true") to mark which channels were detected as "bad", except this time it will only have 10 elements.


@georgebuzzell It looks like future pipeline steps such as ICA will include bad channels unless explicitly removed, so we can then call:

raw.drop_channels(bad_chans)

where bad_chans contains our automatically detected channels from autoreject.

Lastly, thank you for the clarification on channel definitions @georgebuzzell. Upon a closer look, the docs use "channel type" as you specified, to indicate whether the channel contains MEG or EEG data.

georgebuzzell commented 3 years ago

@F-said The montage lookup code looks good! Please do a pull request!

georgebuzzell commented 3 years ago

@F-said and @SDOsmany I finally had a chance to look closer at autoreject and the fixed-length segment workaround. In short, this does not appear to be a viable solution for us. What it is doing is chopping up the data into segments and then running autoreject. However, we want to keep the data continuous at this point in the processing (within this feature).

@F-said I am not sure I totally follow your most recent comment on how to use autoreject here. Can you please clarify whether you are segmenting the continuous channel data into epochs or not?

I should say, regardless, that I think we will likely want to explore the use of autoreject for the final-rejection stage; however, I don't think we want to use it (at least based on my current understanding, which could change) for this badChans feature, nor for the ICA feature. Instead, I think it might have value at the finalRej feature stage. Again, my understanding could change, though...

SDOsmany commented 3 years ago

@georgebuzzell can you please expand on the idea you had last meeting for using FASTER?

I found this discussion while looking into FASTER; maybe it will be helpful: discussion

I also came across this MNE function for finding bad channels. I need to test it, but it might be useful: mne documentation

georgebuzzell commented 3 years ago

Here is the code for faster in python: https://gist.github.com/wmvanvliet/d883c3fe1402c7ced6fc

@F-said @SDOsmany @yanbin-niu

F-said commented 3 years ago

Thanks for the links @SDOsmany. I tried find_bad_channels_maxwell(raw) on my end and ran into some exceptions. On closer look, it appears that this method should only be used on neuromagnetic data:

Maxwell filtering was originally developed for Elekta Neuromag® systems, and should be considered experimental for non-Neuromag data. See the Notes section of the maxwell_filter() docstring for details.

But following the FASTER discussion, I imported the gist and ran it on the new BIDS file, which seems to work. I just needed to replace the deprecated MNE functions: nanmean with np.nanmean and find_outliers with mne.preprocessing.bads._find_outliers. Also, some light runtime warnings need to be inspected on my end.


In fact, creating a function with full-fledged documentation that simply calls this function would be a little redundant, so the bad-channel detection step can be condensed to:

from faster import faster_bad_channels

bad_channels = faster_bad_channels(epochs)
print(bad_channels)

assuming all other pipeline steps were executed. Made PR.

georgebuzzell commented 3 years ago

Excellent work @F-said and @SDOsmany !

georgebuzzell commented 3 years ago

@F-said and @SDOsmany I went back and checked the paper on FASTER and can confirm that the first step of FASTER runs on continuous data. If you look at the FASTER code you all pulled in, on lines 97-100 you can see that they actually concatenate the epochs into one long time series before running the analyses. So, you would just want to modify the code so that instead of expecting epochs as input, only continuous data is expected. That should work fine, but let me know if I can provide more clarity here about what I mean.
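
A minimal sketch of the change described here (the helper name is hypothetical, and the gist's internals may differ):

import numpy as np

def _as_continuous(inst):
    """Return data as (n_channels, n_times) from either Raw or Epochs."""
    data = inst.get_data()
    if data.ndim == 3:  # Epochs: (n_epochs, n_channels, n_times)
        # what the gist effectively does: concatenate epochs into one long series
        data = np.concatenate(list(data), axis=-1)
    return data  # Raw already yields (n_channels, n_times)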

One thing to note is that I am unsure how much this Python implementation of FASTER has been tested in general, so we will likely want to do more tests for this feature. Relatedly, it is unclear to me whether the Python code is accounting for channel location or not; if not, we will need to add that in. But for now, if you are able to just get the code running on continuous data, that would be an excellent first step.

I went ahead and closed the pull request that you had open. Please make these changes and then open a new pull request. Thank you!!

F-said commented 3 years ago

@SDOsmany The FASTER algorithm now runs on continuous raw data. I modified lines 97-100 & the parameters. It produces a different set of bad channels when running on continuous data.

I need to check whether any function expectations are violated and remove the RuntimeWarnings, but it is working so far. Running faster_bad_channels on feat-io produces:

Generated bad channels ['Cz', 'E101', 'E113', 'E117', 'E54', 'E62', 'E64', 'E66', 'E68', 'E72', 'E75', 'E76', 'E84', 'E85', 'E89', 'E95']

However, both the epoch version and the raw version of faster_bad_channels() fail the recall tests. It does not detect the channels marked bad by default:

True bad channels ['FC5', 'FC6']
Generated bad channels []
F
======================================================================
FAIL: test_recall (__main__.FasterTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:/Users/saidmf/Desktop/NDC/baseEEG/base_eeg/bad_chans_test.py", line 69, in test_recall
    self.assertGreaterEqual(recall, threshold,
AssertionError: 0.0 not greater than or equal to 0.9 : FASTER recall lower than threshold

Going to look more into the algorithm to find behavioral errors, but will this be a problem for the pipeline? @georgebuzzell

georgebuzzell commented 3 years ago

@F-said I would not worry too much if it is failing to find the channels marked bad in the matching pennies dataset.

A better test will be for me to go through a few of the feature-io output files and manually identify channels that it should identify as bad, then see if FASTER identifies these. This is a more appropriate test, as FASTER will likely behave differently on low- vs. high-density EEG data.

So, for now, let's table further dev of this feature until you and I look at some files and manually identify some bad chans.

@F-said and @SDOsmany and anyone else that wants to look at data together manually, please let me know. It would be great to do this within the next couple of days. Before then, please table dev of this feature and work on others.

@yanbin-niu you might be interested to look at the data with us as well??

F-said commented 3 years ago

Slight amendment to the last comment! faster_bad_channels has great recall if variance sensitivity is increased by lowering the standard deviation threshold.


SDOsmany commented 3 years ago

@F-said Wow, this is an amazing catch; really nice work on this. Just curious, what number are you using for the threshold? @georgebuzzell maybe the threshold can be a variable in user_params.json?

@georgebuzzell I can help to identify bad channels manually. Can you explain how this can be done, if you don't mind?

yanbin-niu commented 3 years ago

@georgebuzzell Sure! Very happy to do so. Will also take a look at the Faster algorithm.

Btw, like @SDOsmany, I am also not clear on how to identify bad channels manually. Or do you want to double-check the results from EEGLAB? I have EEGLAB installed and I will take a look at how to import our BIDS data into EEGLAB (do you think EEGLAB accepts the BIDS format?)

F-said commented 3 years ago

@SDOsmany I lowered the variance threshold to 2; it seems any float < 3 is sufficient to detect the manually found bad channels.

yanbin-niu commented 3 years ago

@georgebuzzell I tried to import our BIDS data into EEGLab, but it cannot pass the BIDS validation (it's weird, since we did not change any time format while exporting the BIDS data).

So, I ended up using our NDARAB793GL3.mff rather than the data for a specific task.

A quick question related to the FASTER plugin in EEGLAB. I looked at the manual and set it up step by step. Are the interpolated channels the bad channels? I assume in FASTER we need a 0.3 Hz filter rather than a 1 Hz filter, correct?


I just want to double-check the results against the ones that @F-said got. But they are not comparable, since I used the whole NDARAB793GL3.mff and I guess @F-said just used the data from the first task?

yanbin-niu commented 3 years ago

@georgebuzzell About the threshold that @F-said mentioned: in EEGLab FASTER, the default values are all 3, so I guess we should go with the default value of 3 as well?

georgebuzzell commented 3 years ago

@SDOsmany @F-said @yanbin-niu Thank you for your work here!

In short, 3 sd is what I have typically used. It makes sense, though, that a z score of 2 was needed for an EEG dataset with only a few channels (the matching pennies dataset). However, yes, we would want to go with 3 for the higher-density data.

To @SDOsmany's point, yes, that is definitely a parameter we want people to be able to modify, with the default at 3.
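
A hypothetical sketch of reading such a parameter (the file name and key are assumptions, not pipeline spec):

import json

with open("user_params.json", encoding="utf8") as f:
    user_params = json.load(f)

# Fall back to the default of 3 standard deviations when unset
threshold = user_params.get("bad_chan_zscore_threshold", 3)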

@yanbin-niu that is odd about not being able to import properly into EEGLAB due to a BIDS validator failure. It seems there may be an issue with the BIDS formatting of our data, @DMRoberts ? We should discuss this further.

@yanbin-niu I think you can technically use 0.3 or 1 Hz for FASTER, but yes, I have typically used 0.3.

yanbin-niu commented 3 years ago

@georgebuzzell Some updates:

  1. I checked out the FASTER paper:
    • Before detecting the bad channels, FASTER does referencing and filtering.
  2. Based on the points I just mentioned, with the EEGLab FASTER plugin, I set the high-pass filter to 1 Hz, no low-pass filter, no notch filter, and re-referenced to 'E11' (I just selected this since I feel it's closer to Fz; please correct me if I should not use E11). The results are:

32 55 62 68 107 117 have been identified as bad channels.

  3. Building on @F-said 's code, I deleted 'kurtosis' and 'line_noise', since I do not think those two algorithms are used in FASTER, at least from what I read in the paper. So the script just uses 'variance', 'correlation' and 'hurst'. Then, as with the EEGLab FASTER plugin, I set the high-pass filter to 1 Hz, no low-pass filter, no notch filter, and re-referenced to 'E11'. Then I ran faster_bad_channels(raw). The results are:

62, 64, 107 and 117 have been identified as bad channels, acknowledging the warning.

In sum, I do not think the python code does exactly the same thing. The FASTER paper mentions:

The above parameters were corrected for reference offset using the method described in (Junghöfer et al., 2000).

In the python code you provided, I did not see any correction. I assume this might be one reason.

I did this based on my understanding, and it could be totally wrong. Please feel free to correct me if I am not on the right track.

Questions:

  1. Should we use a notch filter before detecting bad channels?

yanbin-niu commented 3 years ago

@georgebuzzell Some other updates:

  1. With EEGLab, I used the .mff raw data (since I did not find how to import .fif files into EEGLab..... and BIDS cannot be used either); with the python code, I used the .fif data that was exported from the .mff raw data. I am not sure whether this could be one source of differences (I personally do not think so).

  2. I checked the .m file. Though I do not understand the correction algorithm or the Matlab language, it seems that FASTER does use channel positions for correction.

So, this could be one source of differences.

  3. The thing I did not understand: for "Hurst", EEGLab FASTER does not do any correction, but the bad channels that were detected ONLY by "Hurst" were 62, 68, 107 and 117. By comparison, in the python version of FASTER, no bad channels were detected ONLY by "Hurst".

So, they are different in some other ways.

georgebuzzell commented 3 years ago

@yanbin-niu wow, this is so helpful! Thank you!

Yes, it does look like the original FASTER algorithm adjusts for distance from the reference and that the python FASTER code we have does not. We will need to update this. It looks like the EEGLAB FASTER code is calling several Matlab functions to do this; we will have to look at these and see how hard it would be to translate to existing python functions (e.g. scipy) or write our own.

I think it would not be a bad idea for us to reach out to both: 1) the person who wrote the FASTER python code, and 2) the original authors of FASTER. I can do that and cc you on the email?

Yes, we should filter first. High-pass for sure, and a low-pass of 40 Hz. Let's avoid a notch filter, at least for now, as there can be issues with a notch filter.
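
In MNE, that filtering step could look like this minimal sketch (assuming raw is a preloaded mne.io.Raw; cutoffs per the discussion above):

# High-pass at 0.3 Hz, low-pass at 40 Hz, no notch filter
raw_filtered = raw.copy().filter(l_freq=0.3, h_freq=40.0)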

Yes, when I did my initial tests with the EEGLAB FASTER years ago, I always re-referenced to Fz because that is what the FASTER paper recommended, though I am not sure how important that really is. For now, yes, not a bad idea to do that, though we may very well drop that step later, as re-referencing early in the processing stream can cause other issues.

Yes, we only want to use variance, correlation and Hurst. Please note that FASTER is actually a full preprocessing pipeline, but we only want to use the channel rejection based on these three parameters.

For the possible dataset differences, I was not totally clear. Are you using the really long datasets with all the tasks when running in EEGLAB, but then only using one of the task files for python testing? Or are the lengths of the files the same, and only the format different? If the latter, then I don't see any issue, but the former would be. Although, actually, we would want to be sure that periods of no data between tasks are not being assessed by the algorithm, and that could differ between formats.

Again, awesome work here!

yanbin-niu commented 3 years ago

@georgebuzzell Sorry for not being clear about the dataset differences. Yes, they are just different in format. Specifically, I used the whole dataset (.mff) in EEGLab and also used the whole dataset (.fif exported from .mff without any further processing) for the python version.

Reaching out to both authors is a great idea. Happy to be cc'ed.

georgebuzzell commented 3 years ago

OK, great. Good to know that, @yanbin-niu !

This could still lead to differences, though, if the EEGLAB faster is ignoring the periods of missing data, or accounting for the discontinuity in the recording in some way, and python is not.

When you import the data in EEGLAB and plot it, can you see the periods of time with a bunch of zeros where the recording is paused? Also, do you see any "boundary" markers? Also, in the python FASTER code, is there any code to identify the presence of boundary markers? And is there anywhere in the python code that deals with discontinuities in the EEG?

One easy solution, to be sure none of this is an issue, is to load in the file and then cut it, so that you are only looking at one task period, with none of the zeros. If you do that the same for both the EEGLAB data and the MNE python data, then run both FASTER algorithms, we can be sure the data are being dealt with in the same way. Does that make sense? If not, please let me know! :)

yanbin-niu commented 3 years ago

@georgebuzzell Thanks for pointing out a potential source of the differences. Actually, I am not sure how to view the whole dataset or how to check "boundary" markers in EEGLab. I just set the time range to 6000s and 600s respectively, as plotted below:

[screenshots: 6000 s and 600 s plot windows]

Clearly, I can see several boundaries in the 6000s time window. In the 600s time window, I cannot see "zeros". So I think you are right that EEGLAB FASTER is ignoring the periods of missing data??

One question before I cut the data and try both ways: do you know whether EEGLab can import .fif format data? I ask since MNE can only save to the .fif format (https://mne.tools/stable/auto_tutorials/raw/10_raw_overview.html#sphx-glr-auto-tutorials-raw-10-raw-overview-py):

Due to limitations in the .fif file format (which MNE-Python uses to save Raw objects), ......

I just want to make sure that both ways work from exactly the same data format. I did some searching on Google early on, but did not find any information on how to import the .fif format into EEGLab.

Oh, another minor thing: EEGLab FASTER only accepts .set (correct?). So, I think I can cut the data in EEGLab, save the cut data in the .set format in EEGLab, then import the .set data in MNE and run the python code. Does this make sense? Do you think it is a feasible way?

Though I have never used EEGLab to cut data, I assume its data-selection tool can be used to cut the data based on time, correct?


Or, if there is a way to import .fif data into EEGLab, I could cut the data in MNE, which I think is safer, just in case EEGLab does further processing/cleaning. Does this make sense?
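
For reference, a minimal sketch of the MNE side of that round trip (the file name and crop times are placeholders):

import mne

# Read the segment that was cut and saved as .set in EEGLab
raw = mne.io.read_raw_eeglab("raw_cropped.set", preload=True)

# Or cut in MNE directly: keep one boundary-free task period
raw.crop(tmin=11.0, tmax=160.0)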

georgebuzzell commented 3 years ago

@yanbin-niu Yes, it looks like EEGLAB ignores those periods. So it would be great to cut a period that does not have any boundaries and use the same segment in both EEGLAB and python.

"Oh, another minor thing: the EEGLab FASTER only receive .set (correct?). So, I think I can cut the data in EEGLab. And save the cut data as .set data format in EEGLab. Then, I can import .set data in mne and run the python code. Does this make sense? Do you think it is a feasible way?" -yes, this should work great!!

And yes, pop_select() is what we use to select a piece of (i.e., cut) the data; perfect!!

I am not sure if you can import .fif into EEGLAB. Maybe? But I am not sure. For now, though, I think the above method should work. Basically, just try to find a fairly long piece of data (at least one minute, but ideally more like 10 mins) in EEGLAB that does not include any boundary markers. Then select just that piece of data and save it. Then bring that same piece into python. Run that piece through the python and EEGLAB versions of FASTER and see what we get.

Does that make sense? Again, very impressed by your work! :)

Please note that the above test will be very helpful. But even if it leads to more similar results across EEGLAB and python, we are still going to need to add the adjustment for channel locations to the python code. One step at a time, though... and I think running the above test will be a very helpful first step.

thank you!!

yanbin-niu commented 3 years ago

@georgebuzzell Some updates:

  1. Based on what Dan worked on, I checked out the first task's events.tsv (sub-NDARAB793GL3_ses-01_task-ContrastChangeBlock1_run-01_events.tsv)

So, I cut the data in EEGLab from 11 s to 160 s, though I think EEGLab had already ignored the data before 10.016 s (so, correspondingly, I cut the data before 179.794 s, about 160 s). Note that there is always a "boundary" at the very beginning of the data.

Then I saved the data as .set.

  2. I used FASTER in EEGLab, setting the high-pass to 1 Hz, the low-pass to 40 Hz, no notch filter, and re-referencing to "E11". The result is:

17 25 62 125 have been identified as "bad channels".

  3. I ran the code in python:
# import raw data
raw_path = 'raw_cropped.set'
raw = mne.io.read_raw_eeglab(raw_path)

# filter
raw.load_data()
raw_filtered = raw.copy().filter(l_freq=1, h_freq=40)

# re-reference
raw_new_ref = raw_filtered.copy().set_eeg_reference(ref_channels=['E11'])       

# mark and list bad channels
bad_channels = faster_bad_channels(raw_new_ref)
print(bad_channels)

Then the result is that 25 and 62 have been identified.

Both are using variance, correlation and hurst.

yanbin-niu commented 3 years ago

@georgebuzzell I took a close look at the EEGLab version of FASTER and I think I basically understand how it works. However, I have a problem in terms of coordinates:

For each channel's position, it seems that EEGLab also has parameters in spherical coordinates.


Based on these, they calculate the distances between electrodes:

for chan2tst = eeg_chans;
    for q=eeg_chans
        distmatrixpol(chan2tst,q)=sqrt(((EEG.chanlocs(chan2tst).radius^2)+(EEG.chanlocs(q).radius^2))-(2*((EEG.chanlocs(chan2tst).radius)*...
            (EEG.chanlocs(q).radius)*cosd(EEG.chanlocs(chan2tst).theta - EEG.chanlocs(q).theta))));
    end
end

I wanted to do the same thing with our python code. However, I just cannot find those parameters, and the montage only has x, y and z. Of course, we can calculate the straight-line distance based on x, y and z, but ideally we want the distance along a curved surface. Am I being clear about this problem? Do you have any thoughts off the top of your head?

yanbin-niu commented 3 years ago

@georgebuzzell some updates:

  1. I calculated the spherical coordinates based on x, y and z. Then I calculated the polar coordinates based on the EEGLab code. (ps, FASTER uses polar coordinates to calculate the distance between the reference and the other electrodes):
chs_x = np.array([loc['loc'][0] for loc in raw.info['chs']])
chs_y = np.array([loc['loc'][1] for loc in raw.info['chs']])
chs_z = np.array([loc['loc'][2] for loc in raw.info['chs']])
sph_radius = np.sqrt(chs_x**2 + chs_y**2 + chs_z**2)
sph_phi = (np.arctan2(chs_z, np.sqrt(chs_x**2 + chs_y**2)))/np.pi*180
sph_theta = (np.arctan2(chs_y,chs_x))/np.pi*180
theta = sph_theta * (-1)
radius = 0.5 - sph_phi/180
chanlocs = pd.DataFrame({'x':chs_x, 'y':chs_y, 'z':chs_z, 
                         'theta':theta, 
                         'radius':radius})

In case you are interested, I pasted the relevant script from "sph2topo.m", which is from EEGLab rather than FASTER:

[screenshot: sph2topo.m excerpt]

Then I got exactly the same polar coordinates as FASTER.

  2. I calculated the distances between the reference electrode and the other electrodes, based on the FASTER code:
ref_theta = chanlocs.iloc[128]['theta']
ref_radius = chanlocs.iloc[128]['radius']
f=lambda x:np.sqrt(x['radius']**2 + ref_radius**2 
                   - 2*x['radius']*ref_radius*np.cos(x['theta']/180*np.pi-ref_theta/180*np.pi))
chanlocs['distance'] = chanlocs.apply(f,axis=1)

Then I got exactly the same results as FASTER.

  3. Correct for distances:

Correlations:

# correlations corrected for distance from reference electrode
raw_data = raw.get_data()
raw_data_ref_deleted = np.delete(raw_data, 128, axis=0)
chns_cor = np.nanmean(np.ma.masked_array(abs(np.corrcoef(raw_data_ref_deleted)),
                                         np.identity(len(raw_data_ref_deleted), dtype=bool)), axis=0)
reg_cor = np.polyfit(chanlocs.drop([len(chanlocs)-1], inplace=False)['distance'], chns_cor, 2)
fitcurve_cor = np.polyval(reg_cor, chanlocs.drop([len(chanlocs)-1], inplace=False)['distance'])
corrected_cor = chns_cor - fitcurve_cor
bads_cor = [raw.ch_names[i] for i in _find_outliers(corrected_cor, 3)]

I got the results:

[screenshot: python results]

In EEGLAB version FASTER for correlation: 17 25 62. (I will explain the differences later)


Variance:

# variance corrected for distance from reference electrode
chns_var = np.var(raw_data, axis=1)
reg_var = np.polyfit(chanlocs['distance'], chns_var, 2)
fitcurve_var = np.polyval(reg_var, chanlocs['distance'])
corrected_var = chns_var - fitcurve_var
bads_var = [raw.ch_names[i] for i in _find_outliers(corrected_var)]

Then I got the results: None. In EEGLAB version FASTER for variance: None.


My guesses for the differences between the python version of FASTER and the EEGLab version of FASTER:

  1. When I import the cropped data into EEGLab and python, they are the same.

  2. After I filtered with MNE at 1 and 40 Hz, and also set the FASTER filter to 1 and 40 Hz, the data were different.


So, I guess this might be the source of differences.


For hurst, the initial python code and FASTER use different algorithms.


Let me know your thoughts on this @georgebuzzell and whether I am on the right track.

yanbin-niu commented 3 years ago

Another strange thing I encountered:

1. When I read the GSN-HydroCel-129 montage for the data in EEGLab, the coordinates are exactly the same.


2. When I read the GSN-HydroCel-129 montage in MNE, the coordinates are strange.


Btw, this does not affect the bad-channel results above, since I did not use the montage file ('GSN-HydroCel-129') for either. My plan was to use the coordinates in the montage file, but because of this, I ended up using the default read-in coordinates, which are the same.

yanbin-niu commented 3 years ago

@georgebuzzell Some updates on computing the correlations and variances based on yesterday's discussion:

I filtered the cropped data at 0.3 and 40 Hz in EEGlab, then used the filtered data for both the Python and Matlab versions of FASTER. So this time I started from exactly the same raw data for both.

  1. Variances: quadratic correction for distance from the reference electrode, using np.polyfit and np.polyval:
reg_var = np.polyfit(chanlocs['distance'], chns_var, 2)
fitcurve_var = np.polyval(reg_var,chanlocs['distance'])
corrected_var = chns_var - fitcurve_var

Here, I got exactly the same results in Python and Matlab.

Then I used these values to calculate the z-scores and identify the outliers. (Here I modified the code a little bit to use max_iter=1.)

bads_var = [raw.ch_names[i] for i in _find_outliers(corrected_var, threshold=3.0, max_iter=1, tail=0)]

Then I got exactly the same results. The Matlab FASTER also identified those channels as bad in terms of variance values (the list is too long to show).

In short, the variance works perfectly.

  2. Correlations -- In short, though the results from python and Matlab are the same, I think there is a bug in Matlab FASTER. Sounds impossible... so, if there is any chance, @DMRoberts would you please help confirm this? Or anyone who has EEGLab installed, @F-said @SDOsmany ?

Matlab FASTER sorts the distances at the very beginning. Though unnecessary (since polyfit does not need sorted x), this would not be a problem if the sequence were eventually converted back to the initial order (chn1, chn2, ... chn129). For example, with variance:

vars is the list of variance values for channels 1, 2, ... 129. fitcurve(idist_inds) is the fit values for channels 1 to 129; idist_inds, the index of channels after sorting, helps convert back to the initial order (chn1, chn2, ... chn129).

So, with variances, there is no problem.

However, with correlations:

Similar to variances, fitcurve(idist_inds) is the values for channels 1, 2, 3, ... 129. However, mcorrs(dist_inds) messes the sequence up: mcorrs is the values for channels 1, 2, 3, ... 129, so mcorrs should not be sorted again (just as vars was not sorted).

To confirm this point, the printouts for fitcurve(idist_inds) in Matlab and fitcurve_cor in my code are exactly the same.

However, the last step, calculating the differences between the real values and the fit values, is different between Matlab and Python.

This did not surprise me, since the sequences were not matched in the Matlab code. It is as if you subtract the fit value of channel 1 from the real value of channel 3. That does not look right.
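
A toy numpy illustration of that misalignment (the variable semantics are assumed from the Matlab code):

import numpy as np

dists = np.array([0.3, 0.1, 0.2])      # distance to reference, in channel order
mcorrs = np.array([10.0, 20.0, 30.0])  # per-channel values, in channel order
dist_inds = np.argsort(dists)          # channel order -> sorted-by-distance order
idist_inds = np.argsort(dist_inds)     # inverse: sorted order -> channel order

p = np.polyfit(dists[dist_inds], mcorrs[dist_inds], 1)
fitcurve = np.polyval(p, dists[dist_inds])        # fit values in sorted order

fixed = mcorrs - fitcurve[idist_inds]             # both sides in channel order
buggy = mcorrs[dist_inds] - fitcurve[idist_inds]  # sorted minus channel order: misaligned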

@DMRoberts the Matlab code in my screenshot is from "channel_properties.m".

yanbin-niu commented 3 years ago

@DMRoberts Here is the cropped data after being filtered at 0.3 and 40 Hz: https://drive.google.com/file/d/1BWXlstz3sKeof2alMlxqe6eMBEnZXOPE/view?usp=sharing

And here is my code (sorry for some magic numbers, I will refine it afterwards -- 128 is the reference channel):

import mne
import numpy as np
import pandas as pd
from mne.preprocessing.bads import _find_outliers

# import raw data
raw_path = 'raw_cropped_filtered.set'
raw = mne.io.read_raw_eeglab(raw_path)

# convert to spherical and polar coordinates 
chs_x = np.array([loc['loc'][1] for loc in raw.info['chs']])
chs_y = np.array([loc['loc'][0]*(-1) for loc in raw.info['chs']])
chs_z = np.array([loc['loc'][2] for loc in raw.info['chs']])
sph_radius = np.sqrt(chs_x**2 + chs_y**2 + chs_z**2)
sph_phi = (np.arctan2(chs_z, np.sqrt(chs_x**2 + chs_y**2)))/np.pi*180
sph_theta = (np.arctan2(chs_y,chs_x))/np.pi*180
theta = sph_theta * (-1)
radius = 0.5 - sph_phi/180

chanlocs = pd.DataFrame({'x':chs_x, 'y':chs_y, 'z':chs_z, 
                         'theta':theta, 
                         'radius':radius})

# Distance between electrodes and reference electrode 
ref_theta = chanlocs.iloc[128]['theta']
ref_radius = chanlocs.iloc[128]['radius']
f=lambda x:np.sqrt(x['radius']**2 + ref_radius**2 
                   - 2*x['radius']*ref_radius*np.cos(x['theta']/180*np.pi-ref_theta/180*np.pi))
chanlocs['distance'] = chanlocs.apply(f,axis=1)

# Variance
raw_data = raw.get_data()
chns_var = np.var(raw_data, axis=1)
# correct for distance from reference electrode
reg_var = np.polyfit(chanlocs['distance'], chns_var, 2)
fitcurve_var = np.polyval(reg_var,chanlocs['distance'])
corrected_var = chns_var - fitcurve_var
bads_var = [raw.ch_names[i] for i in _find_outliers(corrected_var, threshold=3.0, max_iter=1, tail=0)]
bads_var

# correlations
chns_cor = np.nanmean(abs(np.corrcoef(raw_data)), axis=0)
chns_cor[128] = np.nanmean(chns_cor)
# corrected for distance from reference electrode
reg_cor = np.polyfit(chanlocs['distance'], chns_cor, 2)
fitcurve_cor = np.polyval(reg_cor,chanlocs['distance'])
corrected_cor = chns_cor - fitcurve_cor
bads_cor = [raw.ch_names[i] for i in _find_outliers(corrected_cor, threshold=3.0, max_iter=1, tail=0)]
bads_cor

georgebuzzell commented 3 years ago

@yanbin-niu you continue to impress... I understand what you are saying in terms of the potential bug, but I want to take a closer look here before commenting further.

yanbin-niu commented 3 years ago

@DMRoberts To save your time, I do not think you need to read my code and test the data. Just helping read the Matlab code would be very helpful! I assume you have FASTER installed? (Or you can get it from the EEGLab extensions.) I have a link for channel_properties.m, just in case you need it: https://drive.google.com/file/d/1gHJ--cCgDa0meF7HtDYqR_u-PdTU6uLo/view?usp=sharing

The problem is on line 65.

BTW, one point I forgot to mention: in Matlab, I tried changing mcorrs(dist_inds) into "mcorrs", and then I got the same results as my code.

DMRoberts commented 3 years ago

@yanbin-niu @georgebuzzell I think you're right - but I also just checked the FASTER website to see if this code was the most up to date, and it seems that coincidentally someone must have reported the same bug to them within the last few months! The other code in the FASTER package seemingly hasn't been updated in almost 10 years.

I couldn't find a GitHub page for the project, but on their SourceForge, after downloading the newest .zip: https://sourceforge.net/projects/faster/

The channel_properties.m file has a changelog:


%line 69 edited by RW Feb 2021 following bug report (thanks to Rick Wassing
%for spotting). 
%ORIGINAL LINE: corrected = mcorrs(dist_inds) - fitcurve(idist_inds);
%CORRECTED LINE: corrected = mcorrs - fitcurve(idist_inds);

I'm not sure the EEGLAB plug-in version has been updated to reflect the bug fix yet, though? I'm not sure what the process is for EEGLAB plug-ins to be updated.

yanbin-niu commented 3 years ago

@DMRoberts Thanks for taking time to look at this!

…Sorry, I did not realize the version would be a problem. I used the one that was downloaded from the EEGLab extensions.

I just checked this link. I think I did visit this website early on, but did not realize that the version from the EEGLab extensions is not the most up to date.

Anyway, many thanks for your great help! @DMRoberts

georgebuzzell commented 3 years ago

@DMRoberts many, many thanks for taking a look at this, confirming, and also, finding out that there has been a bug fix (though maybe not for the eeglab version).

@yanbin-niu once again, it is so impressive how deep you have gotten into eeg processing. Really, this is fantastic work you have done here. Not only do you now fully understand faster, and have implemented it in python, but you even discovered a bug (at least for one version of the code). So impressive; many thanks to you for your motivation to push into this and fully understand!

@yanbin-niu I think you would agree that the next steps are to now go ahead and compare your python implementation of FASTER to the most updated version that Dan mentioned. Please let us know how these compare on the test dataset. With that said, I was searching for the best way to notify the authors of the FASTER EEGLAB plugin about the bug, and I somehow just found another implementation of FASTER for python! My sincere apologies, @yanbin-niu, as I did a lot of searching and never found this one before, and now only found it by accident. Previously, all I could find was the incomplete code I sent. However, this profile lists python implementations of both FASTER and ADJUST: https://github.com/mdelpozobanos?tab=overview&from=2019-12-01&to=2019-12-31

@yanbin-niu do you mind comparing the most recent version of FASTER that Dan linked to both your python implementation and this other implementation?

Also, please note that we will of course want to also look at the python ADJUST implementation soon!

georgebuzzell commented 3 years ago

@DMRoberts do you have recommendations on communicating the possible EEGLAB-plugin FASTER bug? It is not clear to me whether that goes to the FASTER authors, or whether someone else authored the plugin.

yanbin-niu commented 3 years ago

@georgebuzzell Thanks for the kind words and encouragement! I also learned a lot from this process!

A quick update on the comparison between the most recent version of FASTER and the python implementation:

  1. For correlation and variance, I got completely the same results in Matlab and Python, including the values and the final bads that are identified.
  2. For hurst, I also got the same results (with the code where I used the Matlab FASTER algorithm); with the initial python code, the results are different. I will do more research on the code you mentioned, and maybe the code I mentioned, and let you make the final decision among these algorithms.

I will take a look at the new code you just mentioned and keep you updated!

DMRoberts commented 3 years ago

@georgebuzzell I think EEGLAB pulls the plugins from a collection of .zip files that EEGLAB maintains:

I believe this is the function that gets the list of plug-ins https://github.com/sccn/eeglab/blob/c57dfeea8cc2dd09c381df4350568cebe9f979be/functions/adminfunc/plugin_getweb.m

Which directs to this list: https://sccn.ucsd.edu/eeglab/plugin_uploader/plugin_list_all.php

If you compare the .zip in the EEGLAB plug-in directory to the .zip on the SourceForge page, they are almost exactly the same, except:

1) The SourceForge version has the bug that Yanbin identified corrected, edited on March 5, 2021.
2) The 'eegplugin_FASTER.m' file was edited in the EEGLAB version on December 23, 2020.

The SourceForge version also has that eegplugin file; it was just last edited on January 28, 2011. I'm not sure who maintains the version in the EEGLAB plug-in directory; the contact says 'Nolan Hugh' but the email linked there is Arno's from EEGLAB (arnodelorme@gmail.com). Probably the best thing to do is to file an issue on the EEGLAB Github or email Arno about the bug correction. Another strange thing is that the version number of FASTER didn't seem to be changed when the bug was corrected; both the version in EEGLAB and the version on SourceForge have the same version number, 1.2.3b.

DMRoberts commented 3 years ago

@yanbin-niu In your examples the relative variance values seem the same between the MATLAB and Python versions, but the Python values are numerically smaller, i.e., -483.9418 in MATLAB vs. -4.839e-10 in Python. Is one doing the computations in volts, and the other in microvolts?

I think it shouldn't matter for the purposes of Faster (as you showed in your comparisons) since the values are all relative to each other, but I was just curious about why there was a difference.

yanbin-niu commented 3 years ago

@DMRoberts Yes! MNE uses volts and EEGLab uses µV. After importing the data, the difference is already in the raw data. I actually had the same question; I guess they have different default units. But, as you said, this does not impact the FASTER results.
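
A one-line sketch of the unit relationship (MNE returns volts; 1 V = 1e6 µV):

# Scale MNE's volts to microvolts to match EEGLab's values
data_uv = raw.get_data() * 1e6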

georgebuzzell commented 3 years ago

Thank you, @yanbin-niu and @DMRoberts !

@yanbin-niu This is really great that you are getting the same results now in your python implementation as with the Matlab implementation. I think we should move forward with your implementation for now, as I trust it much more at this point! ;) I think it will be good to look at that second python implementation that I sent, but no rush on that, and no need to switch; yours seems to work quite well. And again, this is really impressive! So great to have you helping us here! Let's use your current implementation for integration with the baseEEG pipeline. As we move forward and do more tests, we can always modify/optimize.

@yanbin-niu how do you feel about posting the bug report on the EEGLAB Github? Or, do you prefer that I make that post (and @ you to give you credit)?