MikeWangWZHL / EEG-To-Text

code for AAAI2022 paper "Open Vocabulary Electroencephalography-To-Text Decoding and Zero-shot Sentiment Classification"

Dataset Preprocessing #10

Open hamza13-12 opened 3 months ago

hamza13-12 commented 3 months ago

Hello. As far as I understand, you are storing the data in a pandas dataframe with one column corresponding to the EEG signals and the other to the text, and then converting EEG signals to text, correct? Could you elaborate on how you achieved this dataset format so that others can organize the dataset the same way?

MikeWangWZHL commented 3 months ago

Hi! Sorry, I am not sure what you mean by pandas. The data preprocessing scripts can be found in scripts/prepare_dataset.sh; for example, util/construct_dataset_mat_to_pickle_v1.py converts the ZuCo v1.0 .mat file into a .pickle file, which is like a Python dictionary.
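
For readers who have not worked with pickle files, here is a minimal sketch of how one could be opened and inspected. The file name, the `summarize_pickle` helper, and the toy dictionary layout are all placeholders for illustration; they are not the actual structure written by the preprocessing script.

```python
import os
import pickle
import tempfile


def summarize_pickle(path):
    """Load a pickle file and return (top-level type name, sorted dict keys)."""
    with open(path, "rb") as f:
        obj = pickle.load(f)
    # Only dictionaries have keys; anything else yields an empty list.
    keys = sorted(obj.keys()) if isinstance(obj, dict) else []
    return type(obj).__name__, keys


# Toy stand-in for a preprocessed dataset (hypothetical layout, not the repo's).
toy = {"subject_A": [{"content": "a sentence", "eeg": [0.1, 0.2]}]}
demo_path = os.path.join(tempfile.mkdtemp(), "toy_dataset.pickle")
with open(demo_path, "wb") as f:
    pickle.dump(toy, f)

kind, keys = summarize_pickle(demo_path)
print(kind, keys)
```

Pointing `summarize_pickle` at the real .pickle output should show which top-level keys are present, which is a reasonable first step before reshaping the data. Only unpickle files you created or trust, since loading a pickle can execute arbitrary code.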

hamza13-12 commented 3 months ago

Pandas is a data analysis library in Python used to build dataframes. I was actually asking for instructions on how to build the dataset in a format where one column corresponds to the EEG signals and another to the text, so that I can create seq2seq models that take EEG as input and generate text.

hamza13-12 commented 3 months ago

Actually, I figured it out! After creating train_set and dev_set, I just used this snippet of code:

import pandas as pd

def dataset_to_dataframe(dataset):
    # Initialize lists to hold data
    input_embeddings_list = []
    seq_len_list = []
    input_attn_mask_list = []
    input_attn_mask_invert_list = []
    target_strings_list = []
    sent_level_EEG_list = []

    # Iterate through the dataset
    for i in range(len(dataset)):
        input_embeddings, seq_len, input_attn_mask, input_attn_mask_invert, target_string, sent_level_EEG = dataset[i]

        # Convert tensors to numpy arrays
        input_embeddings_np = input_embeddings.numpy()
        input_attn_mask_np = input_attn_mask.numpy()
        input_attn_mask_invert_np = input_attn_mask_invert.numpy()
        sent_level_EEG_np = sent_level_EEG.numpy()

        # Append to lists
        input_embeddings_list.append(input_embeddings_np)
        seq_len_list.append(seq_len)
        input_attn_mask_list.append(input_attn_mask_np)
        input_attn_mask_invert_list.append(input_attn_mask_invert_np)
        target_strings_list.append(target_string)
        sent_level_EEG_list.append(sent_level_EEG_np)

    # Create DataFrame
    df = pd.DataFrame({
        'Input Embeddings': input_embeddings_list,
        'Sequence Length': seq_len_list,
        'Input Attention Mask': input_attn_mask_list,
        'Input Attention Mask Invert': input_attn_mask_invert_list,
        'Target String': target_strings_list,
        'Sentence Level EEG': sent_level_EEG_list
    })

    return df

# Convert datasets to dataframes
train_df = dataset_to_dataframe(train_set)
dev_df = dataset_to_dataframe(dev_set)
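
One follow-up that may be useful: because each cell in these dataframes holds a whole numpy array, writing them to CSV would turn the arrays into strings. A minimal sketch of saving and reloading with pickle instead, using a small toy dataframe (the array shapes and the file name are made up for illustration):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy dataframe mimicking two of the columns above (shapes are arbitrary).
toy_df = pd.DataFrame({
    "Input Embeddings": [np.zeros((3, 4)), np.ones((3, 4))],
    "Target String": ["first sentence", "second sentence"],
})

# Pickle keeps the numpy arrays inside each cell intact.
path = os.path.join(tempfile.mkdtemp(), "toy_df.pkl")
toy_df.to_pickle(path)
restored = pd.read_pickle(path)

# Check that every array survived the round trip unchanged.
same_arrays = all(
    np.array_equal(a, b)
    for a, b in zip(toy_df["Input Embeddings"], restored["Input Embeddings"])
)
print(restored.shape, same_arrays)
```

The same two calls should work for train_df and dev_df, assuming they fit in memory.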