Study Goals: Using 3D CNN representation to accurately capture the spatiotemporal dynamics of EEG signals for emotion classification.
Size of the dataset: 32 participants
Number of channels: 32 EEG channels
Length of the epochs that constitute a single 3D stream: each epoch is 1 second long. Given the sampling frequency of 128 Hz, each epoch consists of 128 samples (one 2D frame per sample).
Architectures used: C3D and R(2+1)D models
The study uses the DEAP dataset, consisting of 32 participants watching 40 one-minute-long music videos.
After each 1-minute viewing, the subject rated their emotion in terms of valence, arousal, dominance, and liking on a scale from 1 to 9.
The study used 32-channel EEG data. It did not incorporate other signals like EOG, EMG, or physiological signals, which are also part of the DEAP dataset.
The study used a preprocessed version of the DEAP dataset obtained by downsampling the EEG data to 128 Hz. Signals were filtered to preserve frequencies within the 4–45 Hz range. Artifacts related to EOG were removed.
For each trial, the first 3 seconds (relax state) were excluded, and the subsequent 60 seconds (1 minute) of data were used.
Each channel of a subject's data is z-score normalized: the channel's mean is subtracted from every data point, and the result is divided by the channel's standard deviation. This transforms each channel to have a mean of 0 and a standard deviation of 1, ensuring that all channels have a comparable scale, which is essential for effective feature extraction and model training.
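A minimal sketch of this normalization, assuming one trial is stored as a (channels, samples) NumPy array (variable names are illustrative):

```python
import numpy as np

def normalize_channels(eeg: np.ndarray) -> np.ndarray:
    """Z-score each channel: subtract its mean, divide by its standard deviation."""
    mean = eeg.mean(axis=1, keepdims=True)  # per-channel mean, shape (32, 1)
    std = eeg.std(axis=1, keepdims=True)    # per-channel std, shape (32, 1)
    return (eeg - mean) / std               # zero mean, unit variance per channel

# e.g. one DEAP trial: 32 channels x 7680 samples (60 s at 128 Hz)
trial = normalize_channels(np.random.randn(32, 7680))
```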
In simple terms, 2D CNNs analyse 2D arrays (images) by applying convolutional filters across the height and width of the input. They excel at capturing spatial patterns such as edges, textures, and shapes. Applied to EEG data, however, a 2D CNN treats each time slice as an independent image, so relationships between frames are never learned directly, and the approach fails to capture how patterns develop over time.
To handle temporal data, 2D CNNs often need to be combined with sequential models like Recurrent Neural Networks (RNNs) or Long Short-Term Memory networks (LSTMs), which add complexity and require separate training processes.
In a 3D CNN, convolution and pooling operations are conducted spatiotemporally, whereas, in 2D CNNs, they are applied only spatially.
External reference: https://link.springer.com/article/10.1007/s11554-021-01161-4
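To make the distinction concrete, here is a small PyTorch illustration (not from the paper) of how 2D and 3D convolutions treat a stack of EEG frames:

```python
import torch
import torch.nn as nn

stream = torch.randn(1, 1, 128, 64, 64)  # (batch, channels, time, height, width)

conv2d = nn.Conv2d(1, 8, kernel_size=3, padding=1)  # slides over H and W only
conv3d = nn.Conv3d(1, 8, kernel_size=3, padding=1)  # slides over T, H and W jointly

# A 2D convolution has to process each of the 128 frames independently:
per_frame = conv2d(stream[:, :, 0])  # one time slice -> (1, 8, 64, 64)

# A 3D convolution mixes neighbouring frames, learning temporal dynamics directly:
spatiotemporal = conv3d(stream)      # -> (1, 8, 128, 64, 64)
```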
Further, each electrode records a one-dimensional signal over time.
Traditionally, the EEG data is represented in matrix form, where rows correspond to different electrodes (channels) and columns correspond to time points.
The international 10-20 system ensures that the physical distances between electrodes are either 10% or 20% of the total distance across the skull. This spatial information is crucial for accurate EEG analysis but is not reflected in the simple 2D matrix.
Thus, traditional 2D representations, which simply number the electrodes linearly, fail to maintain these spatial proximities.
To address this, the study proposes converting 1D EEG data vectors into 2D frames that reflect the spatial distribution of the electrodes.
Each electrode records a one-dimensional signal over time. Given the sampling rate of 128 Hz and the 60-second duration of the video, each electrode captures 7680 samples per trial.
To capture spatial correlations, the signals from the 32 electrodes are mapped onto a 2D plane according to their physical positions on the scalp. Interpolation is used to generate a smooth 2D surface, creating a 2D EEG frame for each time point.
The normalized 1D data vector at timestamp t is converted to a 2D EEG frame of size d × d. In the DEAP dataset, d = 9 to accommodate the 32 electrodes.
More simply, to create a 2D grid, we need a dimension d that can fit all 32 electrodes. A 9×9 grid provides 81 positions, which is more than sufficient to place 32 electrodes while preserving their spatial distribution (relative distances and positions as they are on the scalp).
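A sketch of this vector-to-frame conversion. In the paper, the cell assigned to each electrode follows its 10-20 scalp position; the placement below is randomly generated purely for illustration:

```python
import numpy as np

d = 9  # 9 x 9 = 81 cells, enough to hold 32 electrodes in their relative positions

# One (row, col) cell per electrode, in channel order. Randomly generated here;
# the real map places each electrode according to its 10-20 scalp position.
rng = np.random.default_rng(0)
cells = rng.choice(d * d, size=32, replace=False)
placement = [(int(i) // d, int(i) % d) for i in cells]

def to_frame(sample_t: np.ndarray) -> np.ndarray:
    """Scatter the 32 normalized values at time t into a d x d frame."""
    frame = np.zeros((d, d))
    for value, (r, c) in zip(sample_t, placement):
        frame[r, c] = value
    return frame  # empty cells stay zero until the interpolation step fills them
```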
These 2D EEG frames are concatenated along the time axis, forming a 3D stream of data. This 3D stream encapsulates both spatial and temporal information of the EEG signals.
The length of the time window, denoted by w, determines how many consecutive frames are concatenated to form each EEG stream. In this study, the window is set to 1 second, as previous research has suggested that a 1-second time window is suitable for emotion recognition tasks.
With a sampling rate of 128 Hz, a 1-second time window corresponds to w = 128 frames. Thus, there are 60 EEG streams per trial (as each trial lasts 1 minute).
To match the ratios of the spatial and temporal dimensions, the EEG frames were resized to 64×64 prior to the concatenation.
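Putting the pieces together, a sketch of assembling one trial's 3D streams; scipy's `zoom` stands in for the unspecified resizing/interpolation step, and the placement inside `to_frame` is again a placeholder:

```python
import numpy as np
from scipy.ndimage import zoom

fs, w = 128, 128                      # sampling rate; frames per stream (1 s window)
trial = np.random.randn(32, 60 * fs)  # normalized (channels, samples) trial

def to_frame(sample_t):
    frame = np.zeros((9, 9))
    frame.flat[:32] = sample_t        # placeholder placement, as in the sketch above
    return frame

streams = []
for start in range(0, trial.shape[1], w):           # 60 non-overlapping 1 s windows
    window = trial[:, start:start + w]              # (32, 128)
    frames = [zoom(to_frame(window[:, t]), 64 / 9)  # 9x9 -> 64x64
              for t in range(w)]
    streams.append(np.stack(frames))                # (128, 64, 64)

streams = np.stack(streams)                         # (60, 128, 64, 64)
```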
The study has optimized two specific models based on 3D CNN architectures: C3D and R(2+1)D.
C3D model: applies full 3D convolutions (spanning time, height, and width) in every layer; it was originally developed for action recognition in video.
R(2+1)D model: factorizes each 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution, which adds an extra nonlinearity between the two steps and makes the network easier to optimize.
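To illustrate the difference between the two architectures, here is a sketch of one building block of each, using the 7×3×3 kernel size the study found optimal; the intermediate channel width `mid` is a free parameter here (the original R(2+1)D paper chooses it to match the 3D block's parameter count):

```python
import torch.nn as nn

def c3d_block(cin, cout, t=7, s=3):
    """C3D-style block: one full t x s x s spatiotemporal convolution."""
    return nn.Conv3d(cin, cout, kernel_size=(t, s, s),
                     padding=(t // 2, s // 2, s // 2))

def r2plus1d_block(cin, cout, mid, t=7, s=3):
    """R(2+1)D-style block: 2D spatial conv, nonlinearity, then 1D temporal conv."""
    return nn.Sequential(
        nn.Conv3d(cin, mid, kernel_size=(1, s, s), padding=(0, s // 2, s // 2)),
        nn.ReLU(inplace=True),  # the extra nonlinearity the factorization buys
        nn.Conv3d(mid, cout, kernel_size=(t, 1, 1), padding=(t // 2, 0, 0)),
    )
```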
The study used stochastic gradient descent (SGD) optimization with a minibatch size of 16 and an initial learning rate of 0.01, reduced by a factor of 10 every 10 epochs, for a total of 30 epochs.
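A sketch of this training setup in PyTorch; the model and data below are trivial stand-ins, and mapping "reduced by a factor of 10 every 10 epochs" onto `StepLR` with `gamma=0.1` is my reading of the schedule:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # trivial stand-in for C3D / R(2+1)D
dataset = torch.utils.data.TensorDataset(torch.randn(64, 10),
                                         torch.randint(0, 2, (64,)))
train_loader = torch.utils.data.DataLoader(dataset, batch_size=16)  # minibatch 16

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(30):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()  # lr: 0.01 -> 0.001 (epoch 10) -> 0.0001 (epoch 20)
```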
Two classification tasks were used:
Single-Label Classification (SLC): two separate binary tasks, classifying each trial as high vs. low valence and, independently, as high vs. low arousal.
Multi-Label Classification (MLC): considers the combination of valence and arousal levels, resulting in four distinct classes: high arousal - high valence (HAHV), high arousal - low valence (HALV), low arousal - high valence (LAHV), and low arousal - low valence (LALV). Both dimensions are considered simultaneously when classifying a sample (a label-derivation sketch follows below).
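A sketch of how these labels can be derived from the 1-to-9 ratings, assuming the common midpoint threshold of 5 (the paper's exact cut-off may differ):

```python
import numpy as np

ratings = np.array([[7.1, 3.2],   # (trials, [valence, arousal]) on the 1-9 scale
                    [2.0, 8.5]])

high_valence = ratings[:, 0] > 5   # SLC label for the valence task
high_arousal = ratings[:, 1] > 5   # SLC label for the arousal task

# MLC: four quadrants of the valence-arousal plane
# 0 = LALV, 1 = LAHV, 2 = HALV, 3 = HAHV
mlc = 2 * high_arousal.astype(int) + high_valence.astype(int)
```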
Data Splitting: 5-fold cross-validation scheme, using all subjects' data, splitting into 80% training and 20% testing.
Data Augmentation: Applied Gaussian noise with zero mean and unit variance during training to mitigate the relatively small dataset size.
They compared their model performance with state-of-the-art methods that used handcrafted features like power spectral density (PSD) and differential entropy (DE), and previous CNN-based methods.
For the binary classification task, R(2+1)D achieved the highest accuracy, with 99.11% for valence and 99.74% for arousal. C3D also performed well, with 98.42% for valence and 99.74% for arousal.
For the multi-label classification task, R(2+1)D outperformed with 99.73% accuracy. C3D had 98.28% accuracy.
The study highlighted that simple concatenation of raw data does not guarantee efficient feature extraction by CNNs.
The study also optimized kernel size and input dimension by trying out various values.
Kernel size optimization: The 7×3×3 kernel size was found to be optimal for both models. This differs from the optimal 3×3×3 size found in previous studies on action classification from video clips. The reason is that EEG data has higher temporal resolution but lower spatial resolution compared to video data, necessitating a larger temporal kernel size to extract meaningful temporal features.
Input dimension optimization: The temporal depth of 128 provided the best performance, aligning with previous research. Further, increasing spatial resolution improved the accuracy of both models. Larger spatial resolutions allow for the extraction of more meaningful spatio-temporal features, even though these features are derived from interpolated electrode values.
The DEAP dataset is relatively small compared to datasets used in other domains, such as image and video classification. The researchers used data augmentation techniques to expand the dataset artificially. Specifically, Gaussian noise with zero mean and unit variance was added to the training samples before feeding them into the networks.
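A minimal sketch of this augmentation step:

```python
import numpy as np

def augment(stream: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a noisy copy of a (T, H, W) EEG stream."""
    return stream + rng.normal(loc=0.0, scale=1.0, size=stream.shape)

rng = np.random.default_rng(42)
noisy = augment(np.zeros((128, 64, 64)), rng)  # zero-mean, unit-variance noise
```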
EEG signals are highly variable across different subjects and even within the same subject over time. This variability makes it difficult to train a single model that generalizes well across all subjects. To address this, the study utilized a 5-fold cross-validation scheme. Data from all subjects were included in the training and testing phases, unlike some previous studies that employed subject-wise classification, which can ignore inter-subject variability.
The models used in the study, specifically the C3D and R(2 + 1)D models, are complex in terms of the number of parameters. The C3D model has 53.15 million parameters, while the R(2 + 1)D model has 33.51 million parameters. This high complexity translates to significant computational and memory requirements. Thus, these models are not suitable for real-time applications.
The models trained on the DEAP dataset may not generalize well to new subjects without retraining due to the inherent variability in EEG signal distributions between different individuals. For practical applications, particularly those involving new users, a portion of their EEG data must be incorporated into the training set to ensure accurate emotion recognition.
Our dataset comprises two full-night sleep recordings from each of 20 subjects, with 62 EEG and 2 EOG channels.
EEG measures electrical activity in the brain, providing information about different stages of sleep, while EOG measures eye movements, which can help identify REM sleep.
During preprocessing, we need to handle the additional EOG channels, which might require a different placement in the 2D frame to reflect their spatial relationship accurately.
Similar filtering can be applied, although the specific frequency range might be adjusted based on the requirements of sleep staging.
We need to adapt the existing architecture by adjusting the input size to accommodate all 64 channels (62 EEG + 2 EOG).
Sleep stages often require a longer temporal context, so adjusting the length of 3D streams (e.g., longer than 1 second) might be beneficial.
We can think about effective augmentation strategies specific to sleep data, such as synthetic noise or other transformations, to enhance the training dataset.
Since the study highlighted the limitations of applying the models to real-time applications, we have to think about reducing the computational requirements.
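A back-of-the-envelope sketch of how these adjustments change the input shape; the grid size, sampling rate, and 30-second window below are assumptions for illustration, not settings from the paper or a finalized pipeline:

```python
import math

n_channels = 62 + 2                   # EEG + EOG
d = math.ceil(math.sqrt(n_channels))  # smallest square grid: 8x8 = 64 cells
                                      # (a 9x9 or larger grid leaves slack for
                                      # realistic 10-20 spacing)

fs = 128                              # assumed resampling rate
window_s = 30                         # standard sleep-staging epoch length
w = fs * window_s                     # 3840 frames per stream, vs. 128 for emotion

print(f"stream shape: ({w}, {d}, {d})")  # a much deeper time axis -> heavier
                                         # models, reinforcing the real-time
                                         # concerns discussed above
```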
This paper (added under references) uses 3D CNNs to classify emotions from EEG data. Please go through it and summarize it.
Especially interesting information to extract: study goals, dataset size, number of channels, epoch lengths, and architectures used.
There might be other interesting things, so go through the paper and summarize them here as well.