Study Goals: Using 3D CNN representation to accurately capture the spatiotemporal dynamics of EEG signals for emotion classification.
Size of the dataset: 32 participants
Number of channels: 32 EEG channels
Length of the epochs that constitute a single 3D stream: each epoch is 1 second long. Given the sampling frequency of 128 Hz, each epoch consists of 128 samples (one 2D frame per sample).
Architectures used: C3D and R(2+1)D models
The study uses the DEAP dataset, consisting of 32 participants watching 40 one-minute-long music videos.
After each 1-minute viewing, the subject rated their emotion in terms of valence, arousal, dominance, and liking on a scale from 1 to 9.
The study used 32-channel EEG data. It did not incorporate other signals like EOG, EMG, or physiological signals, which are also part of the DEAP dataset.
The study used a preprocessed version of the DEAP dataset obtained by downsampling the EEG data to 128 Hz. Signals were filtered to preserve frequencies within the 4–45 Hz range. Artifacts related to EOG were removed.
For each trial, the first 3 seconds (relax state) were excluded, and the subsequent 60 seconds (1 minute) of data were used.
Each channel of a subject's data is z-score normalized: the channel's mean is subtracted from every data point, and the result is divided by the channel's standard deviation. This transforms each channel to have a mean of 0 and a standard deviation of 1, ensuring that all channels have a comparable scale, which is essential for effective feature extraction and model training.
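A minimal sketch of this normalization, assuming one trial is stored as a (channels, samples) NumPy array (variable names are illustrative):

```python
import numpy as np

def normalize_channels(eeg: np.ndarray) -> np.ndarray:
    """Z-score each channel: subtract its mean, divide by its standard deviation."""
    mean = eeg.mean(axis=1, keepdims=True)  # per-channel mean, shape (32, 1)
    std = eeg.std(axis=1, keepdims=True)    # per-channel std, shape (32, 1)
    return (eeg - mean) / std               # zero mean, unit variance per channel

# e.g. one DEAP trial: 32 channels x 7680 samples (60 s at 128 Hz)
trial = normalize_channels(np.random.randn(32, 7680))
```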
In simple terms, 2D CNNs analyse 2D arrays (images) by applying convolutional filters across the height and width of the input. They excel at capturing spatial patterns such as edges, textures, and shapes. Applied to EEG data, however, a 2D CNN treats each time slice as an independent image, so relationships between frames are never learned directly, and the approach fails to capture how patterns develop over time.
To handle temporal data, 2D CNNs often need to be combined with sequential models like Recurrent Neural Networks (RNNs) or Long Short-Term Memory networks (LSTMs), which add complexity and require separate training processes.
In a 3D CNN, convolution and pooling operations are conducted spatiotemporally, whereas, in 2D CNNs, they are applied only spatially.
External reference: https://link.springer.com/article/10.1007/s11554-021-01161-4
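To make the distinction concrete, here is a small PyTorch illustration (not from the paper) of how 2D and 3D convolutions treat a stack of EEG frames:

```python
import torch
import torch.nn as nn

stream = torch.randn(1, 1, 128, 64, 64)  # (batch, channels, time, height, width)

conv2d = nn.Conv2d(1, 8, kernel_size=3, padding=1)  # slides over H and W only
conv3d = nn.Conv3d(1, 8, kernel_size=3, padding=1)  # slides over T, H and W jointly

# A 2D convolution has to process each of the 128 frames independently:
per_frame = conv2d(stream[:, :, 0])  # one time slice -> (1, 8, 64, 64)

# A 3D convolution mixes neighbouring frames, learning temporal dynamics directly:
spatiotemporal = conv3d(stream)      # -> (1, 8, 128, 64, 64)
```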
Further, each electrode records a one-dimensional signal over time.
Traditionally, the EEG data is represented in matrix form, where rows correspond to different electrodes (channels) and columns correspond to time points.
The international 10-20 system ensures that the physical distances between electrodes are either 10% or 20% of the total distance across the skull. This spatial information is crucial for accurate EEG analysis but is not reflected in the simple 2D matrix.
Thus, traditional 2D representations, which simply number the electrodes linearly, fail to maintain these spatial proximities.
To address this, the study proposes converting 1D EEG data vectors into 2D frames that reflect the spatial distribution of the electrodes.
Each electrode records a one-dimensional signal over time. Given the sampling rate of 128 Hz and the 60-second duration of the video, each electrode captures 7680 samples per trial.
To capture spatial correlations, the signals from the 32 electrodes are mapped onto a 2D plane according to their physical positions on the scalp. Interpolation is used to generate a smooth 2D surface, creating a 2D EEG frame for each time point.
The normalized 1D data vector at timestamp t is converted to a 2D EEG frame of size d × d. In the DEAP dataset, d = 9 to accommodate the 32 electrodes.
More simply, to create a 2D grid, we need a dimension d that can fit all 32 electrodes. A 9×9 grid provides 81 positions, which is more than sufficient to place 32 electrodes while preserving their spatial distribution (relative distances and positions as they are on the scalp).
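A sketch of this vector-to-frame conversion. In the paper, the cell assigned to each electrode follows its 10-20 scalp position; the placement below is randomly generated purely for illustration:

```python
import numpy as np

d = 9  # 9 x 9 = 81 cells, enough to hold 32 electrodes in their relative positions

# One (row, col) cell per electrode, in channel order. Randomly generated here;
# the real map places each electrode according to its 10-20 scalp position.
rng = np.random.default_rng(0)
cells = rng.choice(d * d, size=32, replace=False)
placement = [(int(i) // d, int(i) % d) for i in cells]

def to_frame(sample_t: np.ndarray) -> np.ndarray:
    """Scatter the 32 normalized values at time t into a d x d frame."""
    frame = np.zeros((d, d))
    for value, (r, c) in zip(sample_t, placement):
        frame[r, c] = value
    return frame  # empty cells stay zero until the interpolation step fills them
```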
These 2D EEG frames are concatenated along the time axis, forming a 3D stream of data. This 3D stream encapsulates both spatial and temporal information of the EEG signals.
The length of the time window, denoted by w, determines how many consecutive frames are concatenated to form each EEG stream. In this study, the window is set to 1 second, as previous research has suggested that a 1-second time window is suitable for emotion recognition tasks.
With a sampling rate of 128 Hz, a 1-second time window corresponds to w = 128 frames. Thus, there are 60 EEG streams per trial (as each trial lasts 1 minute).
To match the ratios of the spatial and temporal dimensions, the EEG frames were resized to 64×64 prior to the concatenation.
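Putting the pieces together, a sketch of assembling one trial's 3D streams; scipy's `zoom` stands in for the unspecified resizing/interpolation step, and the placement inside `to_frame` is again a placeholder:

```python
import numpy as np
from scipy.ndimage import zoom

fs, w = 128, 128                      # sampling rate; frames per stream (1 s window)
trial = np.random.randn(32, 60 * fs)  # normalized (channels, samples) trial

def to_frame(sample_t):
    frame = np.zeros((9, 9))
    frame.flat[:32] = sample_t        # placeholder placement, as in the sketch above
    return frame

streams = []
for start in range(0, trial.shape[1], w):           # 60 non-overlapping 1 s windows
    window = trial[:, start:start + w]              # (32, 128)
    frames = [zoom(to_frame(window[:, t]), 64 / 9)  # 9x9 -> 64x64
              for t in range(w)]
    streams.append(np.stack(frames))                # (128, 64, 64)

streams = np.stack(streams)                         # (60, 128, 64, 64)
```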
The study has optimized two specific models based on 3D CNN architectures: C3D and R(2+1)D.
C3D model: applies full 3D convolutions (spanning time, height, and width) in every layer; it was originally developed for action recognition in video.
R(2+1)D model: factorizes each 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution, which adds an extra nonlinearity between the two steps and makes the network easier to optimize.
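To illustrate the difference between the two architectures, here is a sketch of one building block of each, using the 7×3×3 kernel size the study found optimal; the intermediate channel width `mid` is a free parameter here (the original R(2+1)D paper chooses it to match the 3D block's parameter count):

```python
import torch.nn as nn

def c3d_block(cin, cout, t=7, s=3):
    """C3D-style block: one full t x s x s spatiotemporal convolution."""
    return nn.Conv3d(cin, cout, kernel_size=(t, s, s),
                     padding=(t // 2, s // 2, s // 2))

def r2plus1d_block(cin, cout, mid, t=7, s=3):
    """R(2+1)D-style block: 2D spatial conv, nonlinearity, then 1D temporal conv."""
    return nn.Sequential(
        nn.Conv3d(cin, mid, kernel_size=(1, s, s), padding=(0, s // 2, s // 2)),
        nn.ReLU(inplace=True),  # the extra nonlinearity the factorization buys
        nn.Conv3d(mid, cout, kernel_size=(t, 1, 1), padding=(t // 2, 0, 0)),
    )
```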
The study used stochastic gradient descent (SGD) optimization with a minibatch size of 16 and an initial learning rate of 0.01, reduced by a factor of 10 every 10 epochs, for a total of 30 epochs.
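A sketch of this training setup in PyTorch; the model and data below are trivial stand-ins, and mapping "reduced by a factor of 10 every 10 epochs" onto `StepLR` with `gamma=0.1` is my reading of the schedule:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # trivial stand-in for C3D / R(2+1)D
dataset = torch.utils.data.TensorDataset(torch.randn(64, 10),
                                         torch.randint(0, 2, (64,)))
train_loader = torch.utils.data.DataLoader(dataset, batch_size=16)  # minibatch 16

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(30):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()  # lr: 0.01 -> 0.001 (epoch 10) -> 0.0001 (epoch 20)
```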
Two classification tasks were used:
Single-Label Classification (SLC): two separate binary tasks, classifying each trial as high vs. low valence and, independently, as high vs. low arousal.
Multi-Label Classification (MLC): considers the combination of valence and arousal levels, resulting in four distinct classes: high arousal - high valence (HAHV), high arousal - low valence (HALV), low arousal - high valence (LAHV), and low arousal - low valence (LALV). Both dimensions are considered simultaneously when classifying a sample (a label-derivation sketch follows below).
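A sketch of how these labels can be derived from the 1-to-9 ratings, assuming the common midpoint threshold of 5 (the paper's exact cut-off may differ):

```python
import numpy as np

ratings = np.array([[7.1, 3.2],   # (trials, [valence, arousal]) on the 1-9 scale
                    [2.0, 8.5]])

high_valence = ratings[:, 0] > 5   # SLC label for the valence task
high_arousal = ratings[:, 1] > 5   # SLC label for the arousal task

# MLC: four quadrants of the valence-arousal plane
# 0 = LALV, 1 = LAHV, 2 = HALV, 3 = HAHV
mlc = 2 * high_arousal.astype(int) + high_valence.astype(int)
```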
Data Splitting: 5-fold cross-validation scheme, using all subjects' data, splitting into 80% training and 20% testing.
Data Augmentation: Applied Gaussian noise with zero mean and unit variance during training to mitigate the relatively small dataset size.
They compared their model performance with state-of-the-art methods that used handcrafted features like power spectral density (PSD) and differential entropy (DE), and previous CNN-based methods.
For the binary classification task, R(2+1)D achieved the highest accuracy, with 99.11% for valence and 99.74% for arousal. C3D also performed well, with 98.42% for valence and 99.74% for arousal.
For the multi-label classification task, R(2+1)D outperformed with 99.73% accuracy. C3D had 98.28% accuracy.
The study highlighted that simple concatenation of raw data does not guarantee efficient feature extraction by CNNs.
The study also optimized kernel size and input dimension by trying out various values.
Kernel size optimization: The 7×3×3 kernel size was found to be optimal for both models. This differs from the optimal 3×3×3 size found in previous studies on action classification from video clips. The reason is that EEG data has higher temporal resolution but lower spatial resolution compared to video data, necessitating a larger temporal kernel size to extract meaningful temporal features.
Input dimension optimization: The temporal depth of 128 provided the best performance, aligning with previous research. Further, increasing spatial resolution improved the accuracy of both models. Larger spatial resolutions allow for the extraction of more meaningful spatio-temporal features, even though these features are derived from interpolated electrode values.
The DEAP dataset is relatively small compared to datasets used in other domains, such as image and video classification. The researchers used data augmentation techniques to expand the dataset artificially. Specifically, Gaussian noise with zero mean and unit variance was added to the training samples before feeding them into the networks.
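A minimal sketch of this augmentation step:

```python
import numpy as np

def augment(stream: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a noisy copy of a (T, H, W) EEG stream."""
    return stream + rng.normal(loc=0.0, scale=1.0, size=stream.shape)

rng = np.random.default_rng(42)
noisy = augment(np.zeros((128, 64, 64)), rng)  # zero-mean, unit-variance noise
```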
EEG signals are highly variable across different subjects and even within the same subject over time. This variability makes it difficult to train a single model that generalizes well across all subjects. To address this, the study utilized a 5-fold cross-validation scheme. Data from all subjects were included in the training and testing phases, unlike some previous studies that employed subject-wise classification, which can ignore inter-subject variability.
The models used in the study, specifically the C3D and R(2 + 1)D models, are complex in terms of the number of parameters. The C3D model has 53.15 million parameters, while the R(2 + 1)D model has 33.51 million parameters. This high complexity translates to significant computational and memory requirements. Thus, these models are not suitable for real-time applications.
The models trained on the DEAP dataset may not generalize well to new subjects without retraining due to the inherent variability in EEG signal distributions between different individuals. For practical applications, particularly those involving new users, a portion of their EEG data must be incorporated into the training set to ensure accurate emotion recognition.
Our dataset comprises two full-night sleep recordings from each of 20 subjects, with 62 EEG and 2 EOG channels.
EEG measures electrical activity in the brain, providing information about different stages of sleep, while EOG measures eye movements, which can help identify REM sleep.
During preprocessing, we need to handle the additional EOG channels, which might require a different placement in the 2D frame to reflect their spatial relationship accurately.
Similar filtering can be applied, although the specific frequency range might be adjusted based on the requirements of sleep staging.
We need to adapt the existing architecture by adjusting the input size to accommodate all 64 channels (62 EEG + 2 EOG).
Sleep stages often require a longer temporal context, so adjusting the length of 3D streams (e.g., longer than 1 second) might be beneficial.
We can think about effective augmentation strategies specific to sleep data, such as synthetic noise or other transformations, to enhance the training dataset.
Since the study highlighted the limitations of applying the models to real-time applications, we have to think about reducing the computational requirements.
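A back-of-the-envelope sketch of how these adjustments change the input shape; the grid size, sampling rate, and 30-second window below are assumptions for illustration, not settings from the paper or a finalized pipeline:

```python
import math

n_channels = 62 + 2                   # EEG + EOG
d = math.ceil(math.sqrt(n_channels))  # smallest square grid: 8x8 = 64 cells
                                      # (a 9x9 or larger grid leaves slack for
                                      # realistic 10-20 spacing)

fs = 128                              # assumed resampling rate
window_s = 30                         # standard sleep-staging epoch length
w = fs * window_s                     # 3840 frames per stream, vs. 128 for emotion

print(f"stream shape: ({w}, {d}, {d})")  # a much deeper time axis -> heavier
                                         # models, reinforcing the real-time
                                         # concerns discussed above
```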
This paper (added under references) uses 3D CNNs to classify emotions from EEG data. Please go through it and summarize it.
Especially interesting information to extract: study goals, dataset size, number of channels, epoch lengths, and architectures used.
There might be other interesting things, so go through the paper and summarize them here as well.