Serendipityzzz opened 1 year ago
Let $X \in \mathbb{R}^{C \times T}$ be a segment of a brain recording of a given subject while she listens to a speech segment of the same duration, where C is the number of channels/sensors and T the number of time steps. Let $Y \in \mathbb{R}^{F \times T}$ be the latent representation of speech, here the Mel spectrogram with F frequency bands. Supervised decoding then consists of finding a decoding function $f_{\text{reg}}: \mathbb{R}^{C \times T} \rightarrow \mathbb{R}^{F \times T}$ such that $f_{\text{reg}}$ predicts Y given X. We denote by $\hat{Y} = f_{\text{reg}}(X)$ the representation of speech decoded from the brain; $f_{\text{reg}}$ can be a deep neural network, and a regression loss then looks like
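for example, a standard mean-squared-error objective over frequency bands and time steps (a generic choice written out here for concreteness, not necessarily the paper's exact formulation):

$$\mathcal{L}_{\text{reg}}(Y, \hat{Y}) = \frac{1}{F\,T} \sum_{f=1}^{F} \sum_{t=1}^{T} \left( Y_{f,t} - \hat{Y}_{f,t} \right)^2$$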
But this regression loss faces several challenges: decoding predictions tend to be dominated by a non-distinguishable broadband component when speech is present. To address this, Meta AI made three main contributions: the introduction of a contrastive loss, a pre-trained deep speech representation, and a dedicated brain decoder.
If we want to change the loss function, we can use NCE (noise-contrastive estimation): https://zhuanlan.zhihu.com/p/334772391
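A minimal PyTorch sketch of such a CLIP-style / InfoNCE contrastive objective, assuming the brain decoder outputs a latent `z_brain` and the speech module outputs a latent `z_speech` of the same shape (function and variable names here are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(z_brain, z_speech, temperature=0.1):
    """CLIP-style / InfoNCE loss: each brain segment must identify the
    matching speech segment among the other segments in the batch.

    z_brain:  (B, D) latent predicted from the brain recording
    z_speech: (B, D) latent of the true speech segment
    """
    # L2-normalize so the dot product is a cosine similarity
    z_brain = F.normalize(z_brain, dim=-1)
    z_speech = F.normalize(z_speech, dim=-1)

    # (B, B) similarity matrix: entry (i, j) compares brain i with speech j
    logits = z_brain @ z_speech.t() / temperature

    # The positive pair for row i is column i
    targets = torch.arange(z_brain.shape[0], device=z_brain.device)

    # Symmetric cross-entropy: brain -> speech and speech -> brain
    loss = 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
    return loss


# Hypothetical usage with random tensors
z_brain = torch.randn(32, 256)   # batch of 32 brain latents
z_speech = torch.randn(32, 256)  # matching speech latents
print(clip_style_contrastive_loss(z_brain, z_speech))
```

With this objective the model only needs to make the correct speech segment more similar to the brain latent than the other candidates in the batch, rather than reconstruct every spectrogram value, which sidesteps the broadband-component issue of the regression loss.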
A convolutional neural network stacked onto a 'Subject Layer' and trained with a contrastive objective to predict the deep representations of the audio waveform learnt by a dedicated module pretrained on 56k hours of speech.
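A minimal sketch of what such a subject-conditioned convolutional brain decoder could look like (the layer sizes, names, and the per-subject 1x1 channel remapping are assumptions for illustration, not the exact architecture from the paper):

```python
import torch
import torch.nn as nn

class SubjectLayer(nn.Module):
    """Per-subject channel remapping (an assumed, simplified version of the
    paper's 'Subject Layer'): one learned (C, C) mixing matrix per subject."""
    def __init__(self, n_subjects, n_channels):
        super().__init__()
        self.weights = nn.Parameter(torch.eye(n_channels).repeat(n_subjects, 1, 1))

    def forward(self, x, subject_idx):
        # x: (B, C, T), subject_idx: (B,) integer subject ids
        w = self.weights[subject_idx]          # (B, C, C)
        return torch.bmm(w, x)                 # remap channels per subject

class BrainDecoder(nn.Module):
    """Illustrative conv stack mapping (B, C, T) MEG/EEG to (B, D, T) latents."""
    def __init__(self, n_subjects, n_channels, hidden=320, latent_dim=256):
        super().__init__()
        self.subject_layer = SubjectLayer(n_subjects, n_channels)
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(hidden, latent_dim, kernel_size=1),
        )

    def forward(self, x, subject_idx):
        x = self.subject_layer(x, subject_idx)
        return self.conv(x)                    # (B, latent_dim, T)

# Hypothetical usage: 2 subjects, 208 sensors, 3-second segments at 120 Hz
model = BrainDecoder(n_subjects=2, n_channels=208)
x = torch.randn(4, 208, 360)
subjects = torch.tensor([0, 1, 0, 1])
print(model(x, subjects).shape)  # torch.Size([4, 256, 360])
```

The decoder's output latents would then be compared against the pretrained speech module's representations with the contrastive loss above, so the same network can be trained across subjects while the Subject Layer absorbs per-subject sensor differences.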