Serendipityzzz opened 1 year ago
Let $X \in \mathbb{R}^{C \times T}$ be a segment of a brain recording of a given subject while she listens to a speech segment of the same duration, where C is the number of channels/sensors and T the number of time steps. Let $Y \in \mathbb{R}^{F \times T}$ be the latent representation of speech, here the Mel spectrogram with F frequency bands. Supervised decoding then consists of finding a decoding function $f_{\text{reg}}: \mathbb{R}^{C \times T} \rightarrow \mathbb{R}^{F \times T}$ such that $f_{\text{reg}}$ predicts Y given X. We denote by $\hat{Y} = f_{\text{reg}}(X)$ the representation of speech decoded from the brain; $f_{\text{reg}}$ can be a deep neural network, and a regression loss then looks like
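for example, a standard mean-squared-error objective over frequency bands and time steps (a generic choice written out here for concreteness, not necessarily the paper's exact formulation):

$$\mathcal{L}_{\text{reg}}(Y, \hat{Y}) = \frac{1}{F\,T} \sum_{f=1}^{F} \sum_{t=1}^{T} \left( Y_{f,t} - \hat{Y}_{f,t} \right)^2$$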
But this regression loss faces several challenges: decoding predictions tend to be dominated by a non-distinguishable broadband component when speech is present. To address this, Meta AI made three main contributions: the introduction of a contrastive loss, a pre-trained deep speech representation, and a dedicated brain decoder.
If we want to change the loss function, we can use NCE (noise-contrastive estimation): https://zhuanlan.zhihu.com/p/334772391
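A minimal PyTorch sketch of such a CLIP-style / InfoNCE contrastive objective, assuming the brain decoder outputs a latent `z_brain` and the speech module outputs a latent `z_speech` of the same shape (function and variable names here are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(z_brain, z_speech, temperature=0.1):
    """CLIP-style / InfoNCE loss: each brain segment must identify the
    matching speech segment among the other segments in the batch.

    z_brain:  (B, D) latent predicted from the brain recording
    z_speech: (B, D) latent of the true speech segment
    """
    # L2-normalize so the dot product is a cosine similarity
    z_brain = F.normalize(z_brain, dim=-1)
    z_speech = F.normalize(z_speech, dim=-1)

    # (B, B) similarity matrix: entry (i, j) compares brain i with speech j
    logits = z_brain @ z_speech.t() / temperature

    # The positive pair for row i is column i
    targets = torch.arange(z_brain.shape[0], device=z_brain.device)

    # Symmetric cross-entropy: brain -> speech and speech -> brain
    loss = 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
    return loss


# Hypothetical usage with random tensors
z_brain = torch.randn(32, 256)   # batch of 32 brain latents
z_speech = torch.randn(32, 256)  # matching speech latents
print(clip_style_contrastive_loss(z_brain, z_speech))
```

With this objective the model only needs to make the correct speech segment more similar to the brain latent than the other candidates in the batch, rather than reconstruct every spectrogram value, which sidesteps the broadband-component issue of the regression loss.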
A convolutional neural network stacked onto a 'Subject Layer' and trained with a contrastive objective to predict the deep representations of the audio waveform learnt by a dedicated module pretrained on 56k hours of speech.
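A minimal sketch of what such a subject-conditioned convolutional brain decoder could look like (the layer sizes, names, and the per-subject 1x1 channel remapping are assumptions for illustration, not the exact architecture from the paper):

```python
import torch
import torch.nn as nn

class SubjectLayer(nn.Module):
    """Per-subject channel remapping (an assumed, simplified version of the
    paper's 'Subject Layer'): one learned (C, C) mixing matrix per subject."""
    def __init__(self, n_subjects, n_channels):
        super().__init__()
        self.weights = nn.Parameter(torch.eye(n_channels).repeat(n_subjects, 1, 1))

    def forward(self, x, subject_idx):
        # x: (B, C, T), subject_idx: (B,) integer subject ids
        w = self.weights[subject_idx]          # (B, C, C)
        return torch.bmm(w, x)                 # remap channels per subject

class BrainDecoder(nn.Module):
    """Illustrative conv stack mapping (B, C, T) MEG/EEG to (B, D, T) latents."""
    def __init__(self, n_subjects, n_channels, hidden=320, latent_dim=256):
        super().__init__()
        self.subject_layer = SubjectLayer(n_subjects, n_channels)
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(hidden, latent_dim, kernel_size=1),
        )

    def forward(self, x, subject_idx):
        x = self.subject_layer(x, subject_idx)
        return self.conv(x)                    # (B, latent_dim, T)

# Hypothetical usage: 2 subjects, 208 sensors, 3-second segments at 120 Hz
model = BrainDecoder(n_subjects=2, n_channels=208)
x = torch.randn(4, 208, 360)
subjects = torch.tensor([0, 1, 0, 1])
print(model(x, subjects).shape)  # torch.Size([4, 256, 360])
```

The decoder's output latents would then be compared against the pretrained speech module's representations with the contrastive loss above, so the same network can be trained across subjects while the Subject Layer absorbs per-subject sensor differences.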