balthazarneveu / gyraudio

A novel method for audio source separation

Reading list #4

Open balthazarneveu opened 6 months ago

balthazarneveu commented 6 months ago

Supervised audio separation

U-Net on STFT (Jansson 17')

Wave UNet (Stoller 18')

Unsupervised audio separation

State of the art video https://youtu.be/u2F1zA3IAFc?si=Io42C0BphQy8rbvI

IMU to predict audio

Gyrophone

A totally unrelated but interesting paper: a static gyroscope is sensitive enough to pick up a bit of the audio signal from vibrations. The paper simply demonstrated a potential security breach in mobile devices. Pre-deep-learning era; it heavily relies on gyro aliasing! https://crypto.stanford.edu/gyrophone/

AccEar

Using the accelerometer to reconstruct audio from the loudspeaker! https://perfecthu.github.io/publications/Oakland22-AccEar.pdf

balthazarneveu commented 6 months ago

Supervised audio separation

Wave UNet

D. Stoller et al., Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation, ISMIR 2018

(figure: Wave-U-Net architecture and network dimensions)

Notes

1/ "3.2.1 Difference output layer" (stated for general K sources). For K=2 sources, predict the first source and simply define the second as the input minus the first. This is exactly what we do in our project; it is basically the same as a "denoising" problem.

2/ "3.2.2 Prediction with proper input context and resampling". Border effects will occur in any case. In our ResUNet code we use padding; a workaround worth considering is simply to trim ("crop") the output and discard samples near the borders.

3/ Linear upsampling instead of transposed convolution. We are already doing this in our ResUNet implementation. They go further and learn a trainable linear upsampling (alpha * x + (1 - alpha) * y), which allows compensating for potential shifts.

4/ In ResUNet, a max-pooling operator was used for the downsampling phase, which may add a bit of "wobble". We may take extra care to avoid shifting the signal by using down- and upsampling operators that are shift-neutral when combined (i.e. down & up combined do not shift the audio signal; the usual average pooling + bilinear interpolation induces a half-"pixel" shift).
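The difference output layer of note 1/ can be sketched in a few lines; here `estimated_source` stands in for the network's raw output (the model call itself is omitted):

```python
import numpy as np

def difference_output(mixture: np.ndarray, estimated_source: np.ndarray):
    """Difference output layer for K=2 sources: only the first source
    is predicted by the network; the second is defined as the residual,
    so the two estimates sum to the mixture by construction."""
    return estimated_source, mixture - estimated_source

# Toy check: a speech + noise mixture with a perfect speech estimate
# yields the noise exactly as the residual.
speech = np.array([0.1, -0.2, 0.3])
noise = np.array([0.05, 0.05, -0.1])
s_hat, n_hat = difference_output(speech + noise, speech)
```

The constraint "estimates sum to the input" comes for free, which is why this is equivalent to a denoising formulation when K=2.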
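The half-sample shift mentioned in note 4/ can be demonstrated numerically (a minimal sketch, using numpy instead of our actual pooling layers):

```python
import numpy as np

def naive_down_up(x: np.ndarray) -> np.ndarray:
    """Average-pool by 2, then linearly upsample back, placing pooled
    sample i at position 2*i (the usual convention). Since the pooled
    value really represents position 2*i + 0.5, this round trip shifts
    the signal by half a sample on average."""
    pooled = (x[0::2] + x[1::2]) / 2.0
    grid = np.arange(0, len(x), 2)  # where pooled samples are placed
    return np.interp(np.arange(len(x)), grid, pooled)

def centroid(x: np.ndarray) -> float:
    """Center of mass of a non-negative signal, in samples."""
    return float(np.sum(np.arange(len(x)) * x) / np.sum(x))

# Unit impulses at an even and an odd index come back shifted by
# 0 and 1 sample respectively, i.e. half a sample on average.
shifts = []
for k in (4, 5):
    impulse = np.zeros(16)
    impulse[k] = 1.0
    shifts.append(k - centroid(naive_down_up(impulse)))
```

The parity-dependent drift is exactly the "wobble" to avoid when choosing the down/upsampling pair.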

:warning: Using L=12 scales (or layers) requires very long audio sequences: a size-15 convolution at the lowest scale means a very large receptive field, at the cost of trickier border management.
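A back-of-the-envelope estimate of that receptive field (an assumed simplification where each level applies one size-15 convolution then decimates by 2, not the paper's exact context computation):

```python
def wave_unet_context(num_levels: int = 12, kernel_size: int = 15) -> int:
    """Rough receptive field of the downsampling path: a kernel at
    level l spans (kernel_size - 1) * 2**l input samples, accumulated
    over all levels."""
    rf = 1
    for level in range(num_levels):
        rf += (kernel_size - 1) * 2 ** level
    return rf
```

With 12 levels this already exceeds 57k input samples, i.e. several seconds of audio at typical sample rates, which is why short training excerpts are problematic.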

balthazarneveu commented 6 months ago

Supervised audio separation

U-Net on STFT

A. Jansson et al., Singing Voice Separation with Deep U-Net Convolutional Networks, ISMIR 2017
