SAGNIKMJR / move2hear-active-AV-separation

Code and datasets for 'Move2Hear: Active Audio-Visual Source Separation' (ICCV 2021)
MIT License

Why predict a monaural spectrogram #4

Closed by sreeharshaparuchur1 1 year ago

sreeharshaparuchur1 commented 1 year ago

Hi @SAGNIKMJR,

I'd like to ask why the paper predicts a monaural spectrogram rather than a binaural one. Could you explain why predicting binaural audio is prone to trivial but non-useful solutions, while monaural predictions do not suffer from the same degeneracy?

Thank you

SAGNIKMJR commented 1 year ago

Hi @sreeharshaparuchur1,

If we were to predict binaural audio, the agent could learn to go very far away from the target source, where the target binaural has a very small amplitude, and predict a near-zero estimate. That would give a lower prediction error than staying close to the target source and, as a result, improve the reward, but it wouldn't really help solve the task. Does that answer your question?

sreeharshaparuchur1 commented 1 year ago

Hi,

As mentioned in the paper, the binaural spectrograms are provided to the model during training. If that ground-truth data is used in the learning process, why would the policy take the agent away from the target source?

SAGNIKMJR commented 1 year ago

Even if you use the ground-truth binaurals for training the separator, the binaural separation loss would be low at far-off locations, because the target binaural itself has a very small amplitude there. Now, if you use the reward formulation from the paper and just replace the monaural separation loss with the binaural separation loss, the sparse reward (-10 * L^R_T) would be high at those locations and could induce a policy that just learns to take the agent away to distant locations.
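To make the degeneracy concrete, here is a minimal toy sketch in Python. It is not the code from the paper or the repo. It assumes a made-up four-bin target, an inverse-distance (1/d) amplitude falloff for the binaural target, a separator that outputs all zeros, and a plain L1 loss; all function and variable names are hypothetical. Only the -10 * loss form of the sparse reward comes from the discussion above.

```python
# Toy illustration only. The 1/d attenuation, the all-zero predictor,
# the L1 loss, and the target values are assumptions, not the paper's setup.

def l1_loss(pred, target):
    """Mean absolute error between two equal-length lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def sparse_reward(loss):
    """Final-step reward of the form -10 * loss, as quoted in the thread."""
    return -10.0 * loss


# Hypothetical anechoic (monaural) target magnitudes for one spectrogram frame.
MONO_TARGET = [1.0, 0.5, 0.25, 0.5]


def binaural_target(distance):
    """Assumed toy model: the received magnitude decays as 1 / distance."""
    return [m / distance for m in MONO_TARGET]


def rewards_for_zero_prediction(distance):
    """Rewards earned by a separator that outputs silence at a given distance.

    Returns (binaural_reward, monaural_reward).
    """
    zeros = [0.0] * len(MONO_TARGET)
    binaural = sparse_reward(l1_loss(zeros, binaural_target(distance)))
    # The monaural target does not depend on where the agent stands.
    monaural = sparse_reward(l1_loss(zeros, MONO_TARGET))
    return binaural, monaural


if __name__ == "__main__":
    for d in (1.0, 4.0, 16.0):
        b, m = rewards_for_zero_prediction(d)
        print(f"distance={d:5.1f}  binaural reward={b:8.4f}  monaural reward={m:8.4f}")
```

Under these assumptions, the binaural reward for predicting silence climbs toward zero as the distance grows, while the monaural reward stays fixed, so moving away only pays off under the binaural loss.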

Another reason for not doing binaural separation is that anechoic (monaural) audio is always better for understanding what is being said, in the case of speech, or played, in the case of music, since it is free of spatial effects such as reverb and echo. When you are doing separation, one of the objectives is to better understand the sounds individually. That's why monaural separation makes a lot more sense here than binaural separation. You can also look up papers showing that automatic speech recognition (ASR) is much easier on anechoic sounds than on spatial sounds.