SAGNIKMJR / move2hear-active-AV-separation

Code and datasets for 'Move2Hear: Active Audio-Visual Source Separation' (ICCV 2021)

Evaluating the output of the Target Audio Separator Network #6

Closed · sreeharshaparuchur1 closed this issue 1 year ago

sreeharshaparuchur1 commented 1 year ago

Hi @SAGNIKMJR ,

As part of my study, I was trying to qualitatively evaluate the output of the Target Audio Separator Network (TASN) by dumping the predicted monaural spectrogram obtained here for the nearTarget task.

I append each monaural spectrogram to a list, convert that list into a pandas DataFrame, and dump the DataFrame as a pickle file.
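Roughly, the dump step looks like this (pred_specs is just an illustrative name for the list):

import pandas as pd

# pred_specs: list of predicted monaural magnitude spectrograms (NumPy arrays), one per step
df = pd.DataFrame({"MFM_SPEC": pred_specs})
df.to_pickle("near_target_pred_monaurals.pkl")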

I used librosa to load the saved monaural, and converted it to .wav format with the following snippet:

import soundfile as sf

out_dir = '/home/sherlock/move2hear-active-AV-separation/visualizations/near_target_pa4o_15_explore_audio_intensity'
for mono in range(df.shape[0]):
    datapoint = df.iloc[mono]["MFM_SPEC"]
    ir = datapoint.reshape(1, -1)  # flatten the spectrogram into a single row
    # waveshow(ir[0, :15000], sr=48000)  # optional librosa.display plot
    # IPython.display.Audio(ir, rate=48000)  # only renders inside a notebook
    out_path = out_dir + '/monoFromMemSpectrogram' + str(mono) + '.wav'
    sf.write(out_path, ir.T, 48000)  # sf.write creates the file, so Path(...).touch() is unnecessary

In this Google Drive link I have uploaded the predicted monaural from memory for each time step in the time budget of the nearTarget task. I have also uploaded the ground-truth sound (belonging to the music class) that should be separated from the mixed binaural spectrogram. I cannot discern any increase or decrease in sound quality between two consecutive time steps (even though the metrics, STFT and SI-SDR, change), let alone relate the separated sound to the ground-truth sound.

Kindly let me know if I am making a mistake in my inference steps or what I can do to make sense of the dumped output.

SAGNIKMJR commented 1 year ago

Sorry for the late reply! There seems to be an issue with the way you are doing the inverse Fourier transform. How do you get the audio phase for going from a magnitude spectrogram to a waveform here?

sreeharshaparuchur1 commented 1 year ago

That's alright. I simply used librosa's istft function to go from the spectrogram to a time series, which is what I rendered. I have tried both the default parameters for hop_length, win_length, and n_fft and values taken from the paper (section 7.11.2). However, neither worked.
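For reference, the reconstruction step looked roughly like this (the hop/win values here are placeholders rather than the exact ones I tried):

import numpy as np
import librosa

spec = df.iloc[0]["MFM_SPEC"]  # one dumped magnitude spectrogram (freq bins x frames)
# istft expects a complex STFT matrix; a real-valued magnitude array
# is effectively treated as having zero phase
wav = librosa.istft(spec.astype(np.complex64), hop_length=512, win_length=2048)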

Kindly advise me on what I can do to carry out a qualitative analysis of the recovered monaural.

SAGNIKMJR commented 1 year ago

I meant: what phase do you use? The model doesn't predict phase, only magnitude spectrograms.

sreeharshaparuchur1 commented 1 year ago

Apart from the process above using the model output, I didn't make any further changes to incorporate the signal's phase, although I did run into relevant code for decomposing the audio signal in simulator_eval.py.

Does the model not predicting the phase spectrogram mean that the separated signal cannot be reconstructed into something interpretable? If so, how reliable/robust are the separation metrics (SI-SDR and STFT) at capturing the success of separation from the mixed audio signal if they only deal with the magnitude spectrogram and not the phase spectrogram?

SAGNIKMJR commented 1 year ago

The model doesn't predict phase because phase is largely uncorrelated with magnitude, and predicting it is often ill-posed. This is quite common in the AV separation literature (see the cited works: some use the mixed phase, some use the GT phase, and some estimate the phase with the Griffin-Lim method). Whichever method you pick, you need some phase to compute the iSTFT. I would suggest dumping the waveforms by tweaking the compute_wave_quality method (https://github.com/SAGNIKMJR/move2hear-active-AV-separation/blob/67b20957fc43e4e7b9ace50d0eb33af3d0246e2a/audio_separation/rl/ppo/ppo_trainer.py#LL1400C41-L1400C41)
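For instance, a rough sketch of two of those options (pred_mag and mixed_wav are illustrative names, and the STFT parameters here are placeholders that must match whatever was used to compute your spectrograms):

import numpy as np
import librosa

# Option 1: borrow the phase of the mixed input signal
mixed_stft = librosa.stft(mixed_wav, n_fft=512, hop_length=160, win_length=400)
pred_stft = pred_mag * np.exp(1j * np.angle(mixed_stft))
wav_mixed_phase = librosa.istft(pred_stft, hop_length=160, win_length=400)

# Option 2: estimate the phase iteratively with Griffin-Lim
wav_griffin_lim = librosa.griffinlim(pred_mag, n_iter=32, hop_length=160, win_length=400)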

sreeharshaparuchur1 commented 1 year ago

Thanks for the suggestion and for pointing me towards the relevant literature. I'll modify the compute_wave_quality function accordingly.