fakufaku / fast_bss_eval

A fast implementation of bss_eval metrics for blind source separation
https://fast-bss-eval.readthedocs.io/en/latest/
MIT License

Results of the SIR Evaluation #14

Closed Shin-ichi-Takayama closed 2 years ago

Shin-ichi-Takayama commented 2 years ago

Hello. I have a question about the SIR evaluation.

wav.zip — attached are the wav files we used for the evaluation:

voice_ref.wav: voice-only file
noise_ref.wav: noise-only file
mix.wav: file with voice and noise mixed
eval.wav: voice estimated from mix.wav

I evaluated the SIR of eval.wav using voice_ref.wav and noise_ref.wav as the reference signals, and obtained an SIR of 0.659 dB. Next, I evaluated the SIR of mix.wav against the same references, and obtained an SIR of 3.864 dB.

I had understood that the SIR would increase as the noise decreases, but this result is the opposite. Why does this happen? Is there a problem with my evaluation procedure?

Best regards.

fakufaku commented 2 years ago

There seems to be a time offset in your estimated signal, which is probably the cause of the problem. You should time-align the estimated signal with the reference first.
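Not from the original thread, but for illustration, here is a minimal sketch of one way to do that alignment with a cross-correlation. The helper name time_align is made up; it assumes 1-D numpy arrays ref and est at the same sampling rate:

import numpy as np
from scipy.signal import correlate

def time_align(ref, est):
    # find the lag that maximizes the cross-correlation between est and ref
    xcorr = correlate(est, ref, mode="full")
    lag = int(np.argmax(np.abs(xcorr))) - (len(ref) - 1)
    if lag > 0:
        # est is delayed with respect to ref: drop its first `lag` samples
        est = est[lag:]
    elif lag < 0:
        # est is ahead of ref: drop the first `-lag` samples of ref
        ref = ref[-lag:]
    # trim both signals to a common length
    n = min(len(ref), len(est))
    return ref[:n], est[:n]

With multiple references, one would estimate the lag against the clean voice reference only and then apply the same trim to all signals, so everything stays aligned.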

Shin-ichi-Takayama commented 2 years ago

Thank you for your prompt reply. I will align the time offsets and re-evaluate. Since the re-evaluation will take some time, I will close this issue for now. Best regards.

fakufaku commented 2 years ago

I manually found that there is an offset of ~1006 samples. After correcting it, I get

sdr: [ 13.56245943 -19.54378993]
sir: [ 31.22722523 -19.35784393]
sar: [13.64073051 13.64073051]
sdr_mix: [ 3.95153989 -3.68755054]
sir_mix: [ 3.95154065 -3.68755023]
sar_mix: [72.99887234 72.99887234]

which looks correct: after alignment, the SIR of the voice estimate (31.2 dB) is now well above the SIR of the raw mix (about 4 dB), as one would expect.

Here's the code I used.

import fast_bss_eval
import numpy as np
from scipy.io import wavfile

fs, audio_eval = wavfile.read("wav/eval.wav")
fs, audio_mix = wavfile.read("wav/mix.wav")
fs, audio_noise_ref = wavfile.read("wav/noise_ref.wav")
fs, audio_voice_ref = wavfile.read("wav/voice_ref.wav")

print(fs)

val = []

# the ~1006 sample offset was found manually; widen the range to search for it
for offset in range(1006, 1007):
    # drop the first `offset` samples of the estimate and the last `offset`
    # samples of the references/mix so all signals are aligned and equal length;
    # the single voice estimate is duplicated to match the two reference channels
    est = np.stack([audio_eval[offset:], audio_eval[offset:]])
    ref = np.stack([audio_voice_ref[:-offset], audio_noise_ref[:-offset]])
    mix = np.stack([audio_mix[:-offset], audio_mix[:-offset]])

    # metrics of the estimate and of the raw mix, both against the references
    sdr, sir, sar, perm = fast_bss_eval.bss_eval_sources(ref, est)
    sdr_mix, sir_mix, sar_mix, perm = fast_bss_eval.bss_eval_sources(ref, mix)

    val.append(sdr[0])

    print("sdr:", sdr)
    print("sir:", sir)
    print("sar:", sar)
    print("sdr_mix:", sdr_mix)
    print("sir_mix:", sir_mix)
    print("sar_mix:", sar_mix)

# index (within the offsets tried) of the offset giving the best voice SDR
best = np.argmax(val)
print(best, val[best])
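To search for the offset automatically rather than fixing it at 1006, the same loop can be run over a wider range, e.g. range(900, 1100); best is then an index into that range, so the actual sample offset is the range start plus best.
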
fakufaku commented 2 years ago

Such time offsets are usually introduced by STFT-based processing, e.g. by the framing and padding of the analysis/synthesis stages.
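As a quick illustration (not from the thread), the sketch below shows why such offsets matter: even an exact copy of the references scores poorly once it is shifted by more samples than the short distortion filter used by bss_eval can absorb. The signals, sampling rate, and 1006-sample shift are all made up for the example:

import numpy as np
import fast_bss_eval

fs = 16000
rng = np.random.default_rng(0)
# two channels of white noise standing in for the reference sources
ref = np.stack([rng.standard_normal(fs), rng.standard_normal(fs)])
# the "estimate" is the reference itself, just delayed by 1006 samples
est = np.roll(ref, 1006, axis=-1)

sdr, sir, sar, perm = fast_bss_eval.bss_eval_sources(ref, est)
print(sdr)  # poor scores despite the estimate being an exact, delayed copy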

Shin-ichi-Takayama commented 2 years ago

Thank you very much for your help.