fgnt / nara_wpe

Different implementations of "Weighted Prediction Error" for speech dereverberation
MIT License

get_power_online averages over (K + 𝛥 + 1) frames instead of (δ + 1) frames? #39

Closed Sciss closed 5 years ago

Sciss commented 5 years ago

I'm a bit confused here. The paper uses the PSD context parameter δ for PSD estimation in equations 5 and 17; however, both the OnlineWPE class's call and the get_power_online used by the notebook example use taps + delay + 1 (K + 𝛥 + 1) frames instead.

LukasDrude commented 5 years ago

Dear Hanns,

in the paper we define the variables as follows:

The link to _update_power_block() does not contain taps or psd_context. As far as I can see the same holds true for get_power_online().

In case you discuss this part of the example notebook:

```python
def aquire_framebuffer():
    buffer = list(Y[:taps+delay+1, :, :])
    for t in range(taps+delay+1, T):
        yield np.array(buffer)
        buffer.append(Y[t, :, :])
        buffer.pop(0)
```

The first occurrence of taps+delay+1 means that we take the first taps + delay + 1 frames of Y to prime the buffer. The second occurrence, in the range, means that the loop does not start at zero; rather, it starts at taps+delay+1.
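The sliding-window behaviour can be seen with toy dimensions (the shapes and values below are illustrative assumptions, not the notebook's actual STFT data):

```python
import numpy as np

# Toy dimensions; the notebook uses a real multi-channel STFT here.
T, F, D = 8, 3, 2          # frames, frequency bins, channels
taps, delay = 2, 1
Y = np.arange(T * F * D).reshape(T, F, D)

def aquire_framebuffer():
    # Prime the buffer with the first taps + delay + 1 frames ...
    buffer = list(Y[:taps + delay + 1, :, :])
    # ... then slide it one frame at a time over the rest of Y.
    for t in range(taps + delay + 1, T):
        yield np.array(buffer)
        buffer.append(Y[t, :, :])
        buffer.pop(0)

windows = list(aquire_framebuffer())
# Each yielded window holds taps + delay + 1 consecutive frames.
print(windows[0].shape)    # (4, 3, 2)
print(len(windows))        # T - (taps + delay + 1) = 4
```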

Does this help?

Sciss commented 5 years ago

Dear Lukas,

I might be wrong, as my Python and NumPy are quite weak, but as far as I can see, get_power is in both cases (self.buffer and Y_step) called with a matrix of shape [frequency_bins, channel, taps + delay + 1] and the default psd_context argument of zero. That means get_power returns np.mean(abs_square(signal), axis=-2), i.e. a shape of [frequency_bins, taps + delay + 1]. This is then averaged again as np.mean(get_power(...), -1), and thus across taps + delay + 1 frames.
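For reference, a minimal sketch of the averaging being described; abs_square and the shapes are reconstructed from the discussion rather than copied from the repository:

```python
import numpy as np

def abs_square(x):
    # |x|^2 of a complex array, elementwise.
    return x.real ** 2 + x.imag ** 2

def get_power(signal, psd_context=0):
    # Sketch of the psd_context=0 default described above:
    # a mean over the channel axis (axis=-2) only.
    assert psd_context == 0
    return np.mean(abs_square(signal), axis=-2)

frequency_bins, channels = 5, 3
taps, delay = 8, 2
rng = np.random.default_rng(0)
shape = (frequency_bins, channels, taps + delay + 1)
signal = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

power = get_power(signal)           # shape (frequency_bins, taps + delay + 1)
averaged = np.mean(power, axis=-1)  # averaged across all taps + delay + 1 frames
print(power.shape, averaged.shape)  # (5, 11) (5,)
```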

LukasDrude commented 5 years ago

I think the online implementation is not a good starting point for getting an idea of WPE. I would recommend looking at the math description in our paper and NTT's paper first.

I believe online WPE is correctly implemented but could use some polishing to make the code more readable.

Without knowing exactly what you do: I would try batch processing on non-overlapping blocks first (each maybe 30 s). Then I would go through the result and check if the cuts are audible. If they are, we can start to discuss an interpolation scheme.
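The suggested experiment could be sketched as follows; the block length, the identity placeholder for per-block batch processing, and the plain concatenation are all illustrative assumptions:

```python
import numpy as np

def process_in_blocks(x, block_len, process):
    """Apply `process` to non-overlapping blocks of x and concatenate."""
    blocks = [x[start:start + block_len]
              for start in range(0, len(x), block_len)]
    return np.concatenate([process(b) for b in blocks])

fs = 16000
block_len = 30 * fs                      # roughly 30 s per block, as suggested
x = np.random.randn(2 * block_len + 1234)

# Placeholder for batch WPE on one block; identity here so the
# sketch stays self-contained. Listen at the block boundaries of
# the real output to check whether the cuts are audible.
out = process_in_blocks(x, block_len, lambda b: b)
print(out.shape == x.shape)              # True
```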

LukasDrude commented 5 years ago

In addition: when you do the recommended experiment, you should try to calculate the STFT on the entire file; otherwise you already get artifacts at the 30 s boundaries from the STFT itself. There are ways around that (proper overlap-add at the borders), but that is out of scope for this experiment.
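The point about computing the STFT once can be sketched like this; the framing front end and all sizes are illustrative assumptions, not the repository's STFT:

```python
import numpy as np

def stft_frames(x, size=512, shift=128):
    # Crude STFT front end: frame the whole signal, window, FFT.
    n = 1 + (len(x) - size) // shift
    frames = np.stack([x[i * shift:i * shift + size] for i in range(n)])
    return np.fft.rfft(frames * np.hanning(size), axis=-1)

x = np.random.randn(16000)

# Frame the entire file once, then split the resulting frames into
# blocks, so block borders fall between frames rather than inside
# the analysis windows.
X = stft_frames(x)                     # all frames from the full signal
blocks = np.array_split(X, 4, axis=0)  # per-block WPE would happen here
print(X.shape)                         # (122, 257)
```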

Sciss commented 5 years ago

Ok thank you, I will try that.