craffel / mir_eval

Evaluation functions for music/audio information retrieval/signal processing algorithms.
MIT License

Use mir_eval.separation.evaluate with speech and noise signals. #256

Open LukasDrude opened 7 years ago

LukasDrude commented 7 years ago

@faroit

I would like to use mir_eval.separation.evaluate to evaluate the separation performance (SDR, SIR, SNR) of a separation system in the presence of noise.

We may assume the following:

x_1: Clean speech, speaker 1
x_2: Clean speech, speaker 2
n: Additive noise

Now we may have a system S, which estimates three (possibly permuted) enhanced signals:

z_1: Enhanced signal 1
z_2: Enhanced signal 2
z_3: Enhanced signal 3

How would I use the function mir_eval.separation.evaluate to evaluate the result? It currently only accepts K reference signals and K estimated signals, and has no additional input for noise signals.

If we find a good solution, we may add it to the docs later.

faroit commented 7 years ago

I am not sure I understand your setup. In particular, I don't really understand what the three estimates would represent. Can you explain this in more detail, please? Also, in speech enhancement people more commonly use perceptual quality measures such as PESQ, so I am not sure bsseval would be the best fit here.

aliutkus commented 7 years ago

I think there are several things to say on this matter, and two main routes for dealing with this issue:

A/ having a noise image

=> To me, the best practice is hence to always know the true noise, as well as the true "target" source images, for evaluation.
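A minimal sketch of route A, assuming the true noise image is available: the noise is simply stacked as a third reference, so that the number of references matches the number of estimates. The `simple_sdr` helper, the `best_permutation` function and the synthetic signals are hypothetical stand-ins for illustration and are not bsseval; with real data, the same stacked arrays could be passed to a function such as `mir_eval.separation.bss_eval_images`.

```python
import itertools

import numpy as np


def simple_sdr(reference, estimate):
    """Plain signal-to-distortion ratio in dB (a stand-in, NOT bsseval)."""
    error = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / (np.sum(error ** 2) + 1e-12))


def best_permutation(references, estimates):
    """Exhaustively match each reference to one estimate, maximising mean SDR."""
    best_perm, best_score = None, -np.inf
    for perm in itertools.permutations(range(len(estimates))):
        score = np.mean(
            [simple_sdr(references[i], estimates[j]) for i, j in enumerate(perm)]
        )
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score


# Synthetic placeholders for the two speakers and the additive noise.
rng = np.random.default_rng(0)
n_samples = 8000
x_1 = rng.standard_normal(n_samples)
x_2 = rng.standard_normal(n_samples)
n = 0.1 * rng.standard_normal(n_samples)

# The true noise is treated as a third reference: shape (3, n_samples).
references = np.stack([x_1, x_2, n])

# Hypothetical system output: permuted and slightly corrupted copies.
estimates = np.stack([n, x_2, x_1]) + 0.01 * rng.standard_normal((3, n_samples))

# perm[i] is the index of the estimate assigned to reference i.
perm, mean_sdr = best_permutation(references, estimates)
print(perm, mean_sdr)
```

With the noise included as a reference, the K-references / K-estimates constraint is satisfied without any change to the evaluation function itself.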

If this is not the case and you only know the target sources, and not the actual images along with the true noise, this raises a related and interesting question: => how to estimate the images of those two sources within the mix, as well as a further image for the noise?

This question is probably ill-posed, because it requires some prior assumptions on what the noise should be like. If we were to implement this computation of the target images + noise within the evaluation function, we would be arbitrarily making such an assumption. That said, since these computations DO NOT use the estimates, but only the references and the mix, this is OK: we would not have the flaws of bsseval_sources, which exploits the estimates to compute the references. However, I don't see any particular consensus on how to compute these "groundtruth images" from the groundtruth sources. Anyway, this would be a separate module that has no particular connection with mir_eval.

B/ trying all combinations

A simple solution could be to try all the possible combinations, and to discard the estimate that gives the worst performance as being the noise source. This would allow inputting one extra estimate, at the cost of doing more computations. Still, doing this DOES NOT totally solve the problem I see with your setup, because it probably means using bsseval_sources, which I again strongly advise you not to do.
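A rough sketch of route B, under stated assumptions: K references, K+1 estimates, and a brute-force search over every way of assigning K of the estimates to the references. The left-over estimate is treated as the noise. The `simple_sdr` metric and the `evaluate_with_extra_estimate` helper are hypothetical and only illustrate the search; they are not part of mir_eval and do not reproduce bsseval.

```python
import itertools

import numpy as np


def simple_sdr(reference, estimate):
    """Plain signal-to-distortion ratio in dB (a stand-in, NOT bsseval)."""
    error = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / (np.sum(error ** 2) + 1e-12))


def evaluate_with_extra_estimate(references, estimates):
    """Try every assignment of estimates to references; discard the rest.

    Returns the best mean score, the assignment (assignment[i] is the index
    of the estimate matched to reference i) and the discarded estimate indices.
    """
    n_ref = len(references)
    best_score, best_assignment = -np.inf, None
    # Ordered selections of n_ref estimates out of all available estimates.
    for assignment in itertools.permutations(range(len(estimates)), n_ref):
        score = np.mean(
            [simple_sdr(references[i], estimates[j]) for i, j in enumerate(assignment)]
        )
        if score > best_score:
            best_score, best_assignment = score, assignment
    discarded = sorted(set(range(len(estimates))) - set(best_assignment))
    return best_score, best_assignment, discarded


# Synthetic placeholders: two speakers as references, three estimates.
rng = np.random.default_rng(0)
n_samples = 8000
x_1 = rng.standard_normal(n_samples)
x_2 = rng.standard_normal(n_samples)
n = 0.1 * rng.standard_normal(n_samples)

references = np.stack([x_1, x_2])  # K = 2 references
estimates = np.stack([n, x_2, x_1]) + 0.01 * rng.standard_normal((3, n_samples))

score, assignment, discarded = evaluate_with_extra_estimate(references, estimates)
print(score, assignment, discarded)
```

The number of assignments grows factorially with the number of sources, so this is only practical for a handful of signals.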