Open dscripka opened 2 years ago
Precise is meant to operate on a continuous stream of audio. For this reason, it is only trained to output a high score for the frames immediately after the wake word. If you want to test a model against an entire audio sample, you should take the maximum of all the output values.
Let me know if that makes sense.
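To make the suggestion concrete, here is a minimal sketch of clip-level scoring by taking the maximum over the per-frame outputs. It assumes the frame scores have already been collected into a sequence; `clip_score`, `detected`, and the 0.5 threshold are hypothetical names and values for illustration, not part of Precise.

```python
import numpy as np

def clip_score(frame_scores):
    """Return the clip-level score: the maximum over all per-frame scores."""
    scores = np.asarray(frame_scores, dtype=float)
    if scores.size == 0:
        raise ValueError("no frame scores to aggregate")
    return float(scores.max())

def detected(frame_scores, threshold=0.5):
    """Hypothetical decision rule: the clip counts as a detection if any
    frame reaches the threshold (0.5 is an illustrative value only)."""
    return clip_score(frame_scores) >= threshold
```

For example, `clip_score([0.01, 0.23, 0.05])` returns `0.23`, which would not count as a detection at the illustrative threshold.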
@MatthewScholefield, yes, that's what the plot is showing: the model score for every frame of the two input audio samples. For the first clip, the maximum score is around 0.23, and for the second clip (where the only difference is a single extra frame of zero-padding), the maximum score is only around 0.06.
It might be clearer if I make the two plots separate. So this is the model's score for all of the frames of the first input clip:
And this is the model score for the second input clip:
So even if I use the maximum of all the outputs, I get a very different value for an otherwise identical audio clip.
Oh, I see, thanks for clarifying. This is definitely not intended. Just for some clarity on how it works: the model is fed the audio features (MFCCs) for the last buffer_t seconds and produces each output independently from that buffer. You can see the value of buffer_t by looking in the .params file. Overall there are two hypotheses that I can come up with as to why this could occur:
Looks like for this model buffer_t is set at 1.5 seconds. The input audio clip is just over 4 seconds, with roughly 1.5 seconds of background mic noise before the wake word.
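To illustrate why a one-frame shift would ideally not matter, here is a rough sketch of a sliding buffer over raw samples. This is not Precise's actual implementation: the 16 kHz sample rate, the 1024-sample hop, the left zero-fill of early buffers, and the name `buffered_windows` are all assumptions, and the real pipeline computes MFCCs on top of the audio.

```python
import numpy as np

def buffered_windows(audio, sample_rate=16000, buffer_t=1.5, hop=1024):
    """Return one window of the last `buffer_t` seconds for every `hop` samples.

    Assumed behaviour (not taken from Precise): windows that start before
    the beginning of the clip are filled with zeros on the left.
    """
    audio = np.asarray(audio, dtype=np.float32)
    win = int(round(buffer_t * sample_rate))
    windows = []
    for end in range(hop, len(audio) + 1, hop):
        chunk = audio[max(0, end - win):end]
        if len(chunk) < win:
            fill = np.zeros(win - len(chunk), dtype=np.float32)
            chunk = np.concatenate([fill, chunk])
        windows.append(chunk)
    if not windows:
        return np.empty((0, win), dtype=np.float32)
    return np.stack(windows)

# Compare a ~4 second clip with the same clip shifted by one hop of zeros.
rng = np.random.default_rng(0)
clip = rng.standard_normal(64000).astype(np.float32)
shifted = np.concatenate([np.zeros(1024, dtype=np.float32), clip])
same = np.array_equal(buffered_windows(clip), buffered_windows(shifted)[1:])
```

Under these idealized assumptions, `same` is `True`: every buffer of the shifted clip after the first matches a buffer of the original clip exactly, so a model that scores each buffer independently would see the same inputs one step later.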
That's a great point about zero-padding potentially causing an issue with the MFCC features. Here are some plots where I just duplicate the initial mic background noise for ~1 second as padding instead of zeros (so now there is ~2.5 seconds of background noise before the wake word):
Clip 1
Clip 2
Again, the only difference between clip 1 and clip 2 is 1024 more samples of background-noise padding in clip 2; the actual wake word utterance is identical. There still seems to be a significant difference between the two, in both the maximum score and the overall trend of the frame scores over time.
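For reference, the two padding variants discussed in this thread (zeros versus duplicated background noise) can be sketched as below. `pad_front` is a hypothetical helper, not the code used to produce the plots, and it assumes the clip is already loaded as a 1-D NumPy array of samples.

```python
import numpy as np

def pad_front(audio, n_pad, noise=None):
    """Prepend `n_pad` samples to `audio`.

    If `noise` is None the padding is zeros; otherwise the `noise` segment
    (for example the clip's leading background noise) is repeated and
    truncated to exactly `n_pad` samples.
    """
    audio = np.asarray(audio)
    if n_pad < 0:
        raise ValueError("n_pad must be non-negative")
    if noise is None:
        pad = np.zeros(n_pad, dtype=audio.dtype)
    else:
        noise = np.asarray(noise, dtype=audio.dtype)
        if noise.size == 0:
            raise ValueError("noise segment is empty")
        reps = -(-n_pad // noise.size)  # ceiling division
        pad = np.tile(noise, reps)[:n_pad]
    return np.concatenate([pad, audio])
```

Calling `pad_front(clip, 1024)` gives the one-extra-frame zero-padded variant, while `pad_front(clip, 16000 + 1024, noise=clip[:16000])` would give roughly one second of duplicated noise plus one extra frame, assuming a 16 kHz clip whose first second contains only background noise.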
Describe the bug When using the Python bindings for Precise, I've noticed that the model predictions can vary substantially depending on where in the input audio the wake word is located. For example, the plot below shows the default "hey mycroft" model score for two repetitions of the same audio clip, where the only difference is that the second clip has one additional frame (1024 samples) of zero-padding compared to the first clip:
I'm currently evaluating Precise against other wakeword solutions, and this behavior makes it difficult to assess performance accurately, because the length and padding of the test clips can significantly change the false-positive and false-negative metrics.
Is this behavior expected? If so, is there a recommended way to evaluate the model to minimize such effects?
To Reproduce The following code should reproduce the plot above, using the audio file attached below and the model versions referenced in the code:
test_clip.zip
Expected behavior Precise should have very similar scores for otherwise identical audio that just occurs at a different position in the audio stream.