Picovoice / porcupine

On-device wake word detection powered by deep learning
https://picovoice.ai/
Apache License 2.0

Model performance varies significantly depending on wakeword temporal separation in audio #808

Closed: dscripka closed this issue 1 year ago

dscripka commented 1 year ago

I've noticed an odd issue when attempting to benchmark the performance of Porcupine models against audio files with different characteristics (background noise, SNR, etc.). Specifically, there seems to be significant variation in the model's true positive rate simply from changing the temporal spacing of the wake word in the testing data. For example, when using the "Alexa" dataset and the pre-trained "alexa_linux.ppn" from the latest version of Porcupine, I see the true positive rate of the model behave as shown below:

[Figure: true positive rate of the "alexa_linux.ppn" model plotted against the temporal separation of wake words in the test clips]

Happy to provide additional details and even the test files that were created, if that would be useful.

I've also noticed similar performance variation with respect to wake word temporal separation when using custom Porcupine models and manually recorded test clips, so it seems possible that the issue is not limited to just the "alexa_linux.ppn" model.

Expected behaviour

The model should perform similarly regardless of the temporal separation of wake words in an input audio stream.

Actual behaviour

The model shows variations of up to 10 percentage points in the true positive rate depending on the temporal separation of wake words.

Steps to reproduce the behaviour

1) Use the "Alexa" dataset from here

2) Using the functions in mixer.py as a foundation, create test clips of varying lengths by mixing with background noise from the DEMAND dataset (specifically, the "DLIVING" recording). The SNR was fixed at 10 dB, and the same segment of the noise audio file was used for every test clip. Each test clip was converted to 16-bit, 16 kHz, single-channel WAV format (see the sketch after this list).

3) Initialize Porcupine and run the test clips sequentially through the model using the default frame length (512 samples) and the default sensitivity (0.5). Count all of the true positive predictions and divide by the total number of test clips to calculate the true positive rate.
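For concreteness, here is a minimal sketch of the clip-generation step using numpy and soundfile rather than the actual mixer.py helpers; the file names and the `pad_s` padding amount are illustrative, not taken from the real benchmark:

import numpy as np
import soundfile as sf

def snr_scale(speech, noise, snr_db):
    """Return the factor by which `noise` must be scaled so that the
    speech-to-noise ratio is `snr_db`."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    return np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))

# Hypothetical inputs: a wake word utterance and the DEMAND "DLIVING"
# recording, both already resampled to 16 kHz mono.
speech, sr = sf.read("alexa_utterance.wav", dtype="float64")
noise, _ = sf.read("DLIVING_ch01.wav", dtype="float64")

# Pad the utterance with silence to control the temporal separation of the
# wake word within the clip; varying `pad_s` changes the clip length.
pad_s = 2.0
pad = np.zeros(int(pad_s * sr))
padded = np.concatenate([pad, speech, pad])

# Mix with the same leading segment of the noise file at a fixed 10 dB SNR,
# then save as a 16-bit, 16 kHz, single-channel WAV file.
noise_seg = noise[: len(padded)]
mixed = padded + snr_scale(speech, noise_seg, snr_db=10) * noise_seg
sf.write("test_clip.wav", mixed, sr, subtype="PCM_16")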

kenarsa commented 1 year ago

Assuming the horizontal axis is seconds, I don't think this can be the case. BTW, this issue is related to a different repo. Do you mind closing this and re-opening it within the benchmark repo? It is quite possible there is a bug in the benchmark code, but it's too early to say.

dscripka commented 1 year ago

@kenarsa, the wakeword benchmark code was only used to create the test audio clips, not to create the plot or feed the data to the model. That was done manually using the Python bindings for Porcupine. For example:

import pvporcupine

# Instantiate the Porcupine model (access_key, keyword_paths, and
# sensitivities are defined elsewhere)
porcupine = pvporcupine.create(
    access_key=access_key,
    keyword_paths=keyword_paths,
    sensitivities=sensitivities
)

# Assuming `clip` is a np.array of 16-bit integer samples, feed it to the
# model one full frame at a time; `process` returns the index of the detected
# keyword, or -1 if no keyword was detected in the frame
for i in range(0, len(clip) - porcupine.frame_length + 1, porcupine.frame_length):
    frame = clip[i:i + porcupine.frame_length]
    result = porcupine.process(frame)
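To make the aggregation from step 3 explicit, here is a sketch of how the per-clip results can be reduced to a true positive rate; `test_clip_paths` and `read_wav_as_int16` are hypothetical stand-ins for however the generated WAV files are listed and loaded, and a clip counts as a true positive if a detection fires anywhere in it:

detections = 0
for path in test_clip_paths:  # hypothetical list of generated test WAV files
    clip = read_wav_as_int16(path)  # hypothetical loader returning int16 samples
    detected = False
    for i in range(0, len(clip) - porcupine.frame_length + 1, porcupine.frame_length):
        # `process` returns the detected keyword index, or -1 for no detection
        if porcupine.process(clip[i:i + porcupine.frame_length]) >= 0:
            detected = True
            break
    detections += detected

true_positive_rate = detections / len(test_clip_paths)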

But I can still move the issue to the wakeword benchmark repo, if you prefer.

kenarsa commented 1 year ago

yes, please. Please close this one when the new issue is up, and we will make time to look into it.