MycroftAI / mycroft-precise

A lightweight, simple-to-use, RNN wake word listener
Apache License 2.0

False negative categorization and precise-test disagrees with precise-listen #27

bml1g12 commented 6 years ago

Using v0.2.1

I have trained a model as per instructions on https://github.com/MycroftAI/mycroft-precise/wiki/Training-your-own-wake-word

precise-test gives the following output:

=== False Positives ===
/home/ben/software/mycroft-precise/hey-computer/test/not-wake-word/generated/train_complete_ride_between_two_stations-777.wav
/home/ben/software/mycroft-precise/hey-computer/test/not-wake-word/generated/train_complete_ride_between_two_stations-637.wav

=== False Negatives ===
/home/ben/software/mycroft-precise/hey-computer/test/not-wake-word/generated/train_complete_ride_between_two_stations-802.wav
/home/ben/software/mycroft-precise/hey-computer/test/not-wake-word/generated/train_complete_ride_between_two_stations-368.wav

=== Counts ===
False Positives: 2
True Negatives: 2238
False Negatives: 2
True Positives: 11

=== Summary ===
2249 out of 2253
99.82 %

0.09 % false positives
15.38 % false negatives
  1. False negatives are defined as "Was a wake word, but model incorrectly predicts it was not", but precise-test seems to be erroneously finding false negatives in the "test/not-wake-word/" subfolder, as can be seen from the output above. As I understand it, a false negative should only come from the /wake-word/ subfolder. Do you know why this might be?

  2. I used pavucontrol to play the test data .wav files manually and route the audio back into precise-listen as a microphone input. I find that only about 3 of the 13 files in /hey-computer/test/wake-word/ activate strongly enough to trigger "activate.wav", with many producing almost no "probability" on the "------XXXX" style visualization. Is there any pre-processing done in the precise-test "validation" step (e.g. setting the volume, clipping the file, changing formats, etc.) that differs from precise-listen? Also, is the probability threshold for activation the same?

Thanks for any help

MatthewScholefield commented 6 years ago

The problem causing issue 1 is that we are now shuffling the input data but not shuffling the filenames, which makes the reported filenames completely wrong. As a temporary fix, you can comment out precise/train_data.py:214 so that it doesn't shuffle the data, but a proper fix will take a little restructuring. Edit: I've created issue #28 to track this.
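
The proper fix amounts to shuffling the inputs, labels, and filenames in lockstep, roughly like this sketch (the names here are illustrative, not the actual train_data internals, and it assumes the inputs and labels are numpy arrays):

```python
import numpy as np

def shuffle_together(inputs, outputs, filenames, seed=None):
    """Apply one permutation to all three parallel sequences so each
    input keeps its label and its filename."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(inputs))
    return inputs[order], outputs[order], [filenames[i] for i in order]
```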

As for number 2, the only differences I can think of are the following:

So let me know if, by changing the volume level, you can get it to activate similarly to precise-test; if not, we can debug further.
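
If you want a quick way to test the volume theory, something like this (not part of Precise, just peak-normalizing your test wavs before looping them back in) should do:

```python
import numpy as np
from scipy.io import wavfile

def peak_normalize(in_path, out_path, target=0.9):
    """Scale a 16-bit wav so its loudest sample sits at `target` of full scale."""
    rate, data = wavfile.read(in_path)
    samples = data.astype(np.float32) / 32768.0
    peak = max(np.abs(samples).max(), 1e-9)  # avoid dividing by zero on silent files
    wavfile.write(out_path, rate, (samples * target / peak * 32767).astype(np.int16))
```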

bml1g12 commented 6 years ago

Thanks for the feedback.

Just to make sure I understand: is it correct that issue 1 is only an issue with the filename strings being output, i.e. it really did find two genuine false negatives (we just don't know their filenames), as opposed to getting confused over which data is a wake word due to a filename error?

Could I also confirm I understand the methodology:

a) The model is trained using binary cross-entropy with a bias towards false negatives, where each training datum is a .wav file of variable length.

b) It takes a 10 ms window and moves it in 5 ms "hops", and the FFT and corresponding MFCCs are calculated within each 10 ms window.

I'll investigate the volume aspect

MatthewScholefield commented 6 years ago

Yes, it is only an issue with displaying the filenames.

a) Each training input is a wave file of fixed length: it uses the last 1.5 seconds of the file, or zero-pads it if it is shorter. I hadn't thought of this, but if your files are shorter than 1.5 seconds, this could be a reason for the difference between precise-test and precise-listen, although I would assume the results would still be similar, since before playing the wave your audio output would be all zeros.
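
Roughly (a sketch, assuming 16 kHz mono samples in a numpy array; not the exact buffering code):

```python
import numpy as np

def to_fixed_buffer(audio, sample_rate=16000, buffer_sec=1.5):
    """Keep the last `buffer_sec` seconds of audio, left-padding with
    zeros when the clip is shorter than the buffer."""
    n = int(buffer_sec * sample_rate)
    if len(audio) >= n:
        return audio[-n:]
    return np.concatenate([np.zeros(n - len(audio), dtype=audio.dtype), audio])
```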

b) It takes a 100 ms window and moves it 50 ms at a time to the right across the 1.5 second buffer. So each input has a length of (1500 - 100) // 50 + 1 = 29 frames.

The 0th MFCC coefficient is actually replaced with the log of the total filterbank energy, which is sort of like the volume. Regardless, the MFCCs beyond the 0th coefficient still change slightly with volume.
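
Roughly, the per-buffer feature extraction looks like this (using python_speech_features purely for illustration; Precise has its own vectorization code, and the 13 coefficients / 26 filters here are assumptions, not necessarily Precise's defaults):

```python
from python_speech_features import mfcc  # illustration only; not what Precise uses internally

def vectorize(audio, sample_rate=16000):
    """1.5 s buffer -> (29, 13) feature matrix: 100 ms windows, 50 ms hops,
    with the 0th cepstral coefficient swapped for the log frame energy."""
    return mfcc(audio, samplerate=sample_rate,
                winlen=0.1, winstep=0.05,        # 100 ms window, 50 ms hop
                numcep=13, nfilt=26, nfft=2048,  # nfft must cover the 1600-sample window
                appendEnergy=True)               # replace c0 with log energy
```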

bml1g12 commented 6 years ago

I see thank you.

a) Indeed most of my files are less than 1.5 seconds.

If I play a sound (looping the audio back into precise-listen as an input) that precise-test reports as a True Positive, sometimes precise-listen only shows a small signal. By playing with the volume and the timing of when I play the sound, I can usually get these "problem" .wav files to activate, but not consistently. It is unfortunate that there seems to be quite a disparity between real-time and offline test results in this case. Is this something you've experienced before? If not, then maybe it is something to do with my sound device.

I am also occasionally getting some weird behaviour. For example, if I play a sound which precise-listen correctly determines to be a wake word and then wait, it works well. But if I play it, say, 4 times quickly in succession, it gets into what looks like a sort of feedback loop where it activates maybe 12 times, with a pause between each activation. I'm not really sure what is going on.

b) I see thank you

c) I previously tried a model for keyword detection based on Andrew Ng's deep learning course, but found it often gave false positives on any loud noise. That model uses the raw FFT spectrogram and processes it like an image with a convolutional filter, followed by a time-distributed GRU, so it produces a set of predictions over time rather than a binary prediction for a given window (something like that shown here). It has vastly more parameters than mycroft-precise. I thought MFCCs might be less volume-sensitive, which is partly what led me here. mycroft-precise is also a lot less computationally expensive to train!
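
For reference, the rough shape of that course-style model is something like this (layer sizes are from memory and only illustrative, not the exact coursework):

```python
from tensorflow.keras import layers, models

def build_trigger_model(n_timesteps, n_freq):
    """Spectrogram frames in, one wake-word probability per timestep out."""
    inp = layers.Input(shape=(n_timesteps, n_freq))
    x = layers.Conv1D(196, kernel_size=15, strides=4, activation="relu")(inp)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.8)(x)
    x = layers.GRU(128, return_sequences=True)(x)
    x = layers.Dropout(0.8)(x)
    x = layers.BatchNormalization()(x)
    x = layers.GRU(128, return_sequences=True)(x)
    x = layers.Dropout(0.8)(x)
    x = layers.BatchNormalization()(x)
    out = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)
    return models.Model(inp, out)
```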

sachin-n-AI commented 5 years ago

@bml1g12 which works better, mycroft-precise or the model from Andrew Ng's course?