tadly opened this issue 4 years ago
Speed is more related to hardware resources. What are you running on, and how loaded is it?
I'm testing on my dev machine (Intel i7-5600U), so that really shouldn't be a bottleneck. The system is basically idle otherwise, too.
Does this mean my suspicion that the engine requires the trailing 1 second before considering it an "activation" is wrong?
I should add that I'm new to ML, so I'm still on a steep learning curve ':D
Edit: Does your training data also contain a 1-2 second silence at the end, or does it stop right after speech stops? Also, using your own model, would you say it activates as fast as e.g. Google or Alexa?
Should only be listening to 1.5 seconds, I think, to activate. My data cuts off usually pretty quickly. I haven't used alexa in a while, but activation seems to be near-google-speed in my experience. I run on an i7-4770 with 8gb doing a bunch of things (mycroft/wiki/tts/stt) and it's not noticeably slow.
Hm... My wake word is "Kiana". No "hey" in front or anything. That makes it quicker to say than "okay google" or "hey mycroft", etc.
So maybe that has an effect on the perceived speed, as "hey mycroft" is much closer to filling the 1.5-second window than "Kiana".
That said, after playing with the batch size a bit I managed to get a model that at least activates (although val_acc could be much better), and it activates much faster about 50% of the time. The other 50% it feels just like the initial model.
I'll do some more testing tomorrow and probably compare against the official mycroft model in terms of speed.
Do you have a clue why a dataset with 1 sec. of silence would perform so much better during training than the ones without? Seems kind of weird to me, but it would explain why the mycroft-precise wiki says you should have 1-2 sec at the end.
Dunno. A lot depends on your data. Train with more data or more steps to improve val_acc. I have something like 300 wake-word samples now, and about 4x that in not-wake-words (particularly things that triggered false activations). A good chunk of the noises in PCD are from that as well. If you're not using at least 50 wake-word samples and 3x that in not-wake-words, you will probably want to add more. Also use wake-word saving to build more samples, particularly of the not-wake-word variety.
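Incidentally, a quick way to sanity-check those counts is a couple of lines of Python. A minimal sketch, assuming the usual precise data layout (wake-word/ and not-wake-word/ plus a test/ mirror of both) under a placeholder data/ folder:

```python
from pathlib import Path

# Assumed layout (adjust "data" to wherever your samples live):
#   data/wake-word/*.wav        data/test/wake-word/*.wav
#   data/not-wake-word/*.wav    data/test/not-wake-word/*.wav
data = Path("data")
counts = {}
for sub in ("wake-word", "not-wake-word", "test/wake-word", "test/not-wake-word"):
    counts[sub] = len(list((data / sub).glob("*.wav")))
    print(f"{sub}: {counts[sub]} samples")

# Rule of thumb from this thread: >= 50 wake words, ~3x that in not-wake-words
ratio = counts["not-wake-word"] / max(counts["wake-word"], 1)
print(f"nww/ww ratio: {ratio:.1f} (aiming for ~3 or more)")
```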
Dammit, I knew I forgot to share important information...
I'm currently at:
False activations I'm not yet worried about, as I can fix those later on through the methods you outlined in your write-up.
Given my dataset I would expect val_acc to hit 1 every time. The fact that the set without the 1-second tail doesn't is what worries me (mostly because I don't understand why, and I would really like to).
Changing the sensitivity to a higher value (e.g. 0.8 rather than 0.2) seems to improve activation speed, which seems odd. That was only crude testing, though, which I'll investigate further tomorrow.
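For context, here's roughly how I'm setting that sensitivity at runtime. A sketch using the precise_runner package; the paths, values, and exact keyword defaults are from memory, so double-check them against the precise_runner version you have installed:

```python
from time import sleep

# pip install precise-runner; both paths below are placeholders for my setup.
from precise_runner import PreciseEngine, PreciseRunner

engine = PreciseEngine("precise-engine/precise-engine", "kiana.net")
runner = PreciseRunner(
    engine,
    sensitivity=0.8,   # the value I was comparing against 0.2
    trigger_level=3,   # chunks above threshold needed before an activation fires
    on_activation=lambda: print("wake word!"),
)
runner.start()
sleep(30)  # keep the script alive while listening
```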
Hmm. Yeah, I was more concerned with accuracy than anything; speed never was an issue.
So, I tried the "hey mycroft" model, and damn, that thing activates fast.
I really wish I knew how exactly this model was trained. I read somewhere that it was trained using a sample size of 90k (hope I remember that correctly), but this doesn't clarify whether that's 90k "hey mycroft" samples or 90k of "hey mycroft" and not-wake-words combined (an impressive number either way).
I don't know if the activation speed would improve the more data one adds, or if they used different training techniques.
I have a lot more reading/learning to do it seems
50k hey mycrofts was what I heard. There's a lot of other data, including nww's they have, but not all of it is good/usable?
Interesting.
I'd have one more question if you'd be so kind.
From what I read, Keras usually splits data into training and test data itself, while precise doesn't do that. Instead, I declare test data through the test directory. I assume this is to ensure model creation is reproducible.
The question now: how do you handle test/not-wake-words? Do you use psounds and/or other downloaded sound packs, or did you populate the not-wake-words all yourself?
The reason I'm asking: if I record stuff myself (be it a fan or whatever) which activates the model, I can create multiple recordings of the same source and put some of them in wake-word/not-wake-word and some in test/not-wake-word. With downloads like psounds, most of the recordings exist only once, and from my understanding you should not duplicate data between wake-word and test.
I randomly sample 10% and move it over. I use google voice commands, psounds, and a few thousand nww's I recorded/saved.
I end up running precise-test against the full wakeword dataset for fun to see where it's having issues as well. (I've run it against my nww's as well, which generally isn't as useful)
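The split step itself is nothing fancy; something like this, roughly (a sketch with placeholder paths, moving rather than copying so train and test never share a file):

```python
import random
import shutil
from pathlib import Path

random.seed(42)  # make the split reproducible

for category in ("wake-word", "not-wake-word"):
    src = Path("data") / category
    dst = Path("data") / "test" / category
    dst.mkdir(parents=True, exist_ok=True)

    files = sorted(src.glob("*.wav"))
    held_out = random.sample(files, k=max(1, len(files) // 10))  # ~10%
    for f in held_out:
        shutil.move(str(f), dst / f.name)
```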
I see. This means you have some samples inside "test" that might not exist in any other form in ww.
This has all been quite helpful. Thanks a lot. I am writing some form of documentation/howto while working on this and hope to share it with the community once it is in a good enough state. I'm trying to say: your time has not been wasted (I hope) :)
I still model words for others on occasion, so any new or better info is always welcome. But what I've gleaned is also through a bunch of trial and lots of error, so better to share that so others can get where they need to go sooner.
Oh damn that reminded me of one more question I wanted to ask. In your write-up you say:
I have only recently started recording with noisy backgrounds. Will update if I get better info.
Any news on that?
p.s. With all the additional testing I've done so far, my model is still far from the activation speed of the "hey mycroft" model. I can only suspect that the more data you feed in, the quicker the model can "decide". I'll probably do a test including Google's dataset again (I left that out because it's so specific in what it provides).
Doesn't hurt as far as I can tell. It's mostly the captured wake words and such; they tend to be fairly noisy, and I haven't noticed a decrease in activations.
While I started writing a rather chunky issue over on the mycroft-precise repo, I re-read your documentation and thought I'd better ask here/you directly.
What I'm currently struggling with is activation speed. I've been very careful with my training data and ensured every clip starts immediately with the wake word, followed by 1 second of silence (silence meaning a quiet room -> me not speaking).
Using my dataset combined with the following for not-wake-words, training reaches a val_acc of 1 in about 120 epochs (super quick). While it activates quite consistently, it does so rather slowly, as it requires the trailing 1 second to pass as well.

If I now duplicate the dataset and strip 500 ms from the end of every single wake-word clip, I'm suddenly unable to reach a val_acc higher than 0.5. Stripping 800-1000 ms has me sitting at val_acc 0. Training for more epochs (I tried up to 6000) did not help.

Is this to be expected? Is there a way to work around this? Any help would be much appreciated, and thanks for your current write-up. It has already helped a lot :)
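For reference, this is roughly how I produced the stripped variants; a sketch using pydub (pip install pydub), with placeholder paths:

```python
from pathlib import Path

from pydub import AudioSegment  # slices are in milliseconds

TRIM_MS = 500  # also tried 800-1000 ms

src = Path("data/wake-word")
dst = Path("data-trimmed/wake-word")
dst.mkdir(parents=True, exist_ok=True)

for wav in src.glob("*.wav"):
    clip = AudioSegment.from_wav(wav)
    if len(clip) > TRIM_MS:  # len() is the clip duration in ms
        clip[:-TRIM_MS].export(dst / wav.name, format="wav")
```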