atomic14 / voice-controlled-robot

A voice-controlled robot using the ESP32 and TensorFlow Lite
MIT License

ds-cnn and a few other tips #15

Open StuartIanNaylor opened 2 years ago

StuartIanNaylor commented 2 years ago

It's much better to boil those spectrograms down to MFCCs, as they are just as accurate with far fewer parameters.
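To make the MFCC step concrete, here is a minimal sketch, assuming you already have log-mel spectrogram frames. It applies an unnormalised DCT-II to each frame and keeps only the first few coefficients. The 40 mel bins and 13 coefficients are common illustrative choices, not values taken from this repo, and `mfcc_from_log_mel` is a hypothetical helper name.

```python
import math


def dct_ii(frame, n_coeffs):
    """Unnormalised DCT-II of one frame, keeping the first n_coeffs terms."""
    n = len(frame)
    return [
        sum(
            x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, x in enumerate(frame)
        )
        for k in range(n_coeffs)
    ]


def mfcc_from_log_mel(log_mel_frames, n_coeffs=13):
    """Turn log-mel frames (list of lists) into MFCC-style frames.

    Assumption: the input is already a log-scaled mel spectrogram,
    one list of mel-bin energies per time frame.
    """
    return [dct_ii(frame, n_coeffs) for frame in log_mel_frames]


if __name__ == "__main__":
    # Two dummy frames with 40 mel bins each (illustrative values only).
    frames = [[1.0] * 40, [0.5] * 40]
    mfcc = mfcc_from_log_mel(frames)
    print(len(frames[0]), "values per frame ->", len(mfcc[0]))
```

With these assumed sizes each frame shrinks from 40 values to 13, so the model sees a much smaller input and its first layers can be smaller too.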

42io has done some great work with DS-CNN, which is quite a bit more accurate than a plain CNN: https://github.com/42io/esp32_kws. You should check out his other repos as they are all great: https://github.com/42io/c_keyword_spotting https://github.com/42io/dataset https://github.com/42io/tflite_kws
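Part of why a DS-CNN suits a small device is the parameter count of a depthwise separable convolution compared with a standard one. The sketch below is just that arithmetic. The 3x3 kernel and 64 channels are assumed example sizes, not figures from 42io's models, and it says nothing about accuracy.

```python
def conv2d_params(kernel, c_in, c_out, bias=True):
    """Weights in a standard 2D convolution with a square kernel."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def ds_conv2d_params(kernel, c_in, c_out, bias=True):
    """Weights in a depthwise separable convolution.

    Depthwise: one kernel x kernel filter per input channel.
    Pointwise: a 1x1 convolution mixing c_in channels into c_out.
    """
    depthwise = kernel * kernel * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


if __name__ == "__main__":
    # Assumed example layer: 3x3 kernel, 64 channels in and out.
    standard = conv2d_params(3, 64, 64)
    separable = ds_conv2d_params(3, 64, 64)
    print("standard:", standard, "separable:", separable)
```

The saving grows with the number of output channels, which is where most of a plain CNN's weights sit.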

MLCommons has released a huge word dataset, https://mlcommons.org/en/multilingual-spoken-words/, which is great but not as great as it sounds, as much of what you're downloading is silence and the distribution and quality of words can be a bit hit and miss. But having a word dataset that isn't just a few choice labels is a great addition: it can still be a struggle to get KW samples, but for the 'unknown' label the quantity of different words is unparalleled.

https://github.com/StuartIanNaylor/Project-Ears I need to start adding some gear, but because of MLCommons I have been playing with KWS, as I now have enough words to create additional sound-alike labels. Trying to create something such as 'unknown' is a bit 'to infinity and beyond', whereas you can pick words that are phonetically similar and have a similar syllable count. That makes the training distinguish more, and the extra labels, especially with a softmax, also act as a secondary 'catch-all' and greatly reduce false positives. This week or next I will start doing something more concrete with Project-Ears.

So, say, for your control words you would also grab single-syllable words of approximately the same size from MLCommons that start with 'R' but don't have 'right' in the label, and leave them out of 'unknown'. That label then acts as a catch-all for near-misses of the KW 'right', which without it would probably be prone to false positives. Including it also forces the training routine to differentiate more, since there often aren't enough labels and the reported accuracy looks good only because the choice of classes is almost binary.
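As a rough sketch of that selection step: given a list of dataset words, pull out the ones that start with the same letter as the keyword and are about the same length, and keep them out of 'unknown'. Everything here is illustrative. `split_soundalikes` is a hypothetical helper, word length is only a crude stand-in for syllable count, and dropping words that contain the keyword from both pools is my assumption.

```python
def split_soundalikes(vocabulary, keyword, max_len_diff=1):
    """Split dataset words into a sound-alike pool and an 'unknown' pool.

    Assumptions (not from any dataset tooling):
    - same first letter plus similar length approximates "sounds alike";
    - words containing the keyword (e.g. 'bright' for 'right') are
      dropped from both pools;
    - the keyword itself is excluded, since it has its own label.
    """
    kw = keyword.lower()
    soundalike, unknown = [], []
    for word in vocabulary:
        w = word.lower()
        if kw in w:
            # Covers the keyword itself and words that contain it.
            continue
        similar_length = abs(len(w) - len(kw)) <= max_len_diff
        if w.startswith(kw[0]) and similar_length:
            soundalike.append(word)
        else:
            unknown.append(word)
    return {"soundalike": soundalike, "unknown": unknown}


if __name__ == "__main__":
    words = ["right", "rice", "ride", "rhinoceros", "left", "bright", "road"]
    pools = split_soundalikes(words, "right")
    print(pools["soundalike"])
    print(pools["unknown"])
```

Each control word would get its own sound-alike label this way, trained alongside the keyword labels and 'unknown'. A proper phonetic or syllable check would do a better job than the length test used here.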

Have a look at what 42io has done with a DS-CNN, as it's probably the best model you can squeeze onto an ESP32. The S3 is much more capable, but we are still without LSTM / GRU layers, which rules out some great KWS models and leaves the DS-CNN as the best that will run.