42io / esp32_kws


Questions no issue #1

Closed StuartIanNaylor closed 3 years ago

StuartIanNaylor commented 3 years ago

You have employed MFCC, which is great, as tests with https://github.com/StuartIanNaylor/simple_audio_tensorflow suggest MFCC alone adds 3-4% accuracy.
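For reference, the classic MFCC front end both repos rely on can be sketched in plain numpy: frame, window, power spectrum, mel filterbank, log, DCT. Parameter values here (40 mels, 13 coefficients, 25 ms frames at 16 kHz) are common illustrative defaults, not the exact settings of either repo:

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_mels=40, n_ceps=13):
    """Classic MFCC pipeline: frame -> window -> |FFT|^2 -> mel -> log -> DCT."""
    # Slice the signal into overlapping Hamming-windowed frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Triangular mel filterbank, equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)

    # DCT-II decorrelates the log-mel energies; keep the first n_ceps
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return log_mel @ dct.T
```

One second of 16 kHz audio yields a 98x13 feature matrix, which is the kind of compact input that makes a small CNN feasible on a microcontroller.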

If I take a look at https://github.com/ARM-software/ML-KWS-for-MCU, or the bewildering range of network architectures generally, what is a DCNN? I should have looked at the code, but thought I would ask: DFT->CNN? I will have to read https://arxiv.org/pdf/2005.06720.pdf several times before it sinks in.

How does it compare to a CRNN with noise (SNR)? Noise is problematic for all architectures, but from what I have read, CRNNs seem to cope quite well with higher noise levels.

As said, I will try reading that arxiv.org paper again, but a question I keep asking about the ESP32: do you think it would be possible to run 2x instances of the KWS? (Simple unidirectional mics, using best confidence (softmax) to forward an audio stream from the best mic.) Even VAD can be server-side, with just MQTT to subscribe to an 'end of sentence / stop streaming' signal for the ASR sentence after the KW. With MFCC, MQTT & KWS, is there any chance an ESP32 could run 2 instances on separate mics?

Also, I always get confused about the Google Command set, as there seem to be many bad samples: simple bad padding, trims or null audio. Is it just a datum, or is the bad content also a reflection of how an architecture can cope? Why has Google never trimmed it out, or is it just there to confuse the likes of me?

mazko commented 3 years ago

If I take a look at https://github.com/ARM-software/ML-KWS-for-MCU or generally the bewildering work of network architecture what is a DCNN?

DCNN -> Depthwise separable convolutional neural network (DS-CNN)
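The reason DS-CNNs are the go-to for microcontrollers can be sketched with a quick parameter count: the k x k spatial filtering and the cross-channel mixing are factored into two cheap steps. The layer shape below is illustrative, not taken from esp32_kws:

```python
def standard_conv_params(k, c_in, c_out):
    # One k x k kernel spanning all c_in channels, per output filter (bias ignored)
    return k * k * c_in * c_out

def ds_conv_params(k, c_in, c_out):
    # Depthwise: one k x k kernel per input channel
    # Pointwise: a 1x1 conv to mix channels
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 64
print(standard_conv_params(k, c_in, c_out))  # 36864
print(ds_conv_params(k, c_in, c_out))        # 4672
```

Roughly an 8x reduction in weights (and multiply-accumulates) for this shape, which is what makes the architecture fit in an MCU's flash and compute budget.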

How does it compare to a CRNN with noise

Currently TensorFlow Micro doesn't support RNNs of any kind, so you can't deploy such a network on a microcontroller.

I keep asking as ESP32 is that do you think it would be possible to run 2x instance of the KWS?

The ESP32 has 2 cores at 240 MHz each. In my non-streaming example there are two neural network instances running simultaneously, one on each ESP32 core.

Also I always get confused about the Google Command set as there seems many bad samples of just simple bad padding, trims or null audio.

I confirm that the Google dataset has many bad samples; I delete them during audio preprocessing.

StuartIanNaylor commented 3 years ago

Many thanks @mazko.

Now that you confirm it, I don't know why I didn't just presume DS-CNN, but great.

Wow, two neural network instances running simultaneously, one on each ESP32 core. I have been wondering if this is possible without getting a WiFi panic on core 0. It's probably obvious my ESP32 knowledge is extremely basic, but I have been looking into this, as it has annoyed me for some while that KWS isn't like any other HMI (Human Machine Interface): interoperable and extensible without being system-specific.

I don't know if the mention of WiFi on core 0 kills that idea, as I wondered whether streaming the mic only after a KW hit means WiFi could be idle during KWS. The low cost of the ESP32 with a unidirectional mic allows a distributed wide array of mics rather than a single beamformer. Because of placement and the polar pattern of a unidirectional mic, a distributed wide array can cope far better with noise, as one mic is likely to be voice==near / noise==far. It's basic positioning in rooms: opposite corners, all corners, or even bigger distributed arrays.

It would be sort of cool if an ESP32 could run 2 instances with 2x mics, maybe at 180, 135 or 90 degrees, as choosing the channel by max 'guess' is an effective low-cost form of beamforming. If WiFi and a panic on core 0 are a problem, oh well, it was worth a try, as the DS-CNN also looks perfect to emulate on a Pi, where I may actually run 2x instances, but it's the distributed nature and low cost of the ESP32 that is of importance.
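The 'best confidence' channel selection described above amounts to an argmax over per-mic keyword softmax scores. A minimal sketch, with a made-up 3-class layout (silence, unknown, keyword) and made-up logits:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def pick_best_mic(channel_logits, kw_index):
    """Given each mic's raw KWS logits, return (mic index, confidence)
    for the mic most confident in the keyword class."""
    confs = [softmax(l)[kw_index] for l in channel_logits]
    best = int(np.argmax(confs))
    return best, confs[best]

mic_a = np.array([0.1, 0.3, 2.5])   # hypothetical: mic near the speaker
mic_b = np.array([0.4, 1.0, 0.8])   # hypothetical: mic mostly hearing noise
best, conf = pick_best_mic([mic_a, mic_b], kw_index=2)
print(best)  # 0
```

The winning mic's audio stream is then the one forwarded to the ASR, which is the low-cost stand-in for beamforming described above.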

I noticed you have an LED controller. Going back to interoperable & extensible KWS: these are often combined into the KWS system, but I think they are as separate as keyboard and mouse. I am working on the idea of creating the lowest common denominators of KWS functionality: just a networked 'pixel ring' and a KW-triggered mic or 'softmax guess' broadcaster.

I have been thinking of a really simple intermediary server that can work with any ASR or home automation: an MQTT relay plus audio capture that can organise distributed arrays into zones, use the 'best' KW hit, forward it to the ASR/HA, and provide a session for the mic in use and associated hardware such as 'pixel ring', 'display', 'audio'. It's a trade-off of not enforcing any further needs on a KWS by the system it communicates with, and that network abstraction allows concentration on the lowest common denominator of system functionality with no additional demands on the KWS. Having a simple intermediary server means further audio processing, such as VAD for the end of the ASR sentence, can be server-side without inducing more load. It doesn't even enforce the use of an intermediary server, but I am really interested in making something interoperable. There are some great open-source systems (Mycroft, Rhasspy, Linto, Haass), but it's nuts to create system-specific KWS when they should be just another interoperable HMI that works with all of them and isn't tied to the obsolescence of the system in use.

Ignore the shocking Python code, but after a first model creation I used the softmax to delete samples scoring below <.01. Run again, as that gets the really bad ones and the model also improves; then run a last time at <.03/.04. What makes a sample bad to TensorFlow is at times a bit of a mystery, but some just are. https://github.com/StuartIanNaylor/simple_audio_tensorflow/blob/main/simple_audio_prune.py
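For anyone following along, that pruning pass amounts to something like the sketch below; `predict` and the scores are stand-ins for illustration, not the actual simple_audio_prune.py API:

```python
def prune_low_confidence(samples, predict, threshold=0.01):
    """samples: list of (path, label_index) pairs.
    predict(path) -> softmax vector from the trained model.
    Returns the paths whose score for their own label falls
    below the threshold, i.e. the candidates to delete."""
    doomed = []
    for path, label in samples:
        probs = predict(path)
        if probs[label] < threshold:
            doomed.append(path)
    return doomed

# Toy stand-in for a trained model's softmax outputs
fake_scores = {"good.wav": [0.05, 0.9], "bad.wav": [0.999, 0.001]}
doomed = prune_low_confidence(
    [("good.wav", 1), ("bad.wav", 1)],
    predict=lambda p: fake_scores[p],
)
print(doomed)  # ['bad.wav']
```

As described above, the pass is run iteratively: prune at a low threshold, retrain (the model improves once the worst clips are gone), then prune again at a slightly higher threshold.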

PS: OMG, just don't build that model with the CPU version of TensorFlow on an i5-3570 :)

https://www.tensorflow.org/lite/convert/rnn

"Currently there is support only for converting stateless Keras LSTM (default behavior in Keras). Stateful Keras LSTM conversion is future work."

Doesn't matter, as that DS-CNN seems to work great; it was pure curiosity whether a CRNN would be less load and what accuracy difference it would give.

StuartIanNaylor commented 3 years ago

PS: if you want a 'hey marvin' dataset:

https://drive.google.com/open?id=1LFa2M_AZxoXH-PA3kTiFjamEWHBHIdaA