atomic14 / diy-alexa

DIY Alexa
MIT License

mfcc improve accuracy several % over spectrogram #10

Closed StuartIanNaylor closed 3 years ago

StuartIanNaylor commented 3 years ago

https://github.com/StuartIanNaylor/simple_audio_tensorflow

simple_audio.py is the mini command set and much quicker just to play with; simple_audio.py is the full command set.

Both the above are spectrograms

simple_audio_mfcc_frame_length1024_frame_step512.py is just MFCC hacked into the same script. You do get a decent accuracy improvement from MFCC alone over a spectrogram.
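For anyone wanting to see what the MFCC step adds over a raw spectrogram, here is a minimal numpy sketch of the classic pipeline (STFT → mel filterbank → log → DCT-II). It illustrates the technique only, it is not the code from the linked script; the 1024/512 frame settings are taken from the script's filename, everything else is a generic textbook choice.

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=1024, frame_step=512,
         n_mels=40, n_mfcc=13):
    """Toy MFCC: STFT -> mel filterbank -> log -> DCT-II."""
    # Frame the signal with a Hann window
    n_frames = 1 + (len(signal) - frame_len) // frame_step
    idx = np.arange(frame_len)[None, :] + frame_step * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)

    # Power spectrogram (this is roughly where the spectrogram models stop)
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular mel filterbank
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((frame_len + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, frame_len // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    log_mel = np.log(spec @ fb.T + 1e-10)

    # DCT-II across mel bands, keeping the first n_mfcc coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T  # shape: (n_frames, n_mfcc)

coeffs = mfcc(np.random.randn(16000))  # one second at 16 kHz
print(coeffs.shape)  # (30, 13) with these frame settings
```

The DCT decorrelates the mel bands, which is often credited for the few-percent accuracy gain over a raw spectrogram on small models.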

simple_audio_prune.py just checks each wav against the model and deletes it if it falls under a threshold (start at 0.1 and work up, as the model will change on each run while the worst samples are removed). Think I will post a CSV or JSON of the complete pruned full command set as it may take some time :)
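To make the pruning idea concrete, a minimal sketch. The `score_fn` stub and file names here are illustrative stand-ins, not the actual simple_audio_prune.py code; a real run would call the trained model's predict on each wav.

```python
from pathlib import Path

def prune_dataset(wav_paths, score_fn, threshold=0.1, delete=False):
    """Keep wavs the model is confident about; flag (or delete) the rest.

    score_fn(path) -> model confidence for the file's own label.
    Start with a low threshold (e.g. 0.1) and raise it over repeated
    runs, since the model shifts as the worst samples are removed.
    """
    kept, pruned = [], []
    for p in wav_paths:
        (kept if score_fn(p) >= threshold else pruned).append(p)
    if delete:
        for p in pruned:
            Path(p).unlink()
    return kept, pruned

# Stub scorer standing in for a real model.predict() call
scores = {"good.wav": 0.9, "ok.wav": 0.4, "bad.wav": 0.05}
kept, pruned = prune_dataset(scores, scores.get, threshold=0.1)
```

Keeping `delete=False` and writing `pruned` out to a CSV first (as suggested above) makes each pass reviewable before anything is destroyed.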

StuartIanNaylor commented 3 years ago

https://drive.google.com/file/d/1LFa2M_AZxoXH-PA3kTiFjamEWHBHIdaA/view?usp=sharing

PS a 'Hey Marvin' dataset

cgreening commented 3 years ago

This is brilliant!


StuartIanNaylor commented 3 years ago

Did you check out https://github.com/42io/esp32_kws? That is a DS-CNN and supposedly cutting edge in terms of accuracy. I have some scripts in the repo below for the dataset. Apols about the code standard, they are just hacks to produce the above. I created two CSVs and sorted them by average wav frequency to try to get some sort of matching, and halfway through swapped from bash to python and pysox. https://github.com/StuartIanNaylor/crispy-succotash

https://github.com/42io/esp32_kws/blob/master/mfcc-nn-streaming/components/kws/tf/dcnn.ipynb is a Colab notebook. Prob scared him with enthusiasm :) https://github.com/42io/esp32_kws/issues/1#issuecomment-761825309

But the ideas on interoperable & extensible KWS are extremely simple, as they should be, and something really pressing, unless any KWS is going to be tied to the obsolescence of its system, and it basically doesn't need to be. It's so simple you practically had it on a first attempt, but with an intermediary server you can do further processing such as VAD if needed.

https://commonvoice.mozilla.org/en/datasets "Download the Single Word Target Segment" That contains "Hey"

The timing accuracy of DeepSpeech for extracting words is pretty poor, so I am going to have a look at Kaldi.

If you ever get an urge to update ESP32 Alexa, or maybe a side branch of an ESP32 universal interoperable KWS, then please do. Tips: try using a unidirectional mic with examples of noise. The 42io guy seems to think a second instance could run on core 0. I am not so sure it's easy, but KWS and streaming are 2 completely different states.

StuartIanNaylor commented 3 years ago

https://drive.google.com/open?id=1-kWxcVYr1K9ube4MBKavGFO1CFSDAWVG

A hey-2 and a marvin-stop, dunno how that would go on?

cgreening commented 3 years ago

I think you could make that work.

I need to dig into his code and see how it's working. One thought that does occur to me with trying to run two modules - one on each core - is the amount of RAM available. I think you'd definitely need to look at using a wrover with the extra PSRAM.

Cheers Chris


StuartIanNaylor commented 3 years ago

I am hoping I can twist your arm Chris. Dunno about 2 instances, one on each core, as prob like you I have read how with wifi it can easily cause a panic. It was just the logic that inference & streaming never run at the same time, so can idle wifi and inference run at the same time?

Also yeah, to have 2 mics they need to be unidirectional, which means analogue. I have x25 coming from China as I could find less with good sensitivity (do you want 2x freebies? just send an address). They were cheap, just a pain to source. The only mems I know is https://invensense.tdk.com/products/analog/ics-40800/ and I am not aware of a unidirectional I2S; it's possible, just not aware of one. The ADC on the ESP32 is a bit pants, as a technical audio term :) so yeah I was thinking Ai Thinker A1S. The codec with a Max9814 on the line-ins should be extremely good, as the internal ADC seems inaccurate and prone to noise. If you can do 2x instances with 2x unidirectional mics at 180, 135 or 90 degrees, you can select the best confidence hit and voila, budget beamforming. If not, a single unidirectional mic by simple positioning can have the same effect and help much with noise and echo.
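The "select the best confidence hit" idea is simple to express; a toy sketch, where the mic names and threshold are purely illustrative (no real firmware is being quoted):

```python
def budget_beamform(confidences, threshold=0.9):
    """'Budget beamforming': run one KWS instance per unidirectional mic
    and keep whichever reports the higher confidence for the keyword.
    confidences: dict mapping a mic label to that instance's score."""
    mic, conf = max(confidences.items(), key=lambda kv: kv[1])
    return conf >= threshold, mic, conf

# e.g. two mics angled apart; the one facing the speaker scores higher
hit, mic, conf = budget_beamform({"mic_left": 0.55, "mic_right": 0.93})
```

Because only the winning score is compared against the threshold, a mic facing away from the speaker (low SNR, low confidence) is simply ignored rather than dragging down the decision.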

I just got 2x these https://www.aliexpress.com/item/32811323132.html?spm=a2g0s.9042311.0.0.46a34c4dJbwWUl https://www.aliexpress.com/item/32919183198.html?spm=a2g0s.9042311.0.0.46a34c4dJbwWUl

As I just can not find a mini A1S audio dev kit anywhere, only the AudioDevKit, which has the ADC and audio out with ADF support all in.

I presume https://arxiv.org/abs/2005.06720 might be the guy above, but if running it, cut down the epoch patience to 10 or 20, as it will finish approx on the 30-50 mark and not run forever to squeeze 0.0001 accuracy out of the best model hit.

https://github.com/google-research/google-research/tree/master/kws_streaming

The CRNN would prob be best, but I presume the ESP32 rendition of tensorflow-lite doesn't like RNNs, so we end up with a heavier but working DS-CNN.

StuartIanNaylor commented 3 years ago

Dunno if you have had the time to look at the code, but interested in what you think. I wanted to ask whether that is 2 instances running on both cores, or is it a slight cheat where the singular KWS is split into 2x tasks and the load is shared across both cores?

If it's the 2nd, then looks like I have lucked out, as then core 0 can not be cleared if wifi is a problem with core 0 panics.

cgreening commented 3 years ago

I think he's not pinning to cores, so any tasks will be scheduled on whatever core is available. But I've only had a quick look.

StuartIanNaylor commented 3 years ago

https://github.com/42io/esp32_kws/blob/master/mfcc-nn-streaming/components/kws/kws.c

He has this

static void kws_task(void *parameters)
{
  EventGroupHandle_t core0, core1;

  assert(core0 = xEventGroupCreate());
  assert(core1 = xEventGroupCreate());
  assert(xTaskCreatePinnedToCore(&fe_task, "worker_0", 3072, core0, 1, NULL, 0) == pdPASS);
  assert(xTaskCreatePinnedToCore(&fe_task, "worker_1", 3072, core1, 1, NULL, 1) == pdPASS);

  for(;;)
  {
    xEventGroupSetBits(core0, BIT0);
    xEventGroupWaitAllBitsAndClear(core0, BIT1);
    xEventGroupSetBits(core1, BIT0);
    xEventGroupWaitAllBitsAndClear(core1, BIT1);
  }
  vTaskDelete(NULL);
}

static void fe_task(void *parameters)
{
  const EventGroupHandle_t event = parameters;
  void *buf = malloc(KWS_RAW_RING_SZ);
  assert(buf);

  for(;;)
  {
    xEventGroupWaitAllBitsAndClear(event, BIT0);
    xQueueReceive(queue, buf, portMAX_DELAY);
    xEventGroupSetBits(event, BIT1);

    csf_float (*feat)[KWS_MFCC_FRAME_LEN] = (csf_float(*)[]) kws_fe_16b_16k_mono(buf);

    xEventGroupWaitAllBitsAndClear(event, BIT0);
    for(int i = 1; i < 6; i++) {
      int word = guess_16b_16k_mono(guess, feat[i]);
      on_detected(word);
    }
    xEventGroupSetBits(event, BIT1);

    free(feat);
  }
  vTaskDelete(NULL);
}

Which has me worried that he needed both cores? I wonder if the latency of the model was a problem, and it's not load but latency, and he is alternating chunks between 2 instances so that there is more headroom on the 20 ms chunks?
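The alternating idea can be sketched abstractly: if 20 ms chunks are dispatched round-robin to two workers, each worker only has to finish before its own next turn, roughly doubling the per-inference deadline. A toy Python illustration of that scheduling (not the actual firmware logic above):

```python
def assign_chunks(n_chunks, n_workers=2):
    """Round-robin audio chunks across workers.

    With two workers alternating 20 ms chunks, each worker's deadline
    stretches to ~2 chunk periods (40 ms) instead of one, which is the
    suspected headroom motive for pinning one fe_task per core.
    """
    schedule = {w: [] for w in range(n_workers)}
    for i in range(n_chunks):
        schedule[i % n_workers].append(i)
    return schedule

sched = assign_chunks(6)
# sched == {0: [0, 2, 4], 1: [1, 3, 5]}
```

So even if a single inference takes, say, 30 ms, the ping-pong keeps up with a 20 ms chunk cadence, which would fit the latency-not-load reading.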

StuartIanNaylor commented 3 years ago

There is always https://github.com/UT2UH/ML-KWS-for-ESP32 as someone has apparently ported the CMSIS arm libs to ESP32.

It's basically the ML-KWS-for-ARM microcontrollers repo but on ESP32. Again DS-CNN is the top performer, but I always had my eye on the CRNN as the ops are much less. You can run a CRNN from https://github.com/google-research/google-research/blob/master/kws_streaming/experiments/kws_experiments_paper_12_labels.md and yeah training is much lighter, even though they hugely over-optimised the training steps. I did run it to the end: 97.4791677047809% accuracy, which with the dross in that dataset is truly huge.

Also with noise I am consistently confused about how to handle this, as firstly you need to normalise your KW and noise samples so they are equal. Then make a tiered SNR of your KW, mixing in noise at 5, 10 & 15 dB lower than the KW so the KW is still the predominant image. Don't put noise in !KW, as the SNR ratio should result in low confidence, but you can test that by feeding noise files to the trained model. !KW should be just clear phonetics that you can later retrain with noise and with signals that seem to cause false positives.

Or is it the way you did it? My only question is, if you mix noise into KW files and also have noise in !KW, then are those KW likely to have lower confidence and higher cross entropy?

https://www.ebay.co.uk/itm/324462527996

Also I thought I had padded and trimmed hey-marvin, but some are just off 1 sec, so depending on your process you may want to trim and pad them with sox.
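Fixing clip lengths can also be done in Python when loading batches; a small numpy sketch of sox-style pad/trim to exactly one second, assuming 16 kHz mono samples already decoded to an array:

```python
import numpy as np

def fix_length(samples, sr=16000, seconds=1.0):
    """Trim or zero-pad a clip to exactly `seconds` long."""
    target = int(sr * seconds)
    if len(samples) >= target:
        return samples[:target]          # trim the tail
    return np.pad(samples, (0, target - len(samples)))  # pad with silence

short = fix_length(np.ones(15500))   # padded up to 16000 samples
long_ = fix_length(np.ones(16900))   # trimmed down to 16000 samples
```

Doing this at load time means a dataset with a few slightly-off clips (like the hey-marvin set mentioned above) still produces uniform 1-second training examples.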

cgreening commented 3 years ago

Hmm, I missed that pinning to core somehow. Interesting. I think since he is streaming and only processing 20ms at a time it should work quite well and would give the other tasks time to run. With mine you end up with 1 second of audio to process all in one go (though I guess you could chunk up the processing into multiple tasks somehow).

I do like the streaming approach - it feels a lot more efficient and should decrease latency I think.

With the noise issue - that is a very good question. I am not sure either - one of the things I was not sure about is how to train the network to recognise the keyword when there is background noise.

If you train with very clean keywords as positive and the noisy backgrounds as negatives then will the keyword detection just reject anything that has background noise?

StuartIanNaylor commented 3 years ago

In terms of noise, I think so, and vice versa: if you add KW with noise, do you want noise in your !KW? My take is that noise should be added to KW files as it adds to and balances the collection. You also need to normalise so you know what SNR you are adding noise at. So take your KW, duplicate it, and mix noise in @ 5, 10, 15, 20 dB lower, split evenly 25/25/25/25%. Leave !KW clean, as who cares if it is not recognised due to noise; it's clean so it is differentiated. So before mixing, normalise the noise files @ 5, 10, 15, 20 dB below the KW and split evenly 25/25/25/25%. I think model making is an art in itself, and if you can, run through and grade your samples and add the highest noise levels to your best KW confidence hits. All sounds a bit complex, but after a couple of training runs it could all just be automated, and it's pick your KW and go. Prob with a last run where you weed out the dross on your own model inference run.
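The normalise-then-mix step can be sketched like this: a numpy illustration of mixing noise into a keyword clip at fixed SNR tiers. The 5/10/15/20 dB tiers follow the comment above; the equal-length clips and random data are illustrative assumptions.

```python
import numpy as np

def mix_at_snr(kw, noise, snr_db):
    """Scale `noise` so the keyword stays `snr_db` decibels above it,
    then mix. Both clips are assumed to be the same length."""
    kw_rms = np.sqrt(np.mean(kw ** 2))
    noise_rms = np.sqrt(np.mean(noise ** 2))
    target_noise_rms = kw_rms / (10 ** (snr_db / 20))
    return kw + noise * (target_noise_rms / noise_rms)

rng = np.random.default_rng(0)
kw = rng.standard_normal(16000)     # stand-in for a 1 s keyword clip
noise = rng.standard_normal(16000)  # stand-in for a noise clip

# One duplicated copy of the keyword per SNR tier (the 25% each split)
tiers = {snr: mix_at_snr(kw, noise, snr) for snr in (5, 10, 15, 20)}
```

Because the noise is scaled relative to the keyword's own RMS, the tiers stay meaningful even when the source clips were recorded at different levels, which is the point of normalising first.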

I have seen quite a few people recommend playing random audio, capturing the causes of false positives and adding them to !KW, which to be honest I think is fubar: the false positive may be spurious, and just adding anything and everything to !KW is going to make the model more gaussian in terms of accuracy. In fact I think that method will slowly kill a model, as you will garner more false positives...

Google are going crazy with streaming KWS https://github.com/google-research/google-research/tree/master/kws_streaming and yeah it helps a lot with latency. https://arxiv.org/abs/2005.06720 "In Table 2 we observe that the most effective and accurate streaming models are SVDF, CRNN and GRU."

Hence why I have been trying to find an example, apart from the Google code above, of a CRNN. GRU is a close second; SVDF I haven't really seen, but it is the lightest of them all, and again an example is in the Google code above, but everything is so wrapped up in their framework it's hard for me to work out how to just extract the model code. But again CRNN & GRU are RNNs and I am not sure if tensorflow for microcontrollers fully supports them.

But if you run through "Hey-Marvin", the 3-phoneme KW should give you a big accuracy/uniqueness boost. MFCC prob would add another couple of %, but yeah a streaming model of one of the above 3 would be nice. I am actually more interested in doing this on a Rasp-Pi, but I keep working up from the ESP32 as I want a model for all platforms so tools can be shared.

StuartIanNaylor commented 3 years ago

The 2x Ai Thinker A1S turned up; the £0.20 breakout boards were for the standard ESP32, so it's micro-surgery soldering fly leads to the back, as I can not find a breakout board anywhere, and they are so cool without the rest of the bumf.

I have some 3.3v LDO reg boards and just need to work out how to use serial rather than USB, as the format is so cool and small and cheap compared to the relatively pointless bloat on the AudioDevKit board.

https://imgur.com/Ctd5FsB

I am going to use line-in, and I have become a big fan of this Max9814 board, as in experiments with the Pi, giving it its own LDO seems to improve SNR greatly. It would be tempting to use the 3.3v from the Max9814, but I am going to run 2x separate regs, as apart from wires the regs are extremely cheap, and judging from my Pi experiments it will return better SNR.

https://www.ebay.co.uk/itm/MAX9814-Electret-Microphone-Amplifier-AGC-Function-Module-Board-DC-For-Arduino/152293733901

cgreening commented 3 years ago

You could probably get some basic breakout boards made by JLCPCB for very cheap (their bare PCBs are ridiculously cheap - shipping always works out more than the actual boards).

They are also quite good on the SMT assembly side, so adding an LDO and a few LEDs is pretty affordable if you stick to their common parts. You would need to solder on the A1S module yourself as they don't have that in their parts catalogue.

If it's a basic breakout board you need without the USB connector I could design one pretty quickly and you could get it made up.

Serial flashing is pretty straightforward - I use this for most of my custom PCBs as JLCPCB don't currently support USB sockets (though that is supposed to be coming soon).

If you want to go a bit further then someone like PCBWAY will assemble pretty much anything - and they do support USB sockets so you could get a complete dev board made up.

You'll need one of those little USB-UART boards - just connect the RX to the TXD0 and the TX to the RXD0 pins (and of course GND).

Just connect the IO0 pin to Ground when you power up the board and it will go into programming mode. You can then flash it from Arduino IDE or Platform.io in the usual way.

https://www.makerfabs.com/desfile/files/ESP32-A1S%20Product%20Specification.pdf


StuartIanNaylor commented 3 years ago

Yeah that is a good idea. The jury is still out on what model, but the A1S being a Wrover with audio for £4 means it's extremely restrictive having only bare modules or the AudioDevKit for such a great bit of kit. Much of the audio dev kit is redundant to me; even the LDO & LEDs are prob not needed, just a jumper for IO0/Gnd and header pins for the rest.

It's actually interesting, as the circuitry for 2x 680-ohm electrets on the ADC mic inputs would be cool, as that would leave the line-ins spare. Wonder what that ADC sounds like in comparison to the ESP32 ADC :) I know everybody focuses on mems, but unidirectional mics have some big advantages unless you have DSP beamforming with omnidirectionals.

I will hold fire for now, but those boards sound an excellent idea. I only really need x2, but maybe if I can decide on a model I could order a qty to make it worthwhile.

The USB is of no importance to me at all.

Have you ever seen the code and an app (Android/iOS) to connect via bluetooth and set up the wifi SSID & pwd in non-volatile storage on the ESP32?

cgreening commented 3 years ago

There's a robotics company up here that have a robot with a Bluetooth app that you can use to configure it. So definitely possible. You do need to burn the firmware on the device first though.

I think they actually get the factory who manufacture their boards to supply them pre-flashed with their firmware.

They can also do firmware updates over Bluetooth - very slow, but it works.


cgreening commented 3 years ago

Just to give you an idea on PCB prices - I got 5 boards made recently:

5 x 4 layer boards - £5.89. My bill of materials for the SMT assembly was pretty high as I've got a WROVER and a bunch of other non-standard components on my board - so my total for all the parts + assembly was £45.34.

So for my 5 boards fully assembled just over £10 each.

If you just need a few capacitors/resistors and an LDO then you can get the SMT assembly done for a few pounds per board.

Shipping was the killer for me on the order at £40.75 - but I paid extra for express delivery and to pre-pay any import duty. So it worked out about £80 in total. Still, £16 for a custom dev board with a WROVER, a DAC, some opamps and a bunch of passive components is pretty amazing.


StuartIanNaylor commented 3 years ago

Yeah it sort of doesn't make sense, as the A1S dev kit is £10 ready-soldered with an A1S onboard, even if I don't like the size and the redundant components onboard.

I would really need to figure out the AC101 http://www.x-powers.com/en.php/Info/product_detail/article_id/40 with the additional audio circuitry, and get the impedance match perfect with what must be easily available electrets.

Maybe might be worth getting some blank carriers first? https://www.elecrow.com/pcb-manufacturing.html

cgreening commented 3 years ago

Makes sense. Just having a board that brings all the pins out would be handy I guess.


StuartIanNaylor commented 3 years ago

Just for now, as what you are saying is exactly what is needed, and with everything else I keep twisting your arm over, I might as well add to the list: probably the AudioDevKit 'design' with the audio and serial programmer but drop the rest.

By any chance did you get an AudioDevKit or A1S module? I have been wondering if the noise you got with the Max9814 is a noisy ADC, like it is on the RockPiS I had high hopes for. On the Pi with a MAX9814 on its own LDO, I seem to get far better results with a USB soundcard than with the onboard ADC.

The AC101 gives you a 24-bit ADC, and the ADF compatibility is also a plus, but I guess for a test with a Wroom or Wrover a PCM1808 is a couple-of-quid ebay purchase. I would be interested how you find unidirectional vs omnidirectional when it comes to noise, and also whether it is the onboard ADC that is noisy, as I am thinking it could well be.

It's all a bit of a catch-22 at the moment, but Linto are rehashing the HMG with a new version that is likely to be complete soon. The catch-22 on that is whether the chosen model exports to tensorflow-for-microcontrollers, and I think the easiest way is just to suck-it-and-see. It's a great tool that is likely to be more comfortable for a noob than a Colab or Jupyter notebook, but the latter are also good for automating training.

Streaming models do seem to be a good idea, as latency grows along the audio chain and any reduction helps. So model-wise it's either GRU (HMG currently), CRNN (HMG do have plans), or the unknown, apart from it looking really lite, of SVDF, as the DS-CNN is running on both cores because a single core @ 240MHz is not enough?

StuartIanNaylor commented 3 years ago

https://www.hobby-hour.com/electronics/computer_microphone.php is a pretty good reference.

Also the mic input is differential, but as far as I know that just means 2x bias resistors, one either side of the electret, each of half the rated impedance, then to gnd.
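The half-the-rated-impedance rule of thumb is simple arithmetic; a tiny illustrative sketch (the 2.2 kOhm figure is a hypothetical electret rating, not from any AC101 datasheet, and this is the rule as stated above, not a verified design):

```python
def bias_resistor_per_leg(rated_impedance_ohms):
    """Rule of thumb for a differential mic input as described above:
    each leg of the differential pair gets a bias resistor of half
    the electret's rated impedance. Purely illustrative arithmetic."""
    return rated_impedance_ohms / 2

r_each = bias_resistor_per_leg(2200)  # hypothetical 2.2 kOhm electret
print(r_each)  # 1100.0 ohms on each leg
```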

https://www.programmersought.com/article/83463761714/

StuartIanNaylor commented 3 years ago

PS if you have the time maybe see if this model will run on the ESP32

https://github.com/tranHieuDev23/TC-ResNet It's a Keras update of https://github.com/hyperconnect/TC-ResNet

Seems another guy has a similar todo list, which is a good ref https://github.com/weimingtom/wmt_ai_study as you are in his list with many others :) https://github.com/atomic14/diy-alexa