=> Acoustic feature gallery (2D images)
Acoustic features acquired by my Acoustic Feature Camera:
Inference using X-CUBE-AI on STM32L476RG:
Inference using Keras/TensorFlow on PC instead of X-CUBE-AI on STM32L476RG:
I have discovered that low-end edge AI works very well as long as the conditions described in this README are satisfied. If simple classification is all you need, consider low-cost AI on an MCU before reaching for an MPU/FPGA/GPU!
I find that the "life log" use case (dataset: "my home") in this project works very well, but the problem is that it takes a lot of effort -- three hours to acquire a dataset in each room of my house. This is my hobby project, so I do not need to worry about whether it can make a profit.
The "keyword detection" use case is also not bad. It can be used for voice commands to control home appliances, such as "turn on!" or "turn off!".
"Acoustic scene classification" is the hardest use case due to disturbance from surrounding noise. I do not think it is useful in the real world.
I have tested all of the use cases above and confirmed that my device works well.
Note: the neural network is so small that it is not a general-purpose tool -- it only handles a very limited number of classes well.
ARM Cortex-M4(STM32L476RG)
***** pre-processing ***** ***** inference *****
................................................................
: Filters for feature extraction Inference on CNN :
: .................. :
Sound/voice ))) [MEMS mic]--PDM-->[DFSDM]--+->[]->[]->[]->[]---+----Features--->: code generated : :
: | | : by X-CUBE-AI : :
: +------------+ | .................. :
: +-----------|------+ :
: | | :
: V V :
:..[USART]......[DAC]..........................................:
| |
| | *** monitoring raw sound ***
| +---> [Analog filter] --> head phone
(features)
|
| *** learning ***
+--(dataset)--> [oscilloscope.py/Win10 or RasPi3] Keras/TensorFlow
|
| *** inference ***
+--(dataset)--> [oscilloscope.py/Win10 or RasPi3] Keras/TensorFlow
Platform:
I developed the following components:
I acquired data on my own by using the components above, and it took a lot of time and effort.
To run a neural network on an MCU (STM32 in this project), it is necessary to make the network small enough to fit into the RAM and the flash memory:
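As a rough sanity check, the dominant flash cost is the weights. A back-of-envelope sizing helper (the 4-bytes-per-weight figure assumes uncompressed float32; X-CUBE-AI's weight compression shrinks this further):

```python
def model_footprint_kb(n_params, bytes_per_weight=4):
    """Rough flash estimate for the weights alone, assuming
    uncompressed float32 storage (before X-CUBE-AI compression)."""
    return n_params * bytes_per_weight / 1024.0

# For the 82,066-parameter model used in this project:
# model_footprint_kb(82066) is roughly 320 KB before compression
```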
Usually, raw sound data (PCM) is transformed into the following "coefficients" as features:
My experiments so far have shown that MFSCs+CNN outperforms both MFCCs+DNN and MFCCs+CNN. Also, a DNN tends to use more memory than a CNN (more flash memory, in the case of X-CUBE-AI). So I use MFSCs for deep learning in this project.
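To make the distinction concrete, here is a minimal NumPy sketch of both feature types (the sample rate, FFT size, and windowing are assumptions for illustration, not the firmware's actual filter chain); an MFCC is just one extra DCT step on top of an MFSC:

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Convert mel-scale values back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, fs=16000):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfsc(frame, fbank):
    """MFSCs: log mel filterbank energies of one windowed PCM frame."""
    power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    return np.log(fbank @ power + 1e-10)

def mfcc(frame, fbank, n_coeffs=13):
    """MFCCs: DCT-II of the MFSCs (one extra decorrelation step)."""
    s = mfsc(frame, fbank)
    n = len(s)
    k = np.arange(n)
    return np.array([np.sum(s * np.cos(np.pi * m * (2 * k + 1) / (2 * n)))
                     for m in range(n_coeffs)])
```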
The following CNN model performs very well and avoids over-fitting in most of the use cases I have tried:
Original data size: PCM 16bit 512*32 (26.3msec*32)
STFT/Spectrogram size
- Stride: 13.2msec * 64
- Overlap: 50%
MFSCs resolution: filterbank of 40 triangle filters
Quantized input tensor: MFSCs int8_t (64, 40, 1)
However, X-CUBE-AI currently supports float32_t only, so int8_t is used just for transmitting the data to the PC over UART.
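The int8_t packing for the UART link can be sketched as a simple linear mapping; the per-frame min/max scale-and-offset scheme below is an assumption for illustration, not the project's actual encoding:

```python
import numpy as np

def quantize_mfsc(features):
    """Map float MFSCs linearly into [-128, 127] for int8_t transfer.
    The per-frame min/max scaling here is an illustrative assumption."""
    f_min, f_max = float(features.min()), float(features.max())
    scale = 255.0 / max(f_max - f_min, 1e-10)
    q = np.round((features - f_min) * scale - 128.0)
    return q.astype(np.int8), f_min, scale

def dequantize_mfsc(q, f_min, scale):
    """Recover approximate float32 MFSCs on the PC side, where the
    X-CUBE-AI-generated code expects float32_t input."""
    return (q.astype(np.float32) + 128.0) / scale + f_min
```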
CNN model on Keras
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_81 (Conv2D) (None, 62, 38, 8) 80
_________________________________________________________________
max_pooling2d_79 (MaxPooling (None, 31, 19, 8) 0
_________________________________________________________________
dropout_57 (Dropout) (None, 31, 19, 8) 0
_________________________________________________________________
conv2d_82 (Conv2D) (None, 29, 17, 16) 1168
_________________________________________________________________
max_pooling2d_80 (MaxPooling (None, 14, 8, 16) 0
_________________________________________________________________
dropout_58 (Dropout) (None, 14, 8, 16) 0
_________________________________________________________________
conv2d_83 (Conv2D) (None, 12, 6, 32) 4640
_________________________________________________________________
max_pooling2d_81 (MaxPooling (None, 6, 3, 32) 0
_________________________________________________________________
dropout_59 (Dropout) (None, 6, 3, 32) 0
_________________________________________________________________
flatten_27 (Flatten) (None, 576) 0
_________________________________________________________________
dense_62 (Dense) (None, 128) 73856
_________________________________________________________________
dropout_60 (Dropout) (None, 128) 0
_________________________________________________________________
dense_63 (Dense) (None, 18) 2322
=================================================================
Total params: 82,066
Trainable params: 82,066
Non-trainable params: 0
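For reference, here is a Keras sketch that reproduces the layer shapes and parameter counts in the summary above. The dropout rates and activations are my assumptions; they are not visible in the summary itself:

```python
from tensorflow.keras import layers, models

def build_cnn(input_shape=(64, 40, 1), n_classes=18):
    """CNN matching the summary above: three Conv2D/MaxPooling2D/Dropout
    stages, then Flatten -> Dense(128) -> Dropout -> Dense(n_classes).
    Dropout rates and activations are assumed, not from the source."""
    return models.Sequential([
        layers.Conv2D(8, (3, 3), activation='relu',
                      input_shape=input_shape),       # -> (62, 38, 8)
        layers.MaxPooling2D((2, 2)),                  # -> (31, 19, 8)
        layers.Dropout(0.25),
        layers.Conv2D(16, (3, 3), activation='relu'), # -> (29, 17, 16)
        layers.MaxPooling2D((2, 2)),                  # -> (14, 8, 16)
        layers.Dropout(0.25),
        layers.Conv2D(32, (3, 3), activation='relu'), # -> (12, 6, 32)
        layers.MaxPooling2D((2, 2)),                  # -> (6, 3, 32)
        layers.Dropout(0.25),
        layers.Flatten(),                             # -> (576,)
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation='softmax'),
    ])
```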
=> Japanese word "sushi" via convolution layer
I loaded a trained CNN model (Keras model) into Cube.AI and generated inference code. The model consumed only 25 KBytes of SRAM and 105 KBytes (compressed) of flash memory, and one inference took around 170 msec on the STM32L476RG.
A duration of 170 msec is acceptable (not too slow) in my use cases.
Arm is also working on Helium, which should eventually make it possible to process acoustic features for inference in real time.
Room impulse response
:
V
Sound -->(Line distortion)--(+)->[Feature engineering]--Feature->[Normalization]->[Neural Network]->Inference
convolved ^
| Added
|
(Ambient noise)
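The signal path in the diagram above amounts to: received = sound convolved with the room impulse response, plus ambient noise. A minimal NumPy sketch with a hand-written toy RIR (a real room response would be measured, not written like this):

```python
import numpy as np

def simulate_received(sound, rir, noise):
    """Signal model from the diagram: the mic signal is the source
    convolved with the room impulse response (line distortion),
    plus additive ambient noise."""
    convolved = np.convolve(sound, rir)[:len(sound)]
    return convolved + noise[:len(sound)]

# Toy RIR: a direct path plus two attenuated, delayed reflections.
rir = np.zeros(256)
rir[0], rir[40], rir[120] = 1.0, 0.5, 0.25
```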
I have been observing that the room impulse response (which appears as line distortion) has a significant effect on inference.
My strategy for tackling the problem is:
If the above conditions are satisfied, this small neural network works very well.
I have also observed that the sound of an air conditioner significantly affects inference accuracy.