[Micro] Run model for `Keyword Spotting` on Arduino using `TFLite`

KJlaccHoeUM9l commented 1 year ago

There are several categories in MLPerf where performance results can be submitted: https://github.com/mlcommons/tiny/tree/master/benchmark

As a first step toward a submission for microcontrollers, the proposal is to run a Keyword Spotting model on Arduino: DS-CNN.

The task is to run this model and document the steps required to get it running.

KJlaccHoeUM9l commented 1 year ago

@KJlaccHoeUM9l

Red-Caesar commented 1 year ago

My steps to solve this task:

1. Found the trained model in kws_model_data.cpp.
2. Started adapting the micro_speech example from TensorFlow Lite to our new model. First, in micro_features_model.cpp, I replaced the default g_model with ours and changed g_model_len to 53936.
3. After that I ran into a problem that I could not solve for a long time:

So, here is how to solve it:

1. Go to micro_speech.ino and delete the code below:

static tflite::MicroMutableOpResolver<4> micro_op_resolver;
if (micro_op_resolver.AddDepthwiseConv2D() != kTfLiteOk) {
  return;
}
if (micro_op_resolver.AddFullyConnected() != kTfLiteOk) {
  return;
}
if (micro_op_resolver.AddSoftmax() != kTfLiteOk) {
  return;
}
if (micro_op_resolver.AddReshape() != kTfLiteOk) {
  return;
}

Use instead:

static tflite::MicroMutableOpResolver<6> micro_op_resolver;
if (micro_op_resolver.AddDepthwiseConv2D() != kTfLiteOk) {
  return;
}
if (micro_op_resolver.AddFullyConnected() != kTfLiteOk) {
  return;
}
if (micro_op_resolver.AddSoftmax() != kTfLiteOk) {
  return;
}
if (micro_op_resolver.AddReshape() != kTfLiteOk) {
  return;
}
if (micro_op_resolver.AddConv2D() != kTfLiteOk) {
  return;
}
if (micro_op_resolver.AddAveragePool2D() != kTfLiteOk) {
  return;
}
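
The template argument matters here: as far as I understand, the number in MicroMutableOpResolver<N> is a fixed limit on how many ops can be registered, so adding the two extra ops without raising 4 to 6 would fail. Below is a minimal sketch of that rule. FixedOpList is a hypothetical stand-in written for illustration, not the TFLite Micro class, and the op name strings are only labels chosen for this example.

```cpp
#include <array>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a fixed-capacity op resolver.
// It is NOT tflite::MicroMutableOpResolver; it only mimics the capacity rule.
template <std::size_t kCapacity>
class FixedOpList {
 public:
  // Returns true if the op was stored, false if the list is already full.
  bool Add(const char* op_name) {
    if (count_ >= kCapacity) {
      return false;
    }
    ops_[count_++] = op_name;
    return true;
  }

  // Returns true if an op with this name was stored earlier.
  bool Has(const char* op_name) const {
    for (std::size_t i = 0; i < count_; ++i) {
      if (std::strcmp(ops_[i], op_name) == 0) {
        return true;
      }
    }
    return false;
  }

 private:
  std::array<const char*, kCapacity> ops_{};
  std::size_t count_ = 0;
};

// Tries to register the six ops used in the replacement code above.
// Returns true only if every one of them fit into the list.
template <std::size_t kCapacity>
bool RegisterKwsOps(FixedOpList<kCapacity>& ops) {
  const char* const kOps[] = {"DEPTHWISE_CONV_2D", "FULLY_CONNECTED",
                              "SOFTMAX",           "RESHAPE",
                              "CONV_2D",           "AVERAGE_POOL_2D"};
  bool all_added = true;
  for (const char* op : kOps) {
    // Add() is always called so later ops are still attempted after a failure.
    all_added = ops.Add(op) && all_added;
  }
  return all_added;
}
```

With a capacity of 4 the sketch rejects the last two ops, and with a capacity of 6 all of them fit, which mirrors why the resolver size changes together with the added ops.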

2. Delete the check below:

if ((model_input->dims->size != 2) ||
    (model_input->dims->data[0] != 1) ||
    (model_input->dims->data[1] != (kFeatureSliceCount * kFeatureSliceSize)) ||
    (model_input->type != kTfLiteInt8)) {
  MicroPrintf("Bad input tensor parameters in model");
  return;
}

Use instead:

if ((model_input->dims->size != 4) ||
    (model_input->dims->data[0] != 1) ||
    (model_input->dims->data[1] != kFeatureSliceCount) ||
    (model_input->dims->data[2] != kFeatureSliceSize) ||
    (model_input->dims->data[3] != 1) ||
    (model_input->type != kTfLiteInt8)) {
  MicroPrintf("Bad input tensor parameters in model");
  return;
}
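
The new check expects a 4D input of shape {1, kFeatureSliceCount, kFeatureSliceSize, 1} instead of the old flat 2D shape. The same comparison can be sketched without the TFLite types, which makes it easy to try on a desktop. IsExpectedInputShape is a hypothetical helper written for this example, and the constant values are the ones given for micro_features_micro_model_settings.h in this thread.

```cpp
// Values taken from the settings described in this thread.
constexpr int kFeatureSliceSize = 10;
constexpr int kFeatureSliceCount = 49;

// Hypothetical helper: returns true only when dims describes a tensor of
// shape {1, kFeatureSliceCount, kFeatureSliceSize, 1}.
bool IsExpectedInputShape(const int* dims, int rank) {
  const int expected[] = {1, kFeatureSliceCount, kFeatureSliceSize, 1};
  const int expected_rank = 4;
  if (rank != expected_rank) {
    return false;
  }
  for (int i = 0; i < expected_rank; ++i) {
    if (dims[i] != expected[i]) {
      return false;
    }
  }
  return true;
}
```

A flat shape such as {1, 490} is rejected by this sketch even though it holds the same number of elements, which is the same behavior as the rank check in the replacement code.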

3. Go to micro_features_micro_model_settings.h and change the constants to this:

constexpr int kFeatureSliceSize = 10;
constexpr int kFeatureSliceCount = 49;

constexpr int kSilenceIndex = 10;
constexpr int kUnknownIndex = 11;
constexpr int kCategoryCount = 12;

4. Go to `micro_features_micro_model_settings.cpp` and change the old array like this:

const char* kCategoryLabels[kCategoryCount] = { "down", "go", "left", "no", "off", "on", "right", "stop", "up", "yes", "silence", "unknown" };


After that the model will run, but it does not yet work properly.

That is my next issue, which I will describe in the next comment.

Red-Caesar commented 1 year ago

First of all, I was given three topics to explore:

1. Find the place in the code where input from the mic is transformed into the input tensor
2. How the model uses the input axes
3. Find the place where the data is postprocessed

For myself I drew the following scheme:

Honestly, I can't fully answer each question, but these are my assumptions:

1. Input from the mic arrives all the time. The board waits for signals from the PDM MONO @ 16 kHz system. In the function PopulateFeatureData() in feature_provider.cpp the audio is split into slices by time and transformed into the required form.
2. I can't find it. The previous model uses a 1x1960 tensor. The current model uses a 49x10 tensor. But feature_buffer, which we use to feed data to the model, is a 1D array. So I don't know how to answer this question; maybe I misunderstand something.
3. It happens in the function ProcessLatestResults() in recognize_commands.cpp. It uses the 1x12 output tensor. Here we compute scores for each prediction and choose the best one.
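
On the question about input axes, one possible explanation is that tensor data is stored contiguously in row-major order, so a 1D buffer of 49 * 10 = 490 values and a tensor of shape {1, 49, 10, 1} describe the same memory. The sketch below illustrates that mapping. FlatIndex, CopyFeatures and kFeatureElementCount are names chosen for this example, and the row-major layout is my assumption about how the interpreter reads the buffer.

```cpp
#include <cstdint>

constexpr int kFeatureSliceSize = 10;   // coefficients per time slice
constexpr int kFeatureSliceCount = 49;  // time slices per inference
constexpr int kFeatureElementCount = kFeatureSliceSize * kFeatureSliceCount;

// Row-major offset of element [0][slice][coeff][0] in a {1, 49, 10, 1} tensor.
// The outer and inner dimensions have size 1, so they add nothing to the offset.
constexpr int FlatIndex(int slice, int coeff) {
  return slice * kFeatureSliceSize + coeff;
}

// Copies the flat feature buffer into the model input buffer element by
// element. No reordering is needed when both use row-major layout.
void CopyFeatures(const std::int8_t* feature_buffer, std::int8_t* input_data) {
  for (int i = 0; i < kFeatureElementCount; ++i) {
    input_data[i] = feature_buffer[i];
  }
}
```

Under this assumption the axis order only matters when the buffer is filled: each time slice has to be written as 10 consecutive values, one slice after another.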

I also found some test input data and tried to feed it to the model. I think the problem really is in the input data @KJlaccHoeUM9l

Red-Caesar commented 1 year ago

I had a task to build boxplots for:

unprepared (raw) data
data after MFCC
data from the variable dat (eval_quantized_model.py)
data from dat_q (eval_quantized_model.py)

Here is my notebook: https://github.com/Red-Caesar/data-analysis-for-project/blob/main/data_analysis.ipynb
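
For comparing value ranges between dat and dat_q, it may help to keep the usual int8 quantization relation in mind: real_value = (q - zero_point) * scale. The sketch below assumes that dat_q is the int8-quantized form of dat, which the thread does not state explicitly. The function names are hypothetical, and the script may round or clamp differently.

```cpp
#include <cmath>
#include <cstdint>

// Quantizes one float to int8 using q = round(value / scale) + zero_point,
// then clamps the result to the int8 range [-128, 127].
std::int8_t QuantizeToInt8(float value, float scale, int zero_point) {
  const long q = std::lround(value / scale) + zero_point;
  if (q < -128) {
    return -128;
  }
  if (q > 127) {
    return 127;
  }
  return static_cast<std::int8_t>(q);
}

// Inverse mapping: recovers the approximate float value from an int8 value.
float DequantizeFromInt8(std::int8_t q, float scale, int zero_point) {
  return static_cast<float>(static_cast<int>(q) - zero_point) * scale;
}
```

If the features computed on the board have a different range than the training features, most values would saturate at -128 or 127 after this step, which should be visible in a boxplot.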

Red-Caesar commented 1 year ago

@KJlaccHoeUM9l

Red-Caesar commented 1 year ago

I forgot to add a snapshot of the data preparation in the Arduino example. It may be useful for comparison too:

Red-Caesar commented 1 year ago

Our example: https://github.com/Red-Caesar/MLPerf-Tiny/blob/master/benchmark/training/keyword_spotting/get_dataset.py

Old example: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/experimental/microfrontend/lib/frontend.c

Red-Caesar commented 1 year ago

Input with unprepared data:

yes_input_preproc.txt

frontend_input = data from the file above
input_size = 480 (duration_ms * (kAudioSampleFrequency / 1000), where duration_ms = kFeatureSliceDurationMs = 30 and kAudioSampleFrequency = 16000)
num_samples_read = 320 (it is 66221 once in 12 times)
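
The input_size arithmetic can be written out as a small compile-time check. SamplesForDurationMs is a hypothetical helper name. The two constants use the values quoted above.

```cpp
constexpr int kAudioSampleFrequency = 16000;  // samples per second
constexpr int kFeatureSliceDurationMs = 30;   // window length in milliseconds

// Number of audio samples covered by duration_ms at the given sample rate.
// Uses the same formula as above: duration_ms * (sample_rate_hz / 1000).
constexpr int SamplesForDurationMs(int duration_ms, int sample_rate_hz) {
  return duration_ms * (sample_rate_hz / 1000);
}

static_assert(SamplesForDurationMs(kFeatureSliceDurationMs,
                                   kAudioSampleFrequency) == 480,
              "a 30 ms window at 16 kHz should hold 480 samples");
```

By the same formula, 320 samples correspond to 20 ms at 16 kHz. That may be the stride between windows, though this is only a guess from the numbers.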

vvchernov commented 1 year ago

Raw signal preprocessing on the MLPerf side (model_settings['feature_type'] == "mfcc"):

1. Cast to float32
2. Calculate the max value and normalize by it
3. Pad the end of the signal to the desired samples(?) length and fill the tail with zeros
4. Create a copy of the signal for foreground scaling
5. Pad the foreground copy by 2 on both sides and fill the padding with zeros
6. Extract a slice from the foreground copy starting at offset 2 with size equal to the desired samples. In this case it is the same as the signal
7. Process the sliced foreground copy with a short-time Fourier transform (STFT) with parameters: frame_length = model_settings['window_size_samples'], frame_step = model_settings['window_stride_samples'], window_fn = Hann
8. Calculate the abs and length from the STFT output
9. Calculate the matrix that transforms the STFT spectrogram into a mel-spectrogram
10. Calculate the mel-spectrogram using the matrix
11. Correct the mel-spectrogram shape to the new number of bins
12. Calculate the stabilized natural log of the mel-spectrogram: log(mel_spectrograms + 1e-6)
13. Calculate the MFCCs from the log mel-spectrogram and keep only the first model_settings['dct_coefficient_count'] of them
14. Reshape the final result to (model_settings['spectrogram_length'], model_settings['dct_coefficient_count'], 1)

Important: points 4 to 6 can be skipped, because the output of point 3 can be used in point 7 without any processing.

Notes: the reference https://kite.com/python/docs/tensorflow.contrib.slim.rev_block_lib.contrib_framework_ops.audio_ops.mfcc is given here with a description of the default parameters used for the MFCC calculation.

See also the pipeline for MFCC calculation here: https://www.tensorflow.org/api_docs/python/tf/signal/mfccs_from_log_mel_spectrograms
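
For porting to the board, the simpler parts of this pipeline (cast to float, normalize by the max value, zero-pad the tail, and the stabilized log) can be sketched in C++ as below. This is only an illustration, not the MLPerf code: the function names are hypothetical, the input is assumed to be int16 samples, and the guard for a non-positive maximum is my own addition. The STFT, mel matrix and MFCC steps are not covered.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Cast to float and divide every sample by the maximum value.
// Assumption: int16 input. If the maximum is not positive, the samples are
// returned unscaled to avoid dividing by zero or flipping the sign.
std::vector<float> NormalizeByMax(const std::vector<std::int16_t>& samples) {
  std::vector<float> out(samples.begin(), samples.end());
  if (out.empty()) {
    return out;
  }
  const float max_value = *std::max_element(out.begin(), out.end());
  if (max_value > 0.0f) {
    for (float& v : out) {
      v /= max_value;
    }
  }
  return out;
}

// Pad the end of the signal with zeros up to desired_samples.
// A signal that is already long enough is returned unchanged.
std::vector<float> PadToLength(std::vector<float> signal,
                               std::size_t desired_samples) {
  if (signal.size() < desired_samples) {
    signal.resize(desired_samples, 0.0f);
  }
  return signal;
}

// Stabilized natural log: log(x + 1e-6), so that zero energy stays finite.
float StabilizedLog(float mel_value) {
  return std::log(mel_value + 1e-6f);
}
```

Note that the normalization divides by the maximum value, as the list says, rather than by the maximum absolute value, so negative samples can end up below -1.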

vvchernov commented 1 year ago

cc @Red-Caesar @FlexingJelly @KJlaccHoeUM9l

Red-Caesar commented 1 year ago

Scheme of the preprocessing steps in frontend.c:

And the repo, just in case: https://github.com/Red-Caesar/frontend-TensorFlow

Red-Caesar commented 1 year ago

Notes about tensorflow functions: https://quiver-brace-02a.notion.site/Tensorflow-18f6cf2d0c854254b7f89f822cf7bc4f

Deelvin / ck

[Micro] Run model for `Keyword Spotting` on Arduino using `TFLite` #2