ShawnHymel / ei-keyword-spotting


Audio Sampling and Inferencing #2

Closed jimakos96 closed 3 years ago

jimakos96 commented 3 years ago

Hello, I was wondering if you could explain how you feed the inference model with your samples. I am building a similar project in Zephyr and I want to use the Edge Impulse library.

ShawnHymel commented 3 years ago

If you take a look at main.cpp in one of the examples (https://github.com/ShawnHymel/ei-keyword-spotting/blob/master/embedded-demos/stm32cubeide/nucleo-l476-keyword-spotting/Core/Src/main.cpp), you can see that the Edge Impulse library relies on this function call to perform feature extraction from the raw data as well as perform inference:

EI_IMPULSE_ERROR r = run_classifier_continuous(&signal, &result, debug_nn);

The signal struct holds the raw data (16-bit PCM). Because this is a "continuous" application, a double buffer is used so that one buffer is filling while the other is being read into the feature extraction/inference calculation. Each buffer holds 0.25 seconds of audio data, and each time one of them fills up, the double buffer pointers are swapped. The newly filled buffer is then used as the buffer element in the signal struct (there's some translation to floats in there) to be fed to the run_classifier_continuous() function.

There, the library computes the MFCCs of that 0.25 sec slice of audio data and appends it to a running window of 1 second's worth of MFCCs. Each time run_classifier_continuous() is called, inference is performed on the 1 second window of MFCCs (even though only 0.25 sec of new audio has been added).

I recommend working with a non-continuous example first to see how to feed raw data to their inference API. It's an easier start, but it means you'll be missing out on some data while inference is being performed.
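If it helps, here is a minimal sketch of how a raw buffer gets wrapped in a `signal_t` and handed to the classifier. This is an illustration based on the Edge Impulse C++ SDK examples, not a copy of the demo's main.cpp; names like `active_buf` and `classify_slice` are placeholders, and the same signal setup works with `run_classifier()` for the non-continuous case.

```cpp
#include "ei_run_classifier.h"  // Edge Impulse C++ SDK entry point (the demo pulls this
                                // in via its project-generated headers)

static int16_t *active_buf;     // points at whichever slice buffer just finished filling

// Callback the SDK uses to pull raw features; converts 16-bit PCM to float on the fly
// (the SDK also ships a numpy::int16_to_float helper for this)
static int audio_signal_get_data(size_t offset, size_t length, float *out_ptr) {
    for (size_t i = 0; i < length; i++) {
        out_ptr[i] = (float)active_buf[offset + i];
    }
    return 0;
}

// Run feature extraction + inference on the slice that just filled
static void classify_slice(void) {
    signal_t signal;
    signal.total_length = EI_CLASSIFIER_SLICE_SIZE;  // samples in one 0.25 s slice
    signal.get_data = &audio_signal_get_data;

    ei_impulse_result_t result = { 0 };
    EI_IMPULSE_ERROR r = run_classifier_continuous(&signal, &result, false);
    // Non-continuous path: fill a full 1 s buffer and call run_classifier() instead.
    if (r != EI_IMPULSE_OK) {
        return;  // handle/print the error code
    }
    // result.classification[i].label and .value hold the per-class scores
}
```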

Hope that helps!

jimakos96 commented 3 years ago

Can you tell me if there is a reason why you chose I2S_BUF_LEN to be 6400? Is there a limit on the value of I2S_BUF_LEN relative to inference.n_samples?

ShawnHymel commented 3 years ago

I2S_BUF_LEN is somewhat arbitrary. It sets up a double buffer for I2S. We drop a channel (for mono) and downsample from 32kHz to 16 kHz, so only every 4th sample is used. STM32 HAL has us set up the double buffer as one contiguous set of memory, so 6400 samples is really 3200 samples for one of the buffers. We divide by 4 (drop a channel and downsample) to get 800 samples.
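In case a sketch helps, the decimation happens in the STM32 HAL half-transfer and transfer-complete callbacks. This is only an illustration of the idea described above; the buffer name and helper are mine, not the demo's:

```cpp
#include "stm32l4xx_hal.h"                    // device-family HAL header (assumes an L4 part)

#define I2S_BUF_LEN 6400                      // one contiguous DMA buffer, used as two 3200-sample halves
static int16_t i2s_buf[I2S_BUF_LEN];          // target of the circular I2S DMA transfer

void push_to_inference_buffer(int16_t sample); // hypothetical helper, shown in the next sketch

// First half of the DMA buffer is full
void HAL_I2S_RxHalfCpltCallback(I2S_HandleTypeDef *hi2s) {
    // Keep every 4th sample: drop one stereo channel and decimate 32 kHz -> 16 kHz,
    // so 3200 raw samples become 800 audio samples
    for (uint32_t i = 0; i < (I2S_BUF_LEN / 2); i += 4) {
        push_to_inference_buffer(i2s_buf[i]);
    }
}

// Second half of the DMA buffer is full
void HAL_I2S_RxCpltCallback(I2S_HandleTypeDef *hi2s) {
    for (uint32_t i = (I2S_BUF_LEN / 2); i < I2S_BUF_LEN; i += 4) {
        push_to_inference_buffer(i2s_buf[i]);
    }
}
```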

Every time one of those buffers fills up, audio_buffer_inference_callback() is called, and those 800 samples are copied to inference.buffers (which is another double buffer). Once one of those buffers is filled with 4000 samples (16000 / 4), the inference.buf_ready flag is set, which tells the main thread (there's only one thread...this is not an RTOS) to run inference.
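And a matching sketch of the inference-side double buffer and the buf_ready handshake; the struct layout here is an assumption loosely modeled on the Edge Impulse microphone examples, not the demo's exact code:

```cpp
#include <stdint.h>

typedef struct {
    int16_t *buffers[2];     // two slice buffers (double buffer)
    uint8_t  buf_select;     // which buffer is currently being filled
    uint8_t  buf_ready;      // set when a full slice is available for inference
    uint32_t buf_count;      // samples written into the active buffer so far
    uint32_t n_samples;      // samples per slice, e.g. 16000 / 4 = 4000
} inference_t;

static inference_t inference;

// Called from the I2S callbacks above for every kept (mono, 16 kHz) sample
void push_to_inference_buffer(int16_t sample) {
    inference.buffers[inference.buf_select][inference.buf_count++] = sample;
    if (inference.buf_count >= inference.n_samples) {
        inference.buf_select ^= 1;   // start filling the other buffer
        inference.buf_count = 0;
        inference.buf_ready = 1;     // tell the main loop a slice is ready
    }
}

// Main loop (single-threaded, no RTOS): when a slice is ready, run inference on the
// buffer that just filled, i.e. buffers[buf_select ^ 1]:
//   if (inference.buf_ready) { inference.buf_ready = 0; classify_slice(); }
```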

Hope that helps!

jimakos96 commented 3 years ago

Thanks for the explanation, it's really helpful.

ShawnHymel commented 3 years ago

Glad it helped!

elimsjxr commented 3 years ago

> If you take a look at main.cpp in one of the examples (https://github.com/ShawnHymel/ei-keyword-spotting/blob/master/embedded-demos/stm32cubeide/nucleo-l476-keyword-spotting/Core/Src/main.cpp), you can see that the Edge Impulse library relies on this function call to perform feature extraction from the raw data as well as perform inference:
>
> EI_IMPULSE_ERROR r = run_classifier_continuous(&signal, &result, debug_nn);
>
> The signal struct holds the raw data (16-bit PCM). Because this is a "continuous" application, a double buffer is used so that one buffer is filling while the other is being read into the feature extraction/inference calculation. Each buffer holds 0.25 seconds of audio data, and each time one of them fills up, the double buffer pointers are swapped. The newly filled buffer is then used as the buffer element in the signal struct (there's some translation to floats in there) to be fed to the run_classifier_continuous() function.
>
> There, the library computes the MFCCs of that 0.25 sec slice of audio data and appends it to a running window of 1 second's worth of MFCCs. Each time run_classifier_continuous() is called, inference is performed on the 1 second window of MFCCs (even though only 0.25 sec of new audio has been added).
>
> I recommend working with a non-continuous example first to see how to feed raw data to their inference API. It's an easier start, but it means you'll be missing out on some data while inference is being performed.
>
> Hope that helps!

How do you set the length of the audio data? 0.25 s seems so small that I can't be sure a whole word is captured correctly in 0.25 s.

ShawnHymel commented 3 years ago

@elimsjxr the slices are determined by EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW (which should be 4 in this demo, giving a slice of 0.25s). That being said, I think what you're looking for is here:

> There, the library computes the MFCCs of that 0.25 sec slice of audio data and appends it to a running window of 1 second's worth of MFCCs. Each time run_classifier_continuous() is called, inference is performed on the 1 second window of MFCCs (even though only 0.25 sec of new audio has been added).

Inference is performed for 1 second of audio data, which consists of 4 of these time slices. You would set the total window time (1 second) in your Edge Impulse project (in their web-based interface). 1 second should be enough for most basic wake words.
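To put numbers on it (assuming the demo's 16 kHz sample rate): a 1-second window set in the Edge Impulse project gives EI_CLASSIFIER_RAW_SAMPLE_COUNT = 16000 samples, and with EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW = 4 each slice is 16000 / 4 = 4000 samples, i.e. 0.25 s. The word only has to land somewhere inside the 1-second window, not inside a single slice.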

elimsjxr commented 3 years ago

> @elimsjxr the slices are determined by EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW (which should be 4 in this demo, giving a slice of 0.25s). That being said, I think what you're looking for is here:
>
> > There, the library computes the MFCCs of that 0.25 sec slice of audio data and appends it to a running window of 1 second's worth of MFCCs. Each time run_classifier_continuous() is called, inference is performed on the 1 second window of MFCCs (even though only 0.25 sec of new audio has been added).
>
> Inference is performed for 1 second of audio data, which consists of 4 of these time slices. You would set the total window time (1 second) in your Edge Impulse project (in their web-based interface). 1 second should be enough for most basic wake words.

But when I tested the code and spoke each word into the microphone, I couldn't get the correct classification result. Why? Should I say a word every 0.25 seconds?

ShawnHymel commented 3 years ago

@elimsjxr There are many reasons why inference might not be working. You should not need to say the word every 0.25 seconds; saying it once within a 1-second window should be sufficient.

What is your target keyword (or keywords)? What sample set did you use to train the model (i.e. did you use one of the Google Speech Commands Dataset words as your target keyword, or did you add your own)? Can you verify that the microphone is working through other means (e.g. capturing and recording sounds)?

elimsjxr commented 3 years ago

Excuse me, recently I used another dataset that includes several .wav audio files; each audio file is 10 seconds long, recorded with a sensor sampling at 48 kHz. When I try to deploy the model on this project, I get this error: [screenshot]. I found that this function might be the problem: [screenshot]. The value of "ret" is -1002; what should I do next?

ShawnHymel commented 3 years ago

-1002 is "out of memory" (https://forum.edgeimpulse.com/t/err-mfcc-failed-1002/2075). 10 seconds is a very large window...most microcontrollers used for these TinyML demos will only be able to handle 1-2 second windows.
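For rough scale (my back-of-the-envelope numbers, not Edge Impulse's): 10 seconds at even 16 kHz is 160,000 samples, which is already ~320 KB as 16-bit PCM and ~640 KB once converted to floats for DSP, before counting the MFCC feature matrix and network activations; at your 48 kHz it's larger still. That is well beyond the SRAM of most Cortex-M4 class parts, which is why the MFCC step fails with -1002.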

elimsjxr commented 3 years ago

> -1002 is "out of memory" (https://forum.edgeimpulse.com/t/err-mfcc-failed-1002/2075). 10 seconds is a very large window...most microcontrollers used for these TinyML demos will only be able to handle 1-2 second windows.

Thanks! Now the error is as shown below: [screenshot]. Does it mean that if I increase the heap memory it can run normally (with 10 s audio)? How do I modify the heap memory?

ShawnHymel commented 3 years ago

I think you can update the minimum heap size in STM32CubeIDE, but the heap grows into RAM as it's needed, and you can't just tell your microcontroller to have more RAM than it physically has. You'll need a microcontroller with more RAM. In my experience, most microcontrollers (save for maybe something like the STM32H7 series) will not have enough RAM to handle keyword spotting with 10-second windows. You might need to move to a microprocessor with off-chip RAM (e.g. embedded Linux) if you need that much memory.
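For reference, the minimum-heap reservation lives in the STM32CubeIDE-generated linker script; the file name and values below are typical examples (e.g. something like STM32L476RGTX_FLASH.ld), not taken from this project, and raising the symbol only changes the link-time check, it does not create more physical RAM:

```
/* STM32CubeIDE-generated linker script (values shown are illustrative) */
_Min_Heap_Size = 0x8000;   /* e.g. reserve 32 KB of heap instead of the small default */
_Min_Stack_Size = 0x400;   /* stack reservation, typically left as-is */
```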