ARM-software / ML-KWS-for-MCU

Keyword spotting on Arm Cortex-M Microcontrollers
Apache License 2.0

About model's posterior on devices #13

Closed NearLinHere closed 6 years ago

NearLinHere commented 6 years ago

Hi, Thank you for releasing your awesome work. I would like to ask some questions about the model's posterior on the devices.

According to the paper,

KWS is running at 10 inferences per second.

(1) Do you smooth the inference scores? If so, how often do you smooth them, and which algorithm do you use? (2) How is the confidence score computed? Which algorithm do you use?

Thank you for taking the time to answer these questions.

navsuda commented 6 years ago

Hi @NearLinHere, (1) In the provided example, we smooth the softmax outputs by averaging over 3 inferences, as shown here and here. (2) The final detection is made by comparing the averaged score with a threshold (70% in this case), as shown here. You may have to tune the detection threshold based on your trained model and hardware.
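The smoothing and thresholding described above can be sketched roughly as follows. This is a minimal Python illustration, not the repository's code: the class name `PosteriorSmoother`, its parameters, and the choice to average over however many inferences are available before the window fills are all assumptions made for the example. Only the window of 3 and the 70% threshold come from the answer above.

```python
from collections import deque


class PosteriorSmoother:
    """Sliding-window average of softmax outputs with a detection threshold.

    Hypothetical helper for illustration; names are not from the repository.
    """

    def __init__(self, num_classes, window=3, threshold=0.7):
        self.num_classes = num_classes
        self.threshold = threshold
        # A deque with maxlen drops the oldest inference automatically.
        self.history = deque(maxlen=window)

    def update(self, probs):
        """Add one softmax vector; return (class_index, averaged_score, detected)."""
        if len(probs) != self.num_classes:
            raise ValueError("expected %d class probabilities" % self.num_classes)
        self.history.append(list(probs))
        # Assumption: until the window is full, average over what is available.
        count = len(self.history)
        averaged = [
            sum(frame[c] for frame in self.history) / count
            for c in range(self.num_classes)
        ]
        best = max(range(self.num_classes), key=lambda c: averaged[c])
        return best, averaged[best], averaged[best] >= self.threshold
```

In use, `update` would be called once per inference with the model's softmax vector, and a detection reported only when the third returned value is true and the class is an actual keyword rather than a silence or unknown class.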

NearLinHere commented 6 years ago

Hi @navsuda, thank you for your kind answer. Since I am a beginner and really want to understand how to set these parameters correctly and efficiently, I hope you don't mind me asking for more detail.

(1) May I ask how you decide how many inference results to average for smoothing? In other words, why did you choose to average over "3" (240 ms) inferences rather than "4", "5", or some other number? What factors did you take into consideration?

(2) How did you tune the threshold? Is there any guide for this?

(3) Have you implemented Voice Activity Detection? I am wondering whether VAD could reduce CPU usage and power consumption. What is your opinion?

Thank you for answering all these questions.

navsuda commented 6 years ago

Hi @NearLinHere, the TensorFlow speech commands tutorial and code provide tools to (a) generate a continuous audio stream containing keywords and (b) evaluate the detection accuracy for a given averaging window length and threshold. By varying these two parameters and comparing the resulting detection accuracy, you can choose the final threshold and averaging window length. We haven't implemented voice activity detection (VAD), but you are right that VAD is a way to reduce CPU load.
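The parameter search described above amounts to a small grid search. The sketch below is a simplified stand-in, not the TensorFlow tooling: it assumes you already have a list of per-inference keyword scores and matching per-inference ground-truth labels, and it scores each setting by frame-level accuracy, whereas the real evaluation matches detections against keyword timestamps. The function names `moving_average`, `frame_accuracy`, and `sweep` are hypothetical.

```python
from itertools import product


def moving_average(scores, window):
    """Average each score with up to window - 1 preceding scores."""
    out = []
    running = 0.0
    for i, s in enumerate(scores):
        running += s
        if i >= window:
            running -= scores[i - window]  # drop the score that left the window
        out.append(running / min(i + 1, window))
    return out


def frame_accuracy(scores, labels, window, threshold):
    """Fraction of inferences where (smoothed score >= threshold) matches the label."""
    smoothed = moving_average(scores, window)
    correct = sum((s >= threshold) == bool(y) for s, y in zip(smoothed, labels))
    return correct / len(labels)


def sweep(scores, labels, windows, thresholds):
    """Return (accuracy, window, threshold) for the best setting on the grid.

    Ties are broken in favour of the larger window, then the larger threshold,
    because tuples are compared element by element.
    """
    return max(
        (frame_accuracy(scores, labels, w, t), w, t)
        for w, t in product(windows, thresholds)
    )
```

A call such as `sweep(scores, labels, windows=[1, 3, 5], thresholds=[0.5, 0.6, 0.7, 0.8])` would then return the best-scoring combination for that recording; the result is only as representative as the stream it was measured on.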

NearLinHere commented 6 years ago

Hi @navsuda, so the standard process for finding appropriate parameters is to (1) use that kind of tool, (2) draw a ROC curve, and (3) pick a set of parameters. Am I right?

Thank you for answering the VAD question.
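Steps (2) and (3) could look roughly like the sketch below, for one fixed averaging window. It assumes per-inference smoothed scores and binary ground-truth labels are already available; `roc_points` and `pick_threshold` are hypothetical names, and choosing the highest true-positive rate under a false-positive budget is just one possible selection rule.

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, false_positive_rate, true_positive_rate) per threshold."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((t, fp / negatives, tp / positives))
    return points


def pick_threshold(points, max_fpr):
    """Pick the threshold with the highest TPR whose FPR stays within max_fpr.

    Returns None if no point meets the budget. On ties, the first matching
    point in the list is returned.
    """
    eligible = [p for p in points if p[1] <= max_fpr]
    if not eligible:
        return None
    return max(eligible, key=lambda p: p[2])[0]
```

Repeating this for each candidate window length gives one ROC curve per window, from which a (window, threshold) pair can be chosen.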

navsuda commented 6 years ago

Hi @NearLinHere, you are right.

Closing the issue.

NearLinHere commented 6 years ago

@navsuda I see. Thank you!