antmicro / tflite-micro-speech-demo

Apache License 2.0

Debugging hints? #1

Open tcal-x opened 2 years ago

tcal-x commented 2 years ago

Hello -- the build seems to complete correctly. After I install the bitstream and load the firmware, I see the LiteX banner, and then nothing after this:

--============= Liftoff! ===============--
*** Booting Zephyr OS version 2.5.0  ***

Can you help me distinguish between the firmware having hung or crashed, and the demo running normally but just not detecting anything?

Some kind of heartbeat might be useful, such as an LED blink at the start of each inference.

tmichalak commented 2 years ago

@tcal-x we updated the demo and added some debug prints. If everything goes fine, you should see the following output:

--============= Liftoff! ===============--
*** Booting Zephyr OS version 2.5.0  ***
Initializing model...ok
Initializing i2s driver... ok
Starting audio recording thread... ok
Awaiting the first audio samples... ok
Running.

But if, for example, you connect the PMOD to the wrong connector and no samples are received, you will see:

--============= Liftoff! ===============--
*** Booting Zephyr OS version 2.5.0  ***
Initializing model...ok
Initializing i2s driver... ok
Starting audio recording thread... ok
Awaiting the first audio samples...

Please give it a try and let us know if that helped.

tcal-x commented 2 years ago

Thanks @tmichalak! I will try it later today.

tcal-x commented 2 years ago

@tmichalak, I did spend some time with this over the weekend. I used the recent commit as a model for adding additional info messages of my own.

First, I added printouts of the audio sample magnitude to help get the levels correct. I ended up using my laptop's microphone piped to its headphone jack with the command pactl load-module module-loopback latency_msec=1 (this prints an integer module ID; to stop the loopback, run pactl unload-module <ID>). This let me adjust the volume until the samples had a good magnitude.

After I saw some reaction from the demo, I still found it difficult to get it to recognize "yes" and "no". I noticed that if I said "yes" or "no" repeatedly, there was a much better chance of it being recognized.

This can be explained by some more digging. I found that each inference examines about 1.0 second of audio converted to spectrogram features, and an inference runs only every 0.8 seconds, so consecutive sample windows overlap very little. This means that if the "yes" or "no" falls in the middle of a window, it will show up in only that window. But the recognition test averages 3 consecutive windows, so a single-window recognition registers at most 1/3 of the max value and never hits the threshold. Saying the word repeatedly, on the other hand, puts a high score in a sequence of consecutive windows, and then the average does pass the threshold.
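
To make that arithmetic concrete, here is a minimal standalone sketch; it is not the demo's actual recognition code, and kWindowsAveraged / kThreshold are just illustrative names and values:

#include <array>
#include <cstdint>
#include <cstdio>

// Number of consecutive inference results averaged before thresholding.
constexpr int kWindowsAveraged = 3;
// Illustrative detection threshold on the averaged 0-255 score.
constexpr int kThreshold = 200;

bool IsDetected(const std::array<uint8_t, kWindowsAveraged>& yes_scores) {
  int sum = 0;
  for (uint8_t s : yes_scores) sum += s;
  return (sum / kWindowsAveraged) > kThreshold;
}

int main() {
  // "yes" lands in only one ~1.0 s window because windows start ~0.8 s apart.
  std::array<uint8_t, kWindowsAveraged> single_hit = {0, 255, 0};
  // Saying "yes" repeatedly puts a high score in several consecutive windows.
  std::array<uint8_t, kWindowsAveraged> repeated = {230, 255, 240};
  std::printf("single hit detected:    %d\n", IsDetected(single_hit));  // 0 (255/3 = 85 < 200)
  std::printf("repeated word detected: %d\n", IsDetected(repeated));    // 1 (725/3 = 241 > 200)
  return 0;
}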

I think this system is intended for a much faster inference rate, so that a single utterance falls into multiple overlapping windows. I did a little work to see if we could make it faster. Of the 0.8 seconds spent on each inference cycle, the inference itself takes only 0.11 seconds, so speeding up the inference won't help significantly. The majority of the time seems to be spent collecting and copying audio samples and running the feature generation (FFTs for the spectrogram). Maybe there are some easy improvements somewhere.
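
If it helps, this is roughly how I would split the timestamps to see where the remaining ~0.7 seconds goes. It is only a sketch against Zephyr 2.5's k_uptime_get() (which returns milliseconds); the three stage functions are placeholders standing in for the demo's sample copying, feature generation, and tflite-micro invocation, not its real symbols:

#include <zephyr.h>
#include <sys/printk.h>

static void copy_audio_samples(void) { /* placeholder: drain the i2s buffer */ }
static void generate_features(void)  { /* placeholder: FFTs -> spectrogram slices */ }
static void run_inference(void)      { /* placeholder: tflite-micro Invoke() */ }

static void time_one_iteration(void)
{
	int64_t t0 = k_uptime_get();
	copy_audio_samples();
	int64_t t1 = k_uptime_get();
	generate_features();
	int64_t t2 = k_uptime_get();
	run_inference();
	int64_t t3 = k_uptime_get();
	printk("copy %d ms, features %d ms, inference %d ms\n",
	       (int)(t1 - t0), (int)(t2 - t1), (int)(t3 - t2));
}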

If we can't speed it up, we should consider changing the detection to trigger on just a single high score from one sample window, not an average of three. We might get some false positives, but it would be better than the current situation where it's almost impossible to recognize a single "yes" or "no".
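
In terms of the sketch above, that would amount to setting kWindowsAveraged (or whatever the corresponding constant is in the real code) to 1, so a single high-scoring window crosses the threshold on its own.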

tmichalak commented 2 years ago

@tcal-x I agree with the above, as we had similar observations. We changed the number of averaged inference results to 1 in tflite-micro and the results seem more consistent. Please make the same change at your end and see if you get a similar impression.

tcal-x commented 2 years ago

@tmichalak, @kiryk, I would like to push a little more on where the runtime is being consumed outside of the inference. Do you have any data on the time breakdown between running the FFTs, copying data, and thread-switch overhead? If we really have no idea, I think it would be worth some investigation, at least collecting some basic profiling data (e.g. the number of interrupts per inference). How big is the i2s FIFO? Just knowing that could give us an idea of how often an interrupt occurs.
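
For the interrupt count, something as simple as an atomic counter bumped from the i2s receive ISR and read/reset once per inference would probably do. A sketch; rx_isr_hook and report_irqs_per_inference are hypothetical names, not the demo's real symbols:

#include <zephyr.h>
#include <sys/atomic.h>
#include <sys/printk.h>

static atomic_t i2s_irq_count = ATOMIC_INIT(0);

/* Call from the i2s receive interrupt handler (hypothetical hook). */
static void rx_isr_hook(void)
{
	atomic_inc(&i2s_irq_count);
}

/* Call once per inference from the main loop; prints and resets the count. */
static void report_irqs_per_inference(void)
{
	printk("i2s interrupts since last inference: %d\n",
	       (int)atomic_set(&i2s_irq_count, 0));
}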

tmichalak commented 2 years ago

@tcal-x how exactly did you measure the inference time to be around 0.11 seconds? We added timestamps around the inference call and got 700 ms with that measurement. BTW, the i2s fifo_depth is 504.
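
For rough context (assuming the standard 16 kHz micro_speech sample rate and one interrupt per full 504-entry FIFO, which may not match the driver's actual watermark): 504 / 16000 ≈ 31.5 ms between interrupts, so on the order of 25 i2s interrupts per 0.8 s inference interval. A much higher measured count would suggest the driver interrupts at a smaller watermark.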

tcal-x commented 2 years ago

@tmichalak we have the micro_speech model in CFU Playground, which also runs on Arty/VexRiscv at 100 MHz, so I ran it there to get my data. Perhaps it is a different version of the model. I am timing just the inference -- not including the conversion from raw audio to spectrograms.

tcal-x commented 2 years ago

I think it is the same model in both this demo and in CFU Playground -- https://raw.githubusercontent.com/tensorflow/tflite-micro/main/tensorflow/lite/micro/examples/micro_speech/images/model_architecture.png .

I suspect that with Zephyr's multithreading, even though the timestamps are placed directly around the inference call, the measurement includes time where the inference thread is preempted to perform audio processing (copying audio samples and, once enough have accumulated, running the FFT).
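
One way to test that theory (a sketch, not something I have run on this demo): wrap just the inference call with Zephyr's k_sched_lock()/k_sched_unlock() so other threads cannot preempt it during the measurement (ISRs still run), and see whether the 700 ms drops. invoke_model() below is a placeholder for the demo's actual tflite-micro call:

#include <zephyr.h>
#include <sys/printk.h>

static void invoke_model(void) { /* placeholder for the tflite-micro Invoke() call */ }

static void time_inference_without_preemption(void)
{
	k_sched_lock();               /* other threads cannot preempt us here */
	int64_t t0 = k_uptime_get();
	invoke_model();
	int64_t t1 = k_uptime_get();
	k_sched_unlock();
	printk("inference with scheduler locked: %d ms\n", (int)(t1 - t0));
}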

I reran the micro_speech example in CFU Playground, and now I'm getting 0.13 seconds for the inference.

tcal-x commented 2 years ago

If the audio processing uses floating point, that might be a factor adding latency (SW emulation of FP instructions), and would be helped by using a VexRiscv variant with an FPU (rv32imf).

tmichalak commented 2 years ago

@tcal-x yeah, that is a good suggestion. We will look into it.