Using a large model on a ESP32 S3 N16R8 (TFMIC-37)

nicklasb commented 2 months ago

Hi,

I am inferring a large model (~2 MB) on the ESP32 S3, and it takes about 60 seconds, while taking about 50 ms on my PC. As the tensor arena seem to have to be about 5 MB to satisfy TF lite, and the RGB image is larger than SRAM (it's a YOLO model, it prefers RGB), obviously everything ends up happening in PSRAM, which slows things down significantly. However I don't think it does so by a factor of thousand even though the ESP32 S3 is obviously slower as well.

What can I do? Can I override the memory allocator to put just some stuff in SRAM, I have about 300K available that isn't being used? I saw that the p4 will basically have SRAM as a cache for a faster PSRAM more or less, is there something similar that could be done in the meanwhile? Or something else?

vikramdattu commented 2 months ago

Hi @nicklasb ESP32-S3 have cache option and you can explore the same. If you have that much internal RAM underutilised, do increase the Data cache and I cache sizes from menuconfig to their max. It should give you a good boost in performance.

As far as moving some of the allocations to internal goes, the current tflite structure is not flexible enough to allow that. You may move some of the critical kernels (from esp-nn) to IRAM to make them always persist in RAM to boost it even further. Please explore esp_att.h for the same. Simply add IRAM_ATTR in front of the function and it will be placed in the IRAM.

5MB tensor Arena requirement is indeed high and I am not sure if this should be the case. I will give the Yolo model a try and do some experiments myself.

nicklasb commented 2 months ago

Hi, thanks for you answer!

Hi @nicklasb ESP32-S3 have cache option and you can explore the same. If you have that much internal RAM underutilised, do increase the Data cache and I cache sizes from menuconfig to their max. It should give you a good boost in performance.

I am afraid that they have not done much difference in my case, also, their maximum values aren't very high, I will revisit them and see if have missed something.

As far as moving some of the allocations to internal goes, the current tflite structure is not flexible enough to allow that.

It sort of is IMO, but only up to a point, the MicroAllocator can be initialized with a non-persistent area that I think could help to some degree. However, I have not been able to make that work, then again C++ semantics is not my home turf (yet) and they seem to use all the tricks. There are some PR:s that touch this over at TFLite, if Espressif put some of their might behind that I think it would make a difference.

You may move some of the critical kernels (from esp-nn) to IRAM to make them always persist in RAM to boost it even further. Please explore esp_att.h for the same. Simply add IRAM_ATTR in front of the function and it will be placed in the IRAM.

Ok, I will take a look.

5MB tensor Arena requirement is indeed high and I am not sure if this should be the case.

It is, but perhaps it is not that strange, the model is > 2 MB. Either way, the size doesn't matter as long it fits well within PSRAM. It is more about that the frequent stuff needs a faster memory.

I will give the Yolo model a try and do some experiments myself.

For your information, when testing out YOLO, its export.py can both quantize to int8 and directly export to .tflite format, I went on a long tangent before realizing that. Also I used YOLOv5.

nicklasb commented 2 months ago

@vikramdattu

Please explore esp_att.h for the same. Simply add IRAM_ATTR in front of the function and it will be placed in the IRAM.

I am afraid that made no discernable difference.

What I am going to do now is clone this library instead of using it as a stand-alone compinent, and focus on if I can override the memory management in some way by using MicroAllocator. Maybe I can help out in some way.

nicklasb commented 2 months ago

@vikramdattu On a side note, I ran the model without the nn function optimization, and that made the inference 3.5 times slower. So good work there. :-)

Wierdly, setting compiler option to (x) Optimize for performance (-O2) makes inference about 20 percent slower. And the inference produced some strange results. Very odd.

vikramdattu commented 2 months ago

Did you set -O2 from menuconfig options or from esp-nn/CMakeLists.txt? Can you please share the observations about strange results? Do you get the bit mismatch?

nicklasb commented 2 months ago

Hi, I set it from menuconfig and basically the box coordinates ended upp smaller, I had some of them going negative. I don't have a working codebase currently (writing a custom MicroAllocator), I can get you more specifics tonight (CET).

nicklasb commented 2 months ago

I couldn't as I am now deep into the custom MicroAllocator/planner, but generally, all values became smaller, should not be too hard to replicate with any YOLO model, there is nothing special with mine. I did have a strange general issue were all boxes end up about 50 pixels too high up (with the optimizations on), likely unrelated to this but just to mention it.

espressif / esp-tflite-micro

Using a large model on a ESP32 S3 N16R8 (TFMIC-37) #94