Hi,

The ESP32-P4 holds a lot of promise; it might even make it possible to run inference on a "real" DL model like YOLO in respectable time. With that in mind, I have some thoughts on what I would like to see happen with this library:
Memory
The most glaring issue on the S3 right now is that, because ML models don't fit in internal RAM, the tensor arena ends up in PSRAM in its entirety.
Initially it looked like the MicroAllocator could be used to place the non-persistent part of the tensor arena in SRAM, but the relevant interface seemed to be protected in some C++ way, and I couldn't make it work when I tried earlier. Even now, it appears to me this must be slowing down inference by several times.
Either way, the memory handling must evolve significantly (I mean, there is TCM; don't let PSRAM spoil that).
Because if this stays the same on the P4, it would nullify most of the benefits, since the core would just wait on PSRAM access all the time. It might end up only about 10% faster, which would be sad.
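To make the split-arena idea concrete, here is roughly what I was trying to do. This is only a sketch, assuming the dual-arena MicroAllocator::Create() overload that newer upstream tflite-micro exposes is reachable (check micro_allocator.h in the version this library vendors); the sizes, the function name, and the resolver wiring are all placeholders of mine:

```cpp
#include "esp_heap_caps.h"
#include "tensorflow/lite/micro/micro_allocator.h"
#include "tensorflow/lite/micro/micro_interpreter.h"

// Placeholder sizes; real values depend on the model.
constexpr size_t kPersistentArenaSize = 256 * 1024;
constexpr size_t kNonPersistentArenaSize = 128 * 1024;

tflite::MicroInterpreter* MakeSplitArenaInterpreter(
    const tflite::Model* model, const tflite::MicroOpResolver& resolver) {
  // Persistent allocations (tensor metadata, the memory plan) are mostly
  // written once at AllocateTensors() time, so PSRAM latency is tolerable.
  uint8_t* persistent = static_cast<uint8_t*>(
      heap_caps_malloc(kPersistentArenaSize, MALLOC_CAP_SPIRAM));
  // Non-persistent allocations hold the activation and scratch buffers that
  // every kernel hammers during Invoke(); keep those in internal SRAM.
  uint8_t* non_persistent = static_cast<uint8_t*>(heap_caps_malloc(
      kNonPersistentArenaSize, MALLOC_CAP_INTERNAL | MALLOC_CAP_8BIT));
  if (persistent == nullptr || non_persistent == nullptr) {
    return nullptr;
  }

  // NOTE: this Create() overload exists in recent tflite-micro; if the
  // vendored copy predates it, this is exactly the interface I couldn't
  // reach from the outside.
  tflite::MicroAllocator* allocator = tflite::MicroAllocator::Create(
      persistent, kPersistentArenaSize,
      non_persistent, kNonPersistentArenaSize);

  auto* interpreter = new tflite::MicroInterpreter(model, resolver, allocator);
  if (interpreter->AllocateTensors() != kTfLiteOk) {
    return nullptr;
  }
  return interpreter;
}
```

The point is simply that the hot buffers land in SRAM while the bulk bookkeeping can stay in PSRAM, instead of the whole arena paying PSRAM latency on every access.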
XAI Extensions
If the memory situation starts moving, the extensions and the FPU(!) would start to really matter. Are there any plans for a custom kernel that exploits those?
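To show what I mean by "exploits those", something along these lines as a TFLM custom op. Everything here beyond the standard Init/Prepare/Eval shape and AddCustom() is my own placeholder (the "FastConv2D" name, the empty bodies), and the registration type has changed names across tflite-micro versions (TfLiteRegistration vs. TFLMRegistration), so treat it strictly as a sketch:

```cpp
#include "tensorflow/lite/micro/micro_common.h"  // TFLMRegistration; header
                                                 // locations vary by version.
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"

namespace {

// No per-instance state needed for this sketch.
void* Init(TfLiteContext* context, const char* buffer, size_t length) {
  return nullptr;
}

TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
  // A real kernel would validate shapes here and request scratch memory in
  // the (fast) arena via context->RequestScratchBufferInArena(...).
  return kTfLiteOk;
}

TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
  // This is where the S3's SIMD extensions or the P4's FPU would be used,
  // e.g. esp-dsp routines or hand-written inner loops in place of the
  // reference C implementation.
  return kTfLiteOk;
}

}  // namespace

TFLMRegistration* Register_FAST_CONV_2D() {
  static TFLMRegistration r = tflite::micro::RegisterOp(Init, Prepare, Eval);
  return &r;
}

// Hooking it up (the custom op name must match the one in the .tflite file):
// tflite::MicroMutableOpResolver<8> resolver;
// resolver.AddCustom("FastConv2D", Register_FAST_CONV_2D());
```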
That was just my 5 ¢, but I feel it is important for your customers to know a little about what is planned down the line. If the ESP32-P4 made the ESP32 seriously usable for a real YOLO model, for example, it would be a game changer, with a lot of use cases where you currently have to involve an RPi 5 or something similar instead. That will do the job, I know, but it is less attractive to me from a product development and deployment standpoint.