STM32CubeIDE Default Code for Nucleo H743ZI Includes Ethernet Bug

RJ-400 commented 3 years ago

CubeMX provides a default Ethernet configuration for Nucleo boards in which the MAC address is written through an unassigned (NULL) pointer. That write corrupts the zero wait state, deterministic instruction tightly coupled memory (ITCM) located at 0x00000000, so anyone locating code there will face a particularly nasty bug just when they're making the already challenging leap of loading code to and running code from ITCM, which in principle doesn't need to be so hard.
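To make the failure mode concrete, here is a minimal host-side sketch of the pattern described above: an init struct that holds the MAC address as a pointer, left unassigned, and then written through. The struct, the function names and the MAC bytes are hypothetical stand-ins for illustration, not the actual HAL definitions or the generated MX_ETH_Init code. On the H7 the unchecked write would land at 0x00000000, the start of ITCM; the sketch adds a check so it can run safely on a PC.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a driver init struct that stores the MAC
   address as a pointer rather than as an array. */
typedef struct {
    uint8_t *mac_addr;   /* NULL unless something assigns it */
} eth_init_sketch_t;

/* Backing storage that the pointer should reference. */
static uint8_t mac_storage[6];

/* Copies a 6-byte MAC address through the pointer.
   Returns 0 on success, -1 if the pointer was never assigned.
   Without this check, the write goes through NULL, which on the H7
   is address 0x00000000, the start of ITCM. */
int eth_set_mac(eth_init_sketch_t *init, const uint8_t mac[6])
{
    if (init == NULL || init->mac_addr == NULL) {
        return -1;
    }
    memcpy(init->mac_addr, mac, 6);
    return 0;
}

/* The fix in this sketch: give the pointer real storage before use. */
void eth_assign_mac_storage(eth_init_sketch_t *init)
{
    init->mac_addr = mac_storage;
}
```

In a real project the equivalent of eth_assign_mac_storage() has to run before the init code writes the address bytes; where exactly that belongs in the generated code is not something this sketch can settle.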

The STM32H7 series hardware is AMAZING. The STM32CubeIDE and the low cost Nucleo boards are invaluable. But it's a long road to getting H7 devices to deliver the performance they're capable of, like 550 Mflops from a sub $3 H730VB device, for example! Rather than wasting nearly all that immense computational power faster than the hardware guys can serve it up, software guys should count clock cycles, look at the often ENORMOUS gaps between what they're getting and what the hardware was brilliantly engineered to achieve, and close the gaps... completely. It takes much more than clock configuration. To get there, you need:

Both ICache and Dcache, understanding their impact on zero wait state performance and single cycle instruction execution as well as the overhead, risks and complications involved in keeping the regular RAM in sync with the cache.
Both ITCM and DTCM (tightly coupled memory), understanding their more deterministic impact on zero wait state performance and single cycle execution
The ability to use linker scripts and startup assembly code to configure memory in a way that locates code, initialized data and uninitialized data in TCM. (AFAIK, Cube doesn't help with this.)
The ability to load code and data from ultra cheap 2MB-256MB external serial flash into internal SRAM, especially zero wait state tightly coupled internal SRAM. (Executing at zero wait states from flash is nice, and ST's approaches to getting zero wait state performance from flash are superb, but if you have a lot of code and you put it all in internal flash, the cost per MB for that flash will be high.)
The ability to move code AT RUNTIME into the RAM section of choice.
A little bit of tight assembly language for DSP. (DSP coding with VMUL.F32 and VADD.F32 is even easier than coding in C.) Make good use of all 32 single precision floating point registers. Make good use of the 64 bit AXI bus. Make good use of the ability to execute loads and stores to and from TCM memory IN PARALLEL with most (but not MAC type) floating point calculations, all in a single clock cycle!
A little bit of tight assembly for moving data.
A clear understanding of how to get single cycle execution of floating point instructions. It appears that on the STM32F7 and H7, due to the nature of the pipeline that enables their incredible performance, you can't use the result of a floating point operation until THE THIRD clock cycle after the result is calculated. This is a pretty serious constraint. You can still get great results under many circumstances if you write your code around it, but you have to know about it first, and it's a major science project to figure out all this vital (but undocumented?) stuff. Similarly, the Cortex-M7 floating point MAC instructions such as vfma.f32 and vmla.f32 CAN deliver single cycle performance, but not when loading/storing in parallel, and apparently, only when executing a sequence of instructions consisting only of MAC instructions. Why aren't these characteristics in Arm's Cortex-M7 Technical Reference Manual? At least some of these characteristics ARE addressed for the M4 in the Cortex-M4 Technical Reference Manual. Are these factors for the M7 based devices addressed anywhere in ST documentation? Shouldn't this information be in the programming manuals? The datasheet? The reference manual? If not, where should it be?
The ability to write assembly code in a separate file, rather than using inline assembly, avoiding all that inline messiness and chaos.
The use of " __attribute__((section(".TCM_Code"), aligned(4))) " or similar specifiers before C function definitions and variable declarations to put code and data where you want it in RAM. Similarly, the use of " .section .TCM_Code.MyASMFunction " above assembly code routines in .s assembly files to designate where in memory they should be located. External assembly functions that are part of a project may be called from C after declaring them with " extern uint32_t MyASMFunction(uint32_t n, float* pbuf, float a, float b, float c, float d) asm ("MyASMFunction"); ".
Knowledge of the simple Arm EABI procedure call standards to efficiently call assembly code from C without wasted call overhead.
DMA! MDMA! It's not a hassle. It's a sublime luxury, and the H7 is DMA nirvana. Cube is great for setting it up, but you need to be looking at the programming manual while you do it, at least the first couple of times.
A cheap Saleae Logic probe for many reasons, including for counting cycles, because the numbers you get from the DWT cycle counter are often inaccurate. (If it's not fundamentally impossible due to the pipeline, we need a cycle counter that provides an accurate count in CubeIDE. Is this an Arm (Cortex) problem, an STM32 problem or a CubeIDE problem?)
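As a rough illustration of the section specifier mentioned in the list above, the sketch below tags a C function for a section named .TCM_Code. The macro name TCM_CODE and the function scale_and_sum are made up for this example. It assumes a GCC-compatible toolchain, a linker script that maps .TCM_Code to ITCM, and startup code that copies the section there; none of that is shown. On non-Arm hosts the macro expands to nothing so the file still builds and runs for testing.

```c
#include <stdint.h>

/* Apply the section attribute only when building for Arm with a
   GCC-compatible compiler; elsewhere the macro expands to nothing. */
#if defined(__GNUC__) && defined(__arm__)
#define TCM_CODE __attribute__((section(".TCM_Code"), aligned(4)))
#else
#define TCM_CODE
#endif

/* On target this function is emitted into .TCM_Code. The linker script
   must map that section to ITCM and the startup code must copy it there
   (both assumed, not shown). */
TCM_CODE float scale_and_sum(const float *buf, uint32_t n, float gain)
{
    float acc = 0.0f;
    for (uint32_t i = 0; i < n; i++) {
        acc += gain * buf[i];
    }
    return acc;
}
```

Callers use the function like any other; only its placement changes. Checking the map file is the simplest way to confirm the function really ended up at an ITCM address.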

Much of this approach to performance is contrary to prevailing mindsets. Here's why the majority is wrong. Most computationally intensive code doesn't come close to achieving a floating point operation per clock cycle. It might be more like 20 clock cycles per floating point operation when all the application overhead is accounted for. Writing assembly language to save a clock cycle per flop under such circumstances only speeds things up by about 5%, and most think it's not worth it. But if you put in the effort to get it down to just two cycles per flop, the effort put into the last wasted clock cycle takes you from 275 Mflops to 550, effectively doubling the computational power of your design. Unlike many things, in this case, the closer you get to perfect, the GREATER the rewards for improvements. This isn't governed by the law of diminishing returns. Clock cycles are in the denominator. This is governed by the law of exploding returns.
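The arithmetic behind that argument can be written down in a few lines. This is only a sketch of the reasoning above; the function names are made up, and the 550 MHz clock is an assumption chosen to match the 550 Mflops figure.

```c
/* Throughput in Mflops for a core clock in MHz and an average cost in
   clock cycles per floating point operation. */
double mflops(double clock_mhz, double cycles_per_flop)
{
    return clock_mhz / cycles_per_flop;
}

/* Fractional speedup obtained by removing one cycle per flop.
   Only meaningful for cycles_per_flop > 1. */
double gain_from_one_cycle(double cycles_per_flop)
{
    return cycles_per_flop / (cycles_per_flop - 1.0) - 1.0;
}
```

With these definitions, gain_from_one_cycle(20.0) is 20/19 - 1, a little over 5%, while gain_from_one_cycle(2.0) is exactly 1.0, a doubling: mflops(550.0, 2.0) is 275 and mflops(550.0, 1.0) is 550. The same one-cycle saving is worth far more the closer you already are to one cycle per flop.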

If you recognize that the difference between a 200MHz device and a 400MHz device matters, you have to recognize that doubling performance "for free" by eliminating the last wasted clock cycle also matters.

But when you're starting out with one of these remarkable STM32 chips on a Nucleo board using MX and the CubeIDE, you're not really pointed in the right direction on the path to getting there, for a number of reasons, and I think they're rooted in the fact that the people developing CubeMX, the IDE, the low level drivers, the HAL and the examples aren't ever trying to make the devices do what they're capable of. If they were, the system and process would unfold differently. It's easy to blame it all on a bloated and buggy HAL that's nonetheless necessary and priceless as a means to timely basic functional success with a device that has 3000+ pages of documentation. Cube provides a starting point from which a streamlined design may be developed. But the knowledge necessary to make the H7, for example, go fast, doesn't unfold from the MX and Cube approach. Mflops should be the top priority in some of the examples, especially for devices like the H7. The "Blinky"s and "Hello world!"s should become "here's how we sum 64k floating point numbers in RAM in as close to one flop per max clock speed clock cycle". Some should use the I and D caches. Some should use TCM. Some should use MDMA. As fast as possible should be the default in many of the examples.
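For a sense of what such a "sum 64k floats" example might look like, here is a minimal C sketch, not a tuned implementation. It uses four independent accumulators so that consecutive additions don't wait on each other's results, which is one common way to code around a multi-cycle result latency. The function name is made up, and no claim is made about the cycle count a given compiler will actually achieve; reaching one flop per cycle would likely still need the TCM placement and hand-written assembly discussed above.

```c
#include <stddef.h>

/* Sums n floats using four independent accumulators so that back-to-back
   additions do not depend on the result of the previous one. */
float sum_f32_4acc(const float *x, size_t n)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;

    /* Main loop: four independent dependency chains. */
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }

    /* Tail: handle the remaining 0 to 3 elements. */
    for (; i < n; i++) {
        a0 += x[i];
    }

    return (a0 + a1) + (a2 + a3);
}
```

Note that splitting the sum across accumulators changes the order of the additions, so for general data the rounding can differ slightly from a plain left-to-right loop.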

What we have now is a bunch of newbies not getting past something akin to 16 MHz blinky with 7 wait states on ultra high performance hardware... (Okay, I'm exaggerating a bit). Even the experts are saying you don't need assembly language, and in the rare instances that you do, inline assembly is all you'll ever need. This is wrong. A platform like the STM32H7 is made for the performance that comes with zero wait states and tight assembly. In their development platforms, it's as if ST is using a Bugatti to pull a wagon on a hay ride.

You get your fourth new Nucleo board weeks after the first Nucleo H7s become available, kind of set it up to go fast in CubeIDE, leaving the default devices on the board enabled, like Ethernet, for example, because you might need it soon, then you start to learn. When you eventually figure out that you should make use of TCM and put code there, starting at 0x00000000 of course, everything comes crashing down, and you think it's because you really might not be smart enough to write code for an STM32H7. Then you'll probably give up. But if you don't, eventually you'll figure out that in its init function, the Ethernet implementation provided by MX is writing the MAC address to an unassigned NULL pointer aimed right at the code you're trying to run in ITCM. You'll have to look at a memory dump to see what's going on. It was a bad default configuration generated by CubeMX, one that guarantees that pretty much everyone trying to make the fast microcontroller on their Nucleo board go fast in a deterministic way will fail. But that's the most important and valuable characteristic of the H7! Not good for sales. Yet years go by and nobody seems to care enough to figure it out, let alone fix it. But they knew there was a problem. That's why they put a warning in MX. Perhaps Nucleo boards and CubeIDE really were just for sales engineers who needed a snazzy way to run blinky. No. They're too valuable. But how could you let this go so long? It must be that virtually no one is making these boards go fast, whether they're customers or ST engineers. If they were, bugs like this would have been uncovered before the board was released, or worst case, shortly afterward, and getting performance from these devices wouldn't require weeks or months of detective work and reverse engineering to understand what ST should have conveyed in a 20 page document. This is a huge waste. Your hardware engineers should be fuming. Your software engineers should be dissatisfied. Your marketing people should be livid, at least the ones who aren't *#@($)! enough to think that crippling high performance platforms enables upselling.

There's a big problem with tech education in America. The gap between textbook knowledge and its application is way too large. People should be able to take a couple of DSP courses and, at home or as part of a lab, sit down with a Nucleo board and run their filters and algorithms at the speed that modern, cost effective hardware is capable of.

The mindset should not be "Start with low performance on high performance hardware, because, for those getting started, high performance doesn't matter. We'll save that for the elite experts who have the decades of experience to navigate our elite sophisticated product."

The mindset should be "Start with high performance on high performance hardware, because that's what we're selling, that's what the customer is buying, and for those getting started (and everyone else), high performance does matter, especially to those who don't want to get left in the dust." Extend the Cube philosophy by adding UI functionality for configuring devices appropriately and optimally for DSP, use of TCM, code location, assembly language, maximum floating point operations per second, etc. If you did that, some of the ST software bugs that cripple new users of your hardware would never occur.

Maybe you could add a performance tab page to Cube. On it, you could configure code location, runtime relocation with function pointers, assembly language stubs, C to assembly interfacing, alignment, linker script generation, startup file generation and whatever else is necessary to get as close as possible to one floating point operation per clock cycle out of both TCM and cache, and provide instruction timing characteristics, constraints and code examples. Another tab page could be added for DSP, setting up biquads, basic IIR and FIR filters and matrix operations using C, assembly and CMSIS approaches.

Hope you can get the Ethernet bug fixed.

ALABSTM commented 3 years ago

Hi @RJ-400,

Thank you for this post. It has been a real pleasure to read it to the last word. We do appreciate such positive feedback about our products.

Regarding the issue you pointed out, we will forward it to our development teams for deeper analysis before having it fixed.

Regarding the other points you mentioned, a list will be forwarded to our technical committee to study the possibility to integrate such enhancements into our software offer.

Thank you again for all that you wrote. I will get back to you as soon as I have any news. In advance, thank you for your patience.

With regards,

thomask77 commented 3 years ago

Most likely it's the same bug that I reported a year ago:

0-Pointer access in generated MX_ETH_Init code

You really have to take this issue seriously and solve it in ST. Otherwise it falls between the cracks of CubeMX and HAL Library.

ALABSTM commented 3 years ago

Hi @RJ-400,

I hope you are fine. Does the workaround suggested by @thomask77 solve the issue? Thank you for your reply.

With regards,

ALABSTM commented 3 years ago

Hi @RJ-400,

I hope you are fine. I also hope you could overcome the issue you were facing. Please allow me to close this issue as no reply from your side has been received for a couple of months. Do not hesitate to reopen it in case you think it is still relevant.

With regards,

thomask77 commented 3 years ago

@ALABSTM Please do not auto-close such bugs before they are fixed.

Just because @RJ-400 and I lost interest in reporting it, doesn't mean the problem is solved for all other people :/

This bug and https://github.com/STMicroelectronics/STM32CubeH7/issues/33 are serious defects that must be fixed by you.

STMicroelectronics / STM32CubeH7

STM32CubeIDE Default Code for Nucleo H743ZI Includes Ethernet Bug #113