STMicroelectronics / STM32CubeH7

STM32Cube MCU Full Package for the STM32H7 series - (HAL + LL Drivers, CMSIS Core, CMSIS Device, MW libraries plus a set of Projects running on all boards provided by ST (Nucleo, Evaluation and Discovery Kits))
https://www.st.com/en/embedded-software/stm32cubeh7.html

STM32CubeIDE Default Code for Nucleo H743ZI Includes Ethernet Bug #113

Closed RJ-400 closed 3 years ago

RJ-400 commented 3 years ago

CubeMX provides a default Ethernet configuration for Nucleo boards that writes the MAC address through an unassigned NULL pointer. That write corrupts the zero wait state, deterministic, tightly coupled instruction memory (ITCM) located at 0x00000000, so anyone locating code there faces a particularly nasty bug just as they're making the already challenging leap of loading code into ITCM and running it from there, which in principle doesn't need to be so hard.
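To make the failure mode concrete, here is a minimal host-side sketch, assuming the init structure stores the MAC address by pointer and the generated code writes through that pointer before anything assigns it. The type `eth_init_sketch_t`, the functions `eth_set_mac` and `eth_init_fixed`, and the buffer `mac_storage` are hypothetical names for illustration only, not the HAL's API or the actual generated code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an Ethernet init struct that holds the MAC
 * address by pointer rather than by value. Not the real HAL type. */
typedef struct {
    uint8_t *mac_addr;   /* stays NULL until the caller assigns a buffer */
} eth_init_sketch_t;

/* Backing storage the pointer should be aimed at before any write. */
static uint8_t mac_storage[6];

/* Copies a 6-byte MAC address through cfg->mac_addr.
 * Returns 0 on success, -1 if the pointer was never assigned.
 * Without this check, the copy would target address 0, which on the
 * H7 is the start of ITCM. */
int eth_set_mac(eth_init_sketch_t *cfg, const uint8_t mac[6])
{
    if (cfg == NULL || cfg->mac_addr == NULL) {
        return -1;
    }
    memcpy(cfg->mac_addr, mac, 6);
    return 0;
}

/* Assigns the backing buffer first, then performs the write. */
int eth_init_fixed(eth_init_sketch_t *cfg, const uint8_t mac[6])
{
    if (cfg == NULL) {
        return -1;
    }
    cfg->mac_addr = mac_storage;
    return eth_set_mac(cfg, mac);
}
```

A zero-initialized `eth_init_sketch_t` passed straight to `eth_set_mac` is rejected, while `eth_init_fixed` succeeds because the pointer is aimed at real storage first. On a desktop OS the unguarded write would usually fault immediately; on the H7, address 0 is writable ITCM, so the write can land silently on whatever code was placed there.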

The STM32H7 series hardware is AMAZING. The STM32CubeIDE and the low cost Nucleo boards are invaluable. But it's a long road to getting H7 devices to deliver the performance they're capable of, like 550 Mflops from a sub $3 H730VB device, for example! Rather than wasting nearly all that immense computational power faster than the hardware guys can serve it up, software guys should count clock cycles, look at the often ENORMOUS gaps between what they're getting and what the hardware was brilliantly engineered to achieve, and close the gaps... completely. It takes much more than clock configuration. To get there, you need:

Much of this approach to performance is contrary to prevailing mindsets. Here's why the majority is wrong. Most computationally intensive code doesn't come close to achieving a floating point operation per clock cycle. It might be more like 20 clock cycles per floating point operation when all the application overhead is accounted for. Writing assembly language to save a clock cycle per flop under such circumstances only speeds things up by about 5%, and most think it's not worth it. But if you put in the effort to get it down to just two cycles per flop, the effort put into the last wasted clock cycle takes you from 275 Mflops to 550, effectively doubling the computational power of your design. Unlike many things, in this case, the closer you get to perfect, the GREATER the rewards for improvements. This isn't governed by the law of diminishing returns. Clock cycles are in the denominator. This is governed by the law of exploding returns.
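The arithmetic behind "exploding returns" can be checked with a small sketch. It assumes a 550 MHz core clock (the figure implied by 550 Mflops at one flop per cycle) and treats throughput as simply clock rate divided by cycles per flop. The function names `mflops` and `gain_from_one_cycle` are made up for this illustration.

```c
/* Throughput in Mflops for a core clocked at clock_mhz that spends
 * cycles_per_flop clock cycles on each floating point operation. */
double mflops(double clock_mhz, double cycles_per_flop)
{
    return clock_mhz / cycles_per_flop;
}

/* Fractional speedup from removing one cycle per flop:
 * going from n cycles to n - 1 cycles gives n / (n - 1) - 1.
 * Only meaningful for cycles_per_flop greater than 1. */
double gain_from_one_cycle(double cycles_per_flop)
{
    return cycles_per_flop / (cycles_per_flop - 1.0) - 1.0;
}
```

Under these assumptions, `gain_from_one_cycle(20.0)` is 1/19, a little over 5%, while `gain_from_one_cycle(2.0)` is 1.0, a full doubling: `mflops(550.0, 2.0)` is 275 and `mflops(550.0, 1.0)` is 550. The same one-cycle saving is worth more the fewer cycles remain, because the cycle count is in the denominator.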

If you recognize that the difference between a 200MHz device and a 400MHz device matters, you have to recognize that doubling performance "for free" by eliminating the last wasted clock cycle also matters.

But when you're starting out with one of these remarkable STM32 chips on a Nucleo board using MX and the CubeIDE, you're not really pointed in the right direction on the path to getting there, for a number of reasons, and I think they're rooted in the fact that the people developing CubeMX, the IDE, the low level drivers, the HAL and the examples aren't ever trying to make the devices do what they're capable of. If they were, the system and process would unfold differently. It's easy to blame it all on a bloated and buggy HAL that's nonetheless necessary and priceless as a means to timely basic functional success with a device that has 3000+ pages of documentation. Cube provides a starting point from which a streamlined design may be developed. But the knowledge necessary to make the H7, for example, go fast, doesn't unfold from the MX and Cube approach. Mflops should be the top priority in some of the examples, especially for devices like the H7. The "Blinky"s and "Hello world!"s should become "here's how we sum 64k floating point numbers in RAM in as close to one flop per max clock speed clock cycle". Some should use the I and D caches. Some should use TCM. Some should use MDMA. As fast as possible should be the default in many of the examples.
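As a rough idea of what such a "sum 64k floats" example might start from, here is a portable C sketch. It only shows the loop structure: four independent accumulators, so consecutive additions don't wait on each other. It makes no claim about reaching one flop per cycle. That would depend on the compiler, on where the data and code live (cache, TCM), and possibly on hand-written assembly, none of which is shown or measured here. The names `SAMPLE_COUNT` and `sum_f32` are illustrative.

```c
#include <stddef.h>

#define SAMPLE_COUNT 65536u   /* "64k" samples, as in the post */

/* Sums n floats using four independent accumulators so that successive
 * additions do not form a single dependency chain. Note that this
 * changes the order of the additions, so rounding may differ slightly
 * from a simple left-to-right sum. */
float sum_f32(const float *x, size_t n)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;

    /* Main loop: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }

    /* Tail: up to three leftover elements. */
    for (; i < n; ++i) {
        a0 += x[i];
    }

    return (a0 + a1) + (a2 + a3);
}
```

Counting cycles around a call such as `sum_f32(buf, SAMPLE_COUNT)` with the buffer in different memories would then show how far a given setup is from the one-flop-per-cycle target.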

What we have now is a bunch of newbies not getting past something akin to 16 MHz blinky with 7 wait states on ultra high performance hardware... (Okay, I'm exaggerating a bit). Even the experts are saying you don't need assembly language, and in the rare instances that you do, inline assembly is all you'll ever need. This is wrong. A platform like the STM32H7 is made for the performance that comes with zero wait states and tight assembly. In their development platforms, it's as if ST is using a Bugatti to pull a wagon on a hay ride.

You get your fourth new Nucleo board weeks after the first Nucleo H7s become available, kind of set it up to go fast in CubeIDE, leaving the default devices on the board enabled, like Ethernet, for example, because you might need it soon, then you start to learn. When you eventually figure out that you should make use of TCM and put code there, starting at 0x00000000 of course, everything comes crashing down, and you think it's because you really might not be smart enough to write code for an STM32H7. Then you'll probably give up. But if you don't, eventually you'll figure out that in its init function, the Ethernet implementation provided by MX is writing the MAC address to an unassigned NULL pointer aimed right at the code you're trying to run in ITCM. You'll have to look at a memory dump to see what's going on.

It was a bad default configuration generated by CubeMX, one that guarantees that pretty much everyone trying to make the fast microcontroller on their Nucleo board go fast in a deterministic way will fail. But that's the most important and valuable characteristic of the H7! Not good for sales. Yet years go by and nobody seems to care enough to figure it out, let alone fix it. But they knew there was a problem. That's why they put a warning in MX. Perhaps Nucleo boards and CubeIDE really were just for sales engineers who needed a snazzy way to run blinky. No. They're too valuable. But how could you let this go so long? It must be that virtually no one is making these boards go fast, whether they're customers or ST engineers. If they were, bugs like this would have been uncovered before the board was released, or worst case, shortly afterward, and getting performance from these devices wouldn't require weeks or months of detective work and reverse engineering to understand what ST should have conveyed in a 20 page document. This is a huge waste. Your hardware engineers should be fuming. Your software engineers should be dissatisfied. Your marketing people should be livid, at least the ones who aren't *#@($)! enough to think that crippling high performance platforms enables upselling.

There's a big problem with tech education in America. The gap between textbook knowledge and its application is way too large. People should be able to take a couple of DSP courses and, at home or as part of a lab, sit down with a Nucleo board and run their filters and algorithms at the speed that modern, cost effective hardware is capable of.

The mindset should not be "Start with low performance on high performance hardware, because, for those getting started, high performance doesn't matter. We'll save that for the elite experts who have the decades of experience to navigate our elite sophisticated product."

The mindset should be "Start with high performance on high performance hardware, because that's what we're selling, that's what the customer is buying, and for those getting started (and everyone else), high performance does matter, especially to those who don't want to get left in the dust." Extend the Cube philosophy by adding UI functionality for configuring devices appropriately and optimally for DSP, use of TCM, code location, assembly language, maximum floating point operations per second, etc. If you did that, some of the ST software bugs that cripple new users of your hardware would never occur.

Maybe you could add a performance tab page to Cube. On it, you could configure code location, runtime relocation with function pointers, assembly language stubs, C to assembly interfacing, alignment, linker script generation, startup file generation and whatever else is necessary to get as close as possible to one floating point operation per clock cycle out of both TCM and cache, and provide instruction timing characteristics, constraints and code examples. Another tab page could be added for DSP, setting up biquads, basic IIR and FIR filters and matrix operations using C, assembly and CMSIS approaches.

Hope you can get the Ethernet bug fixed.

ALABSTM commented 3 years ago

Hi @RJ-400,

Thank you for this post. It has been a real pleasure to read it to the last word. We do appreciate such positive feedback about our products.

Regarding the issue you pointed out, we will forward it to our development teams for deeper analysis before having it fixed.

Regarding the other points you mentioned, a list will be forwarded to our technical committee to study the possibility of integrating such enhancements into our software offer.

Thank you again for everything you wrote. I will get back to you as soon as I have any news. In advance, thank you for your patience.

With regards,

thomask77 commented 3 years ago

Most likely it's the same bug that I reported a year ago:

0-Pointer access in generated MX_ETH_Init code

You really have to take this issue seriously and solve it in ST. Otherwise it falls between the cracks of CubeMX and HAL Library.

ALABSTM commented 3 years ago

Hi @RJ-400,

I hope you are fine. Does the workaround suggested by @thomask77 solve the issue? Thank you for your reply.

With regards,

ALABSTM commented 3 years ago

Hi @RJ-400,

I hope you are fine. I also hope you could overcome the issue you were facing. Please allow me to close this issue, as no reply from your side has been received for a couple of months. Do not hesitate to reopen it in case you think it is still relevant.

With regards,

thomask77 commented 3 years ago

@ALABSTM Please do not auto-close such bugs before they are fixed.

Just because @RJ-400 and I lost interest in reporting it, doesn't mean the problem is solved for all other people :/

This bug and https://github.com/STMicroelectronics/STM32CubeH7/issues/33 are serious defects that must be fixed by you.