ARM-software / CMSIS_5

CMSIS Version 5 Development Repository
http://arm-software.github.io/CMSIS_5/index.html
Apache License 2.0
1.33k stars 1.08k forks source link

CMSIS DAP delay is optimized out by gcc with aggressive optimization options #389

Open hugeone opened 6 years ago

hugeone commented 6 years ago

In fuction PIN_DELAY_SLOW:

__STATIC_FORCEINLINE void PIN_DELAY_SLOW (uint32_t delay) {
  uint32_t count;

  count = delay ;
  while (--count);
}

the loop is getting optimized out with -O3 option.

Proposed solution: Make the count volatile or force the compiler to load the parameter into the register (which is my preferred option):

__STATIC_FORCEINLINE void PIN_DELAY_SLOW (uint32_t delay) {
  uint32_t count;

  count = delay ;
  while (--count)  __asm__("" : "=r" (delay));
}

which generates very efficient code:

288         while (--count)  __asm__("" : "=r" (delay));
0800c5c8:   beq.w   0x800cac0 <SWD_TransferSlow+1296>
0800c5cc:   mov     r2, r3
0800c5ce:   subs    r2, #1
0800c5d0:   bne.n   0x800c5ce <SWD_TransferSlow+30>
JonatanAntoni commented 6 years ago

Hi @hugeone,

Thanks for pointing this out. It will take us some time to review your proposals due to vacation period.

My first guess is that adding volatile is better portable than mixing in inline assembly. The loop is only for having a certain delay hence efficient code should not be a major concern here. Any further thoughts?

Cheers, Jonatan

hugeone commented 6 years ago

Hi @JonatanAntoni , volatile will add some store instructions as well so the delay will have to be recalculated as well

JonatanAntoni commented 6 years ago

@hugeone, sure. The correct solution depends on the actual accuracy needed. We need to analyze that in detail.

RadioOperator commented 4 years ago

Hi, @hugeone, Your solution solved my AC6[-O0] problem in the CMSIS-DAP_for_STLINK-V3MINI. (Tested the "volatile" only). @JonatanAntoni , hope your improvement asap. Thanks.

JonatanAntoni commented 4 years ago

Hi @RadioOperator,

unfortunately we don't have time to address this on the short run. Please stick to the local fix that looks most appropriate for you for the time being.

Cheers, Jonatan

RobertRostohar commented 4 years ago

Unfortunately adding volatileintroduces overhead instructions that will effect the maximum pin toggling frequency.

Therefore the delay loop should be implemented in assembly.

RobertRostohar commented 4 years ago

The updated PIN_DELAY_MACRO should now not be optimized out anymore (tested with ARMClang and GCC).

RadioOperator commented 4 years ago

@RobertRostohar Hi, I tested your new code under MDK+AC6 [-Ofast / -Os / -Oz]. The PIN delay timing is OK. But, the Delayms(uint32_t delay) function in DAP.c: Line-148, not stable. When I call Delayms(100), I got 66ms in [-Os, SW mode], and 33ms in all other cases. I do not know if this needs to improve, or just leave it as-is. Thanks.

RobertRostohar commented 4 years ago

Are you sure? I have tried with AC6 and I get the same code generated for the assembly loop regardless of optimization. I get something like this (CPU_CLOCK=180M):

   148: void Delayms(uint32_t delay) { 
0x14000410 F64E2160  MOVW          r1,#0xEA60
   149:   delay *= ((CPU_CLOCK/1000U) + (DELAY_SLOW_CYCLES-1U)) / DELAY_SLOW_CYCLES; 
   150:   PIN_DELAY_SLOW(delay); 
0x14000414 4348      MULS          r0,r1,r0
   289:   __ASM volatile ( 
0x14000416 1E40      SUBS          r0,r0,#1
0x14000418 D1FD      BNE           0x14000416
   151: } 
0x1400041A 4770      BX            lr
RadioOperator commented 4 years ago

@RobertRostohar I got almost same as yours. In the 1st line: MOVW r1,#0xEA60, 0xEA60 = 6000 = 180M/1000/3 Does this make my Delayms(100) delays 33ms, not 100ms?

------------------my CPU_CLOCK is 216M----------:

   148: void Delayms(uint32_t delay) { 
0x08001DAA F6411140  MOVW          r1,#0x1940       //0x00011940 = 72000 = 216M/1000/3
0x08001DAE F2C00101  MOVT          r1,#0x01
   149:   delay *= ((CPU_CLOCK/1000U) + (DELAY_SLOW_CYCLES-1U)) / DELAY_SLOW_CYCLES; 
   150:   PIN_DELAY_SLOW(delay); 
0x08001DB2 4348      MULS          r0,r1,r0
   289:   __ASM volatile ( 
0x08001DB4 1E40      SUBS          r0,r0,#1
0x08001DB6 D1FD      BNE           0x08001DB4
   151: } 
0x08001DB8 4770      BX            lr
RobertRostohar commented 4 years ago

Your code looks as expected. The inner loop will take 216k cycles (72000*3) for each millisecond specified with delay parameter. Delayms(100) should take around 21.6M cycles which should be 100ms when CPU runs at 216MHz.

RadioOperator commented 4 years ago

@RobertRostohar Hi, but my problem is : When I call Delayms(100), I get 33ms delay time actually, not 100ms. I don't know why.

RobertRostohar commented 4 years ago

Try to debug it and check also how much time is a single cycle on your hardware. You might have additional latency for opcode fetch ...

RadioOperator commented 4 years ago

@RobertRostohar I have done many test. The Delayms(x) accuracy, related AC6 [-opt] and PIN_DELAY_SLOW base. Delayms(100), I got two data (66ms and 33ms) only.

For PIN_DELAY_SLOW, I got: When Keil set SW clock = 10MHz, PIN_DELAY_SLOW ~ 50ns, SWCLK clock ~8MHz. When Keil set SW clock = 1MHz, PIN_DELAY_SLOW ~ 200ns. SWCLK clock ~2.3MHz.

Could you tell me if this is in specification, or not.

Thanks.

RobertRostohar commented 4 years ago

I'm not sure why you are mixing the Delayms function with pin toggling frequency of SWCLK.

Delayms function is intended as a helper for the main application when there is no RTOS used (for example controlling the LED blink sequence at startup). It does not effect the frequency of SW/JTAG pin toggling. That is controlled with the PIN_DELAY_SLOW/FAST macros.

The configured SW clock from the debugger is the maximum frequency allowed. The actual frequency can be lower. Typically the actual frequency is lower due to rounding to nearest clock divider and overhead of pin toggling (affected at higher frequencies).

When Keil set SW clock = 10MHz, PIN_DELAY_SLOW ~ 50ns, SWCLK clock ~8MHz.

Looks OK.

When Keil set SW clock = 1MHz, PIN_DELAY_SLOW ~ 200ns. SWCLK clock ~2.3MHz.

Doesn't look right. It should be: PIN_DELAY_SLOW >= 500ns, SWCLK <= 1MHz.

You should debug the function DAP_SWJ_Clock when debugger on the PC sets the SW clock to 1MHz and examine the delay calculated and stored in DAP_Data.clock_delay. This value should be 36 when using a 216MHz CPU clock.

RadioOperator commented 4 years ago

@RobertRostohar I'm reporting the behaviors of the current full package code, especially included your new improvement on PIN_DELAY_SLOW. My test code is using the original, no any modification on the delay timing parts. I'll list out the testing results for your reference later. Thanks.

RadioOperator commented 4 years ago

@RobertRostohar Here it is: (your new commit for PIN_DELAY_SLOW added in the test code) PIN_DELAY_SLOW

The data is measured by an oscilloscope, so just for reference.

Two problem:

  1. SWCLK frequency not following the Keil setup, please confirm it on your side.
  2. Delayms() function. Thanks.
RobertRostohar commented 4 years ago

I’m not sure how you measure Delayms(100) with an oscilloscope. As already mentioned the Delayms function is not used in the SW/JTAG pin toggling.

Also you should not test optimization -O0 since there will be too much overhead in the pin toggling. Normally you would use a high optimization.

Anyway it seems that you have strange SW/JTAG clocks. Actual clock should always be lower than the specified clock from the PC side.

Either you have an issue in the DAP firmware port or the PC debugger sends the wrong commands.

Are you using MDK and CMSIS-DAP Debugger on the PC side? Are you using the original DAP Firmware project from this GitHub repo? Did you port the DAP Firmware correctly? Did you update the DAP.h to use the updated macros?

RadioOperator commented 4 years ago

@RobertRostohar Hi, My code uses the Delayms(100) to trigger nRESET pin, so I noticed it is not 100ms, anyway, ignore this for no one using it.

SW/JTAG clocks do not make me CMSIS-DAP functions failure, even it's fast more actually in lower speed cases, the DAP user can not know it. If this is a "system" problem, you decide to be improved or not.

The test environment: I'm using Win64, PC, MDK, CMSIS-DAP debugger, Target board is STM32F103 / STM32F407. CMSIS-DAP hardware + Firmware, please refer my repo: CMSIS-DAP_for_STLINK-V3MINI.

My DAP porting, using original files from MDK installation, replaced your new DAP.h file for testing. (I have not update it into my repo). The firmware structures is almost same as LPC-Link-II example.

Thanks.

RobertRostohar commented 4 years ago

I have ported your project on a slightly different board which has the same processor and measured similar clock frequencies as you have reported.

It tuns out that you are using a Cortex-M7 processor which has an in-order super-scalar pipeline that means many instructions can be dual-issued.

This effects also the PIN_DELAY_SLOW macro, especially how many cycles are required for one iteration. By default 3 cycles are expected but in case of Cortex-M7 this value should be 1.

You should define DELAY_SLOW_CYCLES as 1 in your project which should result in clocks being lower than the max specified frequency and resolve the issue that you have reported.

RobertRostohar commented 4 years ago

You should not use Arm/Keil registered USB Vendor ID (0xC251) in you project! You need to obtain your own.

RadioOperator commented 4 years ago

@RobertRostohar Hi, thanks, I'll test your suggestion. Regarding the USB vendor ID, I do not know how to do. Should I use an auto-generated and pre-defined USB-ID from ST software evaluation kit? If I use any other ID, maybe cause a driver problem to Host program. Btw, my "project" is a non-profit home DIY software only. It's a CMSIS-DAP implement sample as LPC-Link-II.

RobertRostohar commented 4 years ago

USB Vendor IDs are unique and registered to a company. They can be obtained from USB.org: https://www.usb.org/getting-vendor-id

flit commented 4 years ago

@RadioOperator Debuggers detect CMSIS-DAP devices by matching against the string "CMSIS-DAP" in the device's product name or interface name string, not VID/PID combinations.

Some silicon vendors have a program where customers can allocate a product ID and then use the vendor's USB vendor ID. I do not know if ST provides such a program.

RadioOperator commented 4 years ago

@flit Thanks, I can change the VID/PID to ST software kit default value later.

RadioOperator commented 4 years ago

@RobertRostohar Hi, I'm still testing it. Based your code:

__STATIC_FORCEINLINE void PIN_DELAY_SLOW (uint32_t delay) {
  __ASM volatile (
  ".syntax unified\n"
  "0:\n\t"
    "subs %0,%0,#1\n\t"
    "bne  0b\n"
  : "+l" (delay) : : "cc"
  );
}

After AC6[-Ofast] passed, the generated ASM instructions is unstable, like:

.Ltmp14:
    subs    r7, r7, #1
    bne .Ltmp14

here, r7, not fixed "r7" in different "inline" parts in sw_dp/jtag_dp files, could be r0,r1,r2...... I need it using "r0" only, because R0 is more faster than R1 (a half of the mach-cycles). I'm not a good asm-lang coder, please help me or give me some tips. Thanks.

RobertRostohar commented 4 years ago

the generated ASM instructions is unstable

No. This is expected. The compiler decides which register to use when the function is inlined (based on other code and optimizations).

because R0 is more faster than R1

No. Same subs instruction which only uses a different register (R0 or R1 or R7) will take the same time.

I would not recommend going on ASM level and rather leave this to the compiler.

You can explore this on your own but this is out of scope for the CMSIS-DAP project.

RadioOperator commented 4 years ago

@RobertRostohar Thanks and understood. I'm struggling on the maximum link-speed, still not put my 216MHz MCU into the limits.

RadioOperator commented 4 years ago

@RobertRostohar , Hi, finally I got the max SWD clock 18MHz, using configs: DELAY_SLOW_CYCLES = 1; DAP_Data.clock_delay = 1 (minimum, SWD mode) added 2 NOP in PIN_DELAY_SLOW(...):

  __ASM volatile (
  ".syntax unified\n"
  "0:\n\t"
    "subs %0,%0,#1\n\t"
    "bne  0b\n\t"
    "nop\n\t"
    "nop\n"
  : "+l" (delay) : : "cc"
  );

Thanks.