Traumflug / Teacup_Firmware

Firmware for RepRap and other 3D printers
http://forums.reprap.org/read.php?147
GNU General Public License v2.0
312 stars 199 forks source link

(u)int_fast(8/16)_t can bring more performance for ARM #262

Open Wurstnase opened 7 years ago

Wurstnase commented 7 years ago

Hi all,

did you ever think about using uint_fast8_t? Any drawbacks? I just replaced only the variables which are going in the stepper interrupt. On the max time this becomes a performance boost of ~3.5% (280 ticks vs 270 ticks).

https://gist.github.com/Wurstnase/5bc0bcbf08c7a40acded14bf30610704

new:
min:133 max:270 avg:154 n:19480
old:
min:128 max:280 avg:159 n:19480
phord commented 7 years ago

Interesting. Drawbacks would be that you have to be careful to manage overflow effects. For example, if I have

uint8_t fn(uint8_t fp) {
    static uint8_t remainder;
    remainder += fp;
    ...
    return remainder;
}

Then the range of remainder is 0-255. If you change this to uint_fast8_t, then the remainder range is platform-specific. We would like it to still be 0-255 so it wraps around when you exceed 255. But on a {16,32,64}-bit system it will be much bigger and the overflow we depended on before would not happen. Maybe it would be wrapped ok in the return statement (if the return type is uint8_t still), but any use inside the function might not.

It's probably safe to use in specific instances if you carefully consider these kinds of limits. I'm surprised the compiler doesn't already know how to optimize around this sort of thing. The uint_fast8_t type would not have as many language-imposed runtime checks (wrapping, for example) that the compiler had to take care of, I guess. Have you compared the assembly output of the two versions?

Wurstnase commented 7 years ago

Good idea to check the assembly.

The ARM has an instruction for byte, half-word and word. So with uint8_t it uses ldrb and with uint_fast8_t it uses ldr. ldrb is extended by zeros. In the end written again with strb.

I also checked some other test-gcode and it looks like, that the benefit of the 10 cycles was just in my short example. With longer gcode the improvement is negligible or negative.

I will also check later some other points of the code.

Wurstnase commented 7 years ago

Not worthwhile...

Wurstnase commented 7 years ago

Not worthwhile... Not true...

Any time the ARM need the number for calculations this will decrease the flash size.

For example:

uint16_t analog_read(uint8_t index) {
  if (NUM_TEMP_SENSORS > 0) {
    uint16_t r = 0;
    uint16_t temp;
    uint16_t max_temp = 0;
    uint16_t min_temp = 0xffff;

    for (uint8_t i = 0; i < OVERSAMPLE; i++) {
      temp = adc_buffer[i][index];
      max_temp = max_temp > temp ? max_temp : temp;
      min_temp = min_temp < temp ? min_temp : temp;
      r += temp;
    }

    r = (r - max_temp - min_temp) / (OVERSAMPLE - 2);
    return r;
  } else {
    return 0;
  }
}

This is the current function on stm32 to get the last analog read. Changing the for loop from uint8_t to uint_fast8_t or uint32_t will change nothing. Assembly is absolutely the same.

Different for r and temp. 8 byte more flash for the 32bit-version. (Without -flto there is not flash difference. But the analog_read has one additional command !?! :confused: )

Completely different for max_temp and min_temp. 4 bytes less with the 32bit version with -flto. (16 bytes without LTO). Ram size is the same.

Also changing clock-variables to uint_fast8_t will decrease flash-size by 16 bytes, by changing clock_counter/flag_XXms. It increase the ram by 16 bytes in this case.

So changing some part of the code could increase the performance. Especially code only for the ARM could easily changed to 32bit, when it helps.

Wurstnase commented 7 years ago

https://github.com/Traumflug/Teacup_Firmware/commit/9fbfee56b0ca

Saves 88 bytes of flash and costs 28 bytes of RAM.