Open Wurstnase opened 7 years ago
Interesting. Drawbacks would be that you have to be careful to manage overflow effects. For example, if I have
uint8_t fn(uint8_t fp) {
static uint8_t remainder;
remainder += fp;
...
return remainder;
}
Then the range of remainder
is 0-255. If you change this to uint_fast8_t, then the remainder
range is platform-specific. We would like it to still be 0-255 so it wraps around when you exceed 255. But on a {16,32,64}-bit system it will be much bigger and the overflow we depended on before would not happen. Maybe it would be wrapped ok in the return statement (if the return type is uint8_t
still), but any use inside the function might not.
It's probably safe to use in specific instances if you carefully consider these kinds of limits. I'm surprised the compiler doesn't already know how to optimize around this sort of thing. The uint_fast8_t
type would not have as many language-imposed runtime checks (wrapping, for example) that the compiler had to take care of, I guess. Have you compared the assembly output of the two versions?
Good idea to check the assembly.
The ARM has an instruction for byte, half-word and word. So with uint8_t
it uses ldrb
and with uint_fast8_t
it uses ldr
. ldrb
is extended by zeros. In the end written again with strb
.
I also checked some other test-gcode and it looks like, that the benefit of the 10 cycles was just in my short example. With longer gcode the improvement is negligible or negative.
I will also check later some other points of the code.
Not worthwhile...
Not worthwhile... Not true...
Any time the ARM need the number for calculations this will decrease the flash size.
For example:
uint16_t analog_read(uint8_t index) {
if (NUM_TEMP_SENSORS > 0) {
uint16_t r = 0;
uint16_t temp;
uint16_t max_temp = 0;
uint16_t min_temp = 0xffff;
for (uint8_t i = 0; i < OVERSAMPLE; i++) {
temp = adc_buffer[i][index];
max_temp = max_temp > temp ? max_temp : temp;
min_temp = min_temp < temp ? min_temp : temp;
r += temp;
}
r = (r - max_temp - min_temp) / (OVERSAMPLE - 2);
return r;
} else {
return 0;
}
}
This is the current function on stm32 to get the last analog read. Changing the for loop from uint8_t to uint_fast8_t or uint32_t will change nothing. Assembly is absolutely the same.
Different for r
and temp
. 8 byte more flash for the 32bit-version. (Without -flto there is not flash difference. But the analog_read has one additional command !?! :confused: )
Completely different for max_temp and min_temp. 4 bytes less with the 32bit version with -flto. (16 bytes without LTO). Ram size is the same.
Also changing clock-variables to uint_fast8_t will decrease flash-size by 16 bytes, by changing clock_counter/flag_XXms. It increase the ram by 16 bytes in this case.
So changing some part of the code could increase the performance. Especially code only for the ARM could easily changed to 32bit, when it helps.
https://github.com/Traumflug/Teacup_Firmware/commit/9fbfee56b0ca
Saves 88 bytes of flash and costs 28 bytes of RAM.
Hi all,
did you ever think about using uint_fast8_t? Any drawbacks? I just replaced only the variables which are going in the stepper interrupt. On the max time this becomes a performance boost of ~3.5% (280 ticks vs 270 ticks).
https://gist.github.com/Wurstnase/5bc0bcbf08c7a40acded14bf30610704