Closed nt314p closed 1 week ago
reference = 0.4.0 version no loop unroll
Performance - time in us
write: 15.34
write: 29.43
Delta: 14.10
Proposed optimization - only added in the not unrolled version for quick test.
Performance - time in us
write: 15.08
write: 28.92
Delta: 13.84
shorter code is equally fast.
// process one bit
uint8_t d = *localDataOutRegister | outmask1; // always set bit
asm volatile("" : "+r" (d)); // prevent compiler optimization. generates no assembly
if ((value & m) == 0) d &= outmask2; // only reset if needed
*localDataOutRegister = d;
@nt314p
(still not unrolled loop as that is easiest to patch)
Assuming that dataOutRegister does not change
uint8_t d0 = *localDataOutRegister & outmask2; // cache 0
uint8_t d1 = d0 | outmask1; // cache 1
for (uint8_t m = 1; m > 0; m <<= 1)
{
// process one bit
uint8_t d = d1; // always set bit
asm volatile("" : "+r" (d)); // prevent compiler optimization. generates no assembly
if ((value & m) == 0) d = d0; // only reset if needed
*localDataOutRegister = d;
// if ((value & m) == 0) *localDataOutRegister &= outmask2;
// else *localDataOutRegister |= outmask1;
uint8_t r = *localClockRegister;
*localClockRegister = r | cbmask1; // set one bit
*localClockRegister = r; // reset it
}
==>
Performance - time in us
write: 14.33
write: 27.42
Delta: 13.09
looks like an improvement, can you shoot a hole in it or is it stable?
@nt314p
A step back, no asm required.
uint8_t d0 = *localDataOutRegister & outmask2; // cache 0
uint8_t d1 = d0 | outmask1; // cache 1
for (uint8_t m = 1; m > 0; m <<= 1)
{
// process one bit
if ((value & m) == 0) *localDataOutRegister = d0;
else *localDataOutRegister = d1;
// if ((value & m) == 0) *localDataOutRegister &= outmask2;
// else *localDataOutRegister |= outmask1;
uint8_t r = *localClockRegister;
*localClockRegister = r | cbmask1; // set one bit
*localClockRegister = r; // reset it
}
==>
Performance - time in us
write: 14.08
write: 26.91
Delta: 12.83
From 14.10 => 12.83 means the unrolled loop could go from 11.51 => 10.2 something.
Opinion?
Oh yep shortening the code like that works too. I can rework my assembly to take into account your proposed optimization, as it is currently clashing with it I believe.
One thing would be to initialize the clock to 0. This is so that if the data and clock lines are on the same register, when you cache the data you also have clock low.
The clock is set low in the constructor. It is changed to high and back to low after data is set. So the start condition of the loop for relevant pins is restored correctly.(Unless I missed something)
If time permits I will check with scope again.
FYI Had a look on the scope this morning and the signals looked similar to those in #15.
@nt314p
Hi Nick,
Found some time to create a develop branch + PR for the last optimization. If you have time, please have a look at the develop branch.
Thanks!
One other optimization you can do is eliminating some expensive jumps when setting the data register. This does require some inline assembly, but only to prevent the compiler from optimizing it out. The assembly forces the compiler to generate an alternative optimization (a skip branch instruction), which is actually faster.