Replace jump branch with skip branch optimization

nt314p commented 1 month ago

One other optimization you can do is eliminating some expensive jumps when setting the data register. This does require some inline assembly, but only to prevent the compiler from optimizing it out. The assembly forces the compiler to generate an alternative optimization (a skip branch instruction), which is actually faster.

uint8_t d = *localDataOutRegister;
d |= outmask1; // always set bit
asm volatile("" : "+r" (d)); // prevent compiler optimization. generates no assembly
if ((value & 0x01) == 0) d &= outmask2; // only reset if needed
*localDataOutRegister = d;

RobTillaart commented 1 month ago

reference = 0.4.0 version no loop unroll

Performance - time in us
        write: 15.34
        write: 29.43
        Delta: 14.10

Proposed optimization - only added in the not unrolled version for quick test.

Performance - time in us
        write: 15.08
        write: 28.92
        Delta: 13.84

shorter code is equally fast.

    //  process one bit
    uint8_t d = *localDataOutRegister | outmask1; // always set bit
    asm volatile("" : "+r" (d)); // prevent compiler optimization. generates no assembly
    if ((value & m) == 0) d &= outmask2; // only reset if needed
    *localDataOutRegister = d;

RobTillaart commented 1 month ago

@nt314p

(still not unrolled loop as that is easiest to patch)

Assuming that dataOutRegister does not change

clock bit is set back to what it was,
there are no interrupts.

  uint8_t d0 = *localDataOutRegister & outmask2;  //  cache 0
  uint8_t d1 = d0 | outmask1;                     //  cache 1
  for (uint8_t m = 1; m > 0; m <<= 1)
  {
    //  process one bit
    uint8_t d = d1;               //  always set bit
    asm volatile("" : "+r" (d));  //  prevent compiler optimization. generates no assembly
    if ((value & m) == 0) d = d0; //  only reset if needed
    *localDataOutRegister = d;
    // if ((value & m) == 0) *localDataOutRegister &= outmask2;
    // else                  *localDataOutRegister |= outmask1;
    uint8_t r = *localClockRegister;
    *localClockRegister = r | cbmask1;  //  set one bit
    *localClockRegister = r;            //  reset it
  }

==>

Performance - time in us
        write: 14.33
        write: 27.42
        Delta: 13.09

looks like an improvement, can you shoot a hole in it or is it stable?

RobTillaart commented 1 month ago

@nt314p

A step back, no asm required.

  uint8_t d0 = *localDataOutRegister & outmask2;  //  cache 0
  uint8_t d1 = d0 | outmask1;                     //  cache 1
  for (uint8_t m = 1; m > 0; m <<= 1)
  {
    //  process one bit
    if ((value & m) == 0) *localDataOutRegister = d0;
    else *localDataOutRegister = d1;
    // if ((value & m) == 0) *localDataOutRegister &= outmask2;
    // else                  *localDataOutRegister |= outmask1;
    uint8_t r = *localClockRegister;
    *localClockRegister = r | cbmask1;  //  set one bit
    *localClockRegister = r;            //  reset it
  }

==>

Performance - time in us
        write: 14.08
        write: 26.91
        Delta: 12.83

From 14.10 => 12.83 means the unrolled loop could go from 11.51 => 10.2 something.

Opinion?

nt314p commented 1 month ago

Oh yep shortening the code like that works too. I can rework my assembly to take into account your proposed optimization, as it is currently clashing with it I believe.

One thing would be to initialize the clock to 0. This is so that if the data and clock lines are on the same register, when you cache the data you also have clock low.

RobTillaart commented 1 month ago

The clock is set low in the constructor. It is changed to high and back to low after data is set. So the start condition of the loop for relevant pins is restored correctly.(Unless I missed something)

If time permits I will check with scope again.

RobTillaart commented 1 month ago

FYI Had a look on the scope this morning and the signals looked similar to those in #15.

RobTillaart commented 1 week ago

@nt314p

Hi Nick,

Found some time to create a develop branch + PR for the last optimization. If you have time, please have a look at the develop branch.

Thanks!

RobTillaart / FastShiftOut

Replace jump branch with skip branch optimization #17