bigtreetech / BIGTREETECH-TouchScreenFirmware

support TFT35 V1.0/V1.1/V1.2/V2.0/V3.0, TFT28, TFT24 V1.1, TFT43, TFT50, TFT70
GNU General Public License v3.0

[FR] General TFT Speed Improvements #2836

Closed rondlh closed 7 months ago

rondlh commented 1 year ago

I want to create a central space for general speed improvement discussions, hopefully resulting in a PR.

To start I want to raise some topics that perhaps can be applied to all applicable areas of the TFT. I can support the proposals with benchmarks to see if, and how much improvement they bring. Of course the most frequently visited code is most interesting in this case (loopProcess!). Let me start with raising some topics, please add yours below...

  1. Calling a function and then checking inside the function whether the call was actually needed is quite inefficient, especially if many parameters have to be passed. It is much faster to check first whether the call is needed, and only then call the function. Applying this idea to Serial_Get can increase the loopProcess scan rate by about 20%.
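To illustrate the idea, here is a minimal self-contained sketch; `rx_available`, `get_slow` and `get_fast` are made-up stand-ins for the real serial code, not names from the TFT code base:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for a serial RX ring buffer (not the TFT's real API). */
static uint8_t  rx_buf[8];
static uint32_t rx_head = 0, rx_tail = 0;

static bool rx_available(void) { return rx_head != rx_tail; }

/* Variant 1: the emptiness check lives inside the function, so the (mostly
 * idle) caller pays the full call/return overhead on every single scan. */
int get_slow(uint8_t *out)
{
  if (rx_head == rx_tail)
    return 0;                      /* called for nothing, most of the time */
  *out = rx_buf[rx_tail++ & 7];
  return 1;
}

/* Variant 2: the caller does the cheap check first and only calls when
 * there is real work: if (rx_available()) get_fast(&c); */
int get_fast(uint8_t *out)
{
  *out = rx_buf[rx_tail++ & 7];
  return 1;
}
```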

  2. 8/16-bit variables vs. 32-bit variables. All MCUs supported here are 32 bit, and using 32-bit variables is generally recommended even where an 8-bit variable would suffice. Most of the instruction set operates on 32-bit registers, so 8- and 16-bit values need to be "emulated": the compiler inserts extra instructions to make a 32-bit register behave like an 8- or 16-bit variable. This of course causes overhead. A quick test on a prototype version of Serial_Get showed that converting its seven 16-bit variables to 32 bit increased the speed by 9%. Not bad, considering it's such an easy task to change a few variable declarations.
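As a sketch (my own example, not TFT code): the same loop with a narrow index and with a natural-width index. `<stdint.h>`'s `uint_fast16_t` expresses the intent portably, since it means "at least 16 bits, whatever is fastest", which is a full 32-bit register on these MCUs:

```c
#include <stdint.h>

/* On a 32-bit Cortex-M the compiler may need extra narrowing instructions
 * (e.g. UXTB/UXTH) to give 8/16-bit locals their required wrap-around
 * behavior, so a 16-bit loop index can cost extra work per iteration. */
uint32_t sum_narrow(const uint8_t *buf, uint16_t len)
{
  uint32_t sum = 0;

  for (uint16_t i = 0; i < len; i++)   /* 16-bit index: may need masking */
    sum += buf[i];

  return sum;
}

/* Same routine with uint_fast16_t locals: no narrowing work needed. */
uint32_t sum_wide(const uint8_t *buf, uint_fast16_t len)
{
  uint32_t sum = 0;

  for (uint_fast16_t i = 0; i < len; i++)
    sum += buf[i];

  return sum;
}
```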

  3. @kisslorand showed that the % (modulo) operator is quite expensive, because it requires an integer division, which is computationally intensive. I'm aware of a few solutions: 3A. @kisslorand showed that in some cases a counter can be used instead, which is faster. 3B. If a power of two (2, 4, 8, 16, 32...) can be used as the divisor, then the modulo operator is no longer expensive, as it reduces to a bitwise AND. In example 3A we needed to count to 1000; if this could be converted to 1024, the problem would be gone. This is useful for buffer memories! 3C. Counting down to 0 is faster than counting up: I found a speed difference of 15% when considering only the counting part. This is because there are specific assembler instructions for comparing to 0, but not for comparing to 1000. So for example:

for (int i = 0; i < 1000; i++) { /* do your task 1000 times */ }  // counting up is slower than

for (int i = 1000; i > 0; i--) { /* do your task 1000 times */ }  // counting down to 0

So for example if you want to make a string uppercase, then do it from the back to the front, not from the front to the back.
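A sketch of that uppercase example (my own code, `str_upper_backward` is an invented name):

```c
#include <ctype.h>
#include <string.h>

/* Uppercase a string walking from the end toward index 0: the loop test
 * becomes a compare-with-zero, which the decrement provides essentially
 * for free (flags set by the subtract), unlike a compare against the length. */
void str_upper_backward(char *s)
{
  size_t i = strlen(s);

  while (i-- > 0)
    s[i] = (char)toupper((unsigned char)s[i]);
}
```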

  4. I found many cases in the code that look like this: if (!x) { do A } else { do B }. Faster, and easier to understand, is: if (x) { do B } else { do A }. (UPDATE: This is incorrect; swapping could be faster or slower, see below why.) Examples: Touch_Enc_ReadPen, Touch_Enc_ReadBtn, Touch_Enc_ReadPos (Touch_Encoder.c), zOffsetNotifyError, zOffsetDraw, menuZOffset (many!), loopCheckHeater.

UPDATE: After some tests I found that for speed it's best to put the most common case on top, like this: if (condition) { this happens 90% of the time } else { this happens 10% of the time }. It can actually pay to invert the guard just to put the most common case on top.

Side quest for readability: (a != true) ---> (a == false), (a != false) ---> (a == true). Here's a very creative example (print.c): if (infoHost.connected != true || enterFolder(infoFile.file[infoFile.fileIndex]) == false)

  5. General algorithm improvements.

  6. Event-driven code instead of polling.

rondlh commented 1 year ago

Here an example of type 5 (general algorithm improvement). ORIGINAL CODE (with the timer rollover bugfix, which causes "next" to actually be the "previous" timestamp): NEXT_FAN_WAIT = 500

This task is in loopProcess, so it's executed tens of thousands of times a second. FanControl.c

static uint32_t nextCtrlFanTime = 0;

void loopFan(void)
{
  for (uint8_t i = 0; i < MAX_FAN_COUNT; i++)
  {
    if (GET_BIT(needSetFanSpeed, i) && (OS_GetTimeMs() - nextCtrlFanTime > NEXT_FAN_WAIT))
    {
      if (storeCmd(fanCmd[i], setFanSpeed[i]))
      {
        SET_BIT_OFF(needSetFanSpeed, i);
      }
      nextCtrlFanTime = OS_GetTimeMs();  // avoid rapid fire, clogging the queue
    }
  }
}

This task will most of the time do nothing because of the timeout, so let's test that first, before starting to loop through the fans.

void loopFan(void)
{
  if (OS_GetTimeMs() - nextCtrlFanTime > NEXT_FAN_WAIT)
  {
    for (uint8_t i = 0; i < MAX_FAN_COUNT; i++)
    {
      if (GET_BIT(needSetFanSpeed, i))
      {
        if (storeCmd(fanCmd[i], setFanSpeed[i]))
        {
          SET_BIT_OFF(needSetFanSpeed, i);
        }
        nextCtrlFanTime = OS_GetTimeMs();  // avoid rapid fire, clogging the queue
      }
    }
  }
}

rondlh commented 1 year ago

SpeedControl.c Same issue as loopFan above.

NEXT_SPEED_WAIT = 500

static uint32_t nextSpeedTime = 0;
void loopSpeed(void)
{
  for (uint8_t i = 0; i < SPEED_NUM; i++)
  {
    if (infoSettings.ext_count == 0 && i > 0)
    {
      // Don't poll M221 if there are no extruders
      continue;
    }

    if (GET_BIT(needSetPercent, i) && (OS_GetTimeMs() - nextSpeedTime > NEXT_SPEED_WAIT))
    {
      if (storeCmd("%s S%d D%d\n", speedCmd[i], setPercent[i], heatGetToolIndex()))
      {
        SET_BIT_OFF(needSetPercent, i);
      }
      nextSpeedTime = OS_GetTimeMs();  // avoid rapid fire, clogging the queue
    }
  }
}

Again, a lot of work is done before checking if it was actually needed.

void loopSpeed(void)
{
  if (OS_GetTimeMs() - nextSpeedTime > NEXT_SPEED_WAIT)
  {
    for (uint8_t i = 0; i < SPEED_NUM; i++)
    {
      if (infoSettings.ext_count == 0 && i > 0)
      {
        // Don't poll M221 if there are no extruders
        continue;
      }

      if (GET_BIT(needSetPercent, i))
      {
        if (storeCmd("%s S%d D%d\n", speedCmd[i], setPercent[i], heatGetToolIndex()))
        {
          SET_BIT_OFF(needSetPercent, i);
        }
        nextSpeedTime = OS_GetTimeMs();  // avoid rapid fire, clogging the queue
      }
    }
  }
}

Additional: loopFan and loopSpeed become active at the same time; they are in sync, causing loopProcess to suddenly slow down when this happens. It's probably better to let them run out of sync:

static uint32_t nextSpeedTime = 250;

This probably happens in more places... it's best to separate such tasks in time, to prevent peaks in the workload.
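A tiny simulation of the staggering idea (my own sketch; `tasks_due` and its numbers are invented for illustration). With a half-period offset the two 500 ms tasks never fall due in the same pass:

```c
#include <stdint.h>

#define NEXT_FAN_WAIT    500
#define NEXT_SPEED_WAIT  500

/* Offsetting one initial timestamp by half a period de-synchronizes the
 * two tasks: each loopProcess pass then does at most one expensive job. */
static uint32_t nextCtrlFanTime = 0;
static uint32_t nextSpeedTime   = NEXT_SPEED_WAIT / 2;   /* 250 ms offset */

/* Sketch: how many of the two 500 ms tasks fall due at time 'now'. */
uint32_t tasks_due(uint32_t now)
{
  uint32_t due = 0;

  if (now - nextCtrlFanTime > NEXT_FAN_WAIT)   { nextCtrlFanTime = now; due++; }
  if (now - nextSpeedTime   > NEXT_SPEED_WAIT) { nextSpeedTime   = now; due++; }

  return due;
}
```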

rondlh commented 1 year ago

Case 3: buzzer.c

#define BUZZER_CACHE_SIZE 5

The value should be 4 or 8, not 5, because modulo BUZZER_CACHE_SIZE is used a lot.
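With a power-of-two size the compiler can reduce the modulo to a single AND; a sketch with an illustrative queue (`cache`, `wrap` and `cache_push` are invented names, not the actual buzzer struct):

```c
#include <stdint.h>

#define BUZZER_CACHE_SIZE 8                    /* must be a power of two */
#define BUZZER_CACHE_MASK (BUZZER_CACHE_SIZE - 1)

static uint32_t cache[BUZZER_CACHE_SIZE];
static uint32_t head = 0;

/* With a power-of-two size, 'idx % BUZZER_CACHE_SIZE' equals
 * 'idx & BUZZER_CACHE_MASK', and the compiler emits a single AND for
 * either spelling instead of a division. */
uint32_t wrap(uint32_t idx)
{
  return idx & BUZZER_CACHE_MASK;
}

void cache_push(uint32_t v)
{
  cache[wrap(head++)] = v;
}
```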

rondlh commented 1 year ago

Task priorities

If you look at loopProcess, loopBackEnd and loopFrontEnd, then you can see that everything in there has the same priority, including all the processes inside these 2 functions.

void loopProcess(void)
{
  loopBackEnd();
  loopFrontEnd();
}

Everything is executed tens of thousands of times a second. For the back end that is probably useful, but the front end doesn't need it; let's say 200 updates a second is enough. This could be implemented like this:

#define FRONTEND_INTERVAL  (5)   // delay in ms for front end calls, lower is more frequent
static uint32_t lastLoopFrontEnd = 0;

void loopProcess(void)
{
  loopBackEnd();

  if (OS_GetTimeMs() - lastLoopFrontEnd > FRONTEND_INTERVAL)
  {
    loopFrontEnd();
    lastLoopFrontEnd = OS_GetTimeMs();
  }
}

This will increase the scan rate to 60K/s. Not bad considering we started at 35K. Further refinements could be added to both the loopBackEnd and loopFrontEnd. Something like this could be done:

static uint32_t counter = 0;
void loopBackEnd(void) {
  important_process_1;  // Run 100% of the time
  important_process_2;
  important_process_3;

  counter++;
  if (counter % 2) return;  // The code below runs 50% of the time
  less_important_process_1;
  less_important_process_2;
  less_important_process_3;

  if (counter % 4) return;  // The code below runs 25% of the time
  non_time_critical_process_1;
  non_time_critical_process_2;
  non_time_critical_process_3;
}

This way we can make sure the important processes get the most attention.

Update: Some discussion is needed about the best way to do this: what to move up and down, and what percentages to use. For a quick test I added 1 line in loopBackEnd; this significantly focuses the attention on the core of the TFT task (>100K scans/sec).

  // Retrieve and store (in command queue) the gcodes received from other UART, such as ESP3D etc...
  #ifdef SERIAL_PORT_2
    Serial_GetFromUART();
  #endif

  // Check changes in encoder steps
  #if LCD_ENCODER_SUPPORT
    #ifdef HAS_EMULATOR
      if (MENU_IS_NOT(menuMarlinMode))
    #endif
    {
      LCD_Enc_CheckSteps();
    }
  #endif

  if ((counter++ % 16)) return;  // IRON, Run 6% of the time only

  // Temperature monitor
  loopCheckHeater();

  // Fan speed monitor
  loopFan();

  // Speed & flow monitor
  loopSpeed();

  // Buzzer handling
  #ifdef BUZZER_PIN
    loopBuzzer();
  #endif

digant73 commented 1 year ago

(quoting the full "Task priorities" comment above)

this is probably the most useful/needed and easy-to-implement enhancement. Instead of % I would use the more efficient & operator, e.g.:

if ((counterBE++ & 0x000F) != 0) return;

instead of if ((counter++ % 16)) ...

I would apply the same on loopProcess():

void loopProcess(void)
{
  loopBackEnd();

  if ((counterFE++ & 0x000F) != 0) return;

  loopFrontEnd();
}
rondlh commented 1 year ago

This is basically point 3B.

3B. If a power of two (2, 4, 8, 16, 32...) can be used, then the modulo operator is no longer expensive.

My (untested) assumption is that if you use a power of two, the compiler will apply the substitution you show... but let me test it to know for sure.

void loopProcess(void)
{
  loopBackEnd();

  if ((counterFE++ & 0x000F) != 0) return;

  loopFrontEnd();
}

UPDATE: I don't even have to benchmark this: the compiled FW files are identical, so the compiler already applies this substitution.

rondlh commented 1 year ago

I did some tests to confirm the impact of point 4: if (!x) { do A } else { do B }. Faster, and easier to understand, is: if (x) { do B } else { do A }.

I found a speed difference of 28%, but not in the way I expected. I wanted to test how much impact the guard itself has, so I minimized the work done by task A and task B (A = "a++", B = "a--").

What I found is that a guard that evaluates to true is 28% faster than a guard that evaluates to false in this case. This is probably because when the guard is false the code needs to jump to the "false" case, which is not needed for the "true" case.

Let me confirm this with a real task which is executed 1000 times a second.

Original code:

void loopTouchScreen(void)  // Handle in interrupt
{
  static uint8_t touch;
  if (!XPT2046_Read_Pen())
  {
    if (touch >= 20)  // 20ms
    {
      touchScreenIsPress = true;
    }
    else
    {
      touch++;
    }
  }
  else
  {
    touchScreenIsPress = false;
    touch = 0;
  }
}

Here it makes sense to swap the cases, because the screen is almost never pressed.

Removed guard inversion, swapped cases:

void loopTouchScreen(void)  // Handle in interrupt
{
  static uint8_t touch;
  if (XPT2046_Read_Pen())
  {
    touchScreenIsPress = false;
    touch = 0;
  }
  else
  {
    if (touch >= 20)  // 20ms
    {
      touchScreenIsPress = true;
    }
    else
    {
      touch++;
    }
  }
}

This swap increases the algorithm speed by 6%.

Conclusion: For speed put the most commonly occurring case on top.
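Related note: besides reordering the branches by hand, GCC and Clang (the compilers used for these boards) also accept `__builtin_expect` to state the common case explicitly. A hedged sketch; `pen_down` and `touch_poll` are illustrative stand-ins, not the real touch code:

```c
#include <stdbool.h>

/* GCC/Clang branch-prediction hints; plain pass-through elsewhere. */
#if defined(__GNUC__)
  #define LIKELY(x)   __builtin_expect(!!(x), 1)
  #define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
  #define LIKELY(x)   (x)
  #define UNLIKELY(x) (x)
#endif

bool pen_down = false;   /* illustrative stand-in for the touch pen state */

/* The hint lets the compiler lay out the "not pressed" path as the
 * fall-through (no taken branch), matching the manual reordering above. */
int touch_poll(void)
{
  if (LIKELY(!pen_down))
    return 0;            /* common case: screen not pressed */
  return 1;              /* rare case: screen pressed */
}
```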

kisslorand commented 1 year ago

In the OP you should also add at point 3 that a comparison with 0 (zero) is the fastest for several reasons:

  1. Zero Register: Many processor architectures, including Cortex ARM MCUs, have dedicated hardware for comparing a register or memory location to zero. This is because zero is a common value for comparison, especially in conditional branches (e.g., "branch if equal to zero"). Having dedicated hardware for zero comparisons allows for faster execution because it eliminates the need to load zero from memory or another register before performing the comparison.

  2. Conditional Branching: In many programming constructs, you need to make decisions based on the result of a comparison, such as branching to different code paths if a value is greater than, equal to, or less than zero. The zero flag or zero condition code is often readily available in the processor's status flags, which can be used for conditional branching. This simplifies the branch decision-making process and can lead to faster execution when comparing to zero.

  3. Simplified Arithmetic Operations: Some instructions in processor architectures are optimized for zero-based operations. For example, subtracting zero from a value or adding zero to a value doesn't change the value itself, and processors may have shortcuts or optimizations for these operations.

  4. Compiler Optimization: Compilers are often designed to generate efficient machine code. When they encounter comparisons to zero, they may generate more efficient code sequences because zero comparisons are so common. This can result in faster execution even without explicit programmer optimization.

  5. Cache Behavior: Zero is often used as a "magic number" in code for various purposes. When zero values are encountered frequently, they may be cached in memory or registers, making subsequent comparisons to zero faster due to cache hit rates.

  6. Instruction Pipelining: In some processor architectures, the instruction pipeline may be optimized for specific operations, including comparisons to zero. Processors may be able to predict the outcome of such comparisons early in the pipeline, allowing for faster execution.

rondlh commented 1 year ago

(quoting kisslorand's full list above on why a comparison with 0 is the fastest)

Good points, perhaps you can find some concrete examples of this in the code base? I recall comparing counting from 1000 down to 0 with counting from 0 up to 1000 (to go from ms to seconds). I found that counting down is about 15% faster (considering only the counting part). That is indeed because of the comparison with 0: there are specific assembler instructions that work on just 1 MCU register (not on 2), which makes it faster.

kisslorand commented 1 year ago

3. 3B. If a power of two (2, 4, 8, 16, 32...) can be used, then the modulo operator is no longer expensive. In example 3A we needed to count to 1000, but if this could be converted to 1024 then the problem would be gone. This is useful for buffer memories!

I was already contemplating using TIM7 to count 1024 ticks for every second. It would be very beneficial to count the seconds that way, as a % 1024 equals a & 0x3FF, the latter being superior in execution time (the compiler would anyway optimize the former into the bitwise operation in the compiled FW). My problem is what to do with macros defined in milliseconds.

Not the prettiest looking but it can work :

#define _MS(a) (((a) * 1024) / 1000)
#define MACRO1_TIME _MS(500)
#define MACRO2_TIME _MS(300)
...

Of course it will be tricky to set the prescaler and the counter of the timer. We have multiples of 1,000,000 Hz as MCU frequency, which is not really divisible by 1024.
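A possible refinement of the `_MS()` idea (my own untested variant): parenthesize the argument so compound expressions expand safely, and round to the nearest 1024 Hz tick instead of truncating:

```c
#include <stdint.h>

/* Convert milliseconds to 1024 Hz ticks, rounding to nearest.
 * Parenthesizing 'ms' makes _MS(a + b) expand safely. */
#define _MS(ms)  ((uint32_t)(((ms) * 1024u + 500u) / 1000u))

#define MACRO1_TIME _MS(500)   /* 512 ticks */
#define MACRO2_TIME _MS(300)   /* 307 ticks */
```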

rondlh commented 1 year ago
  1. 3B. If a power of two (2, 4, 8, 16, 32...) can be used, then the modulo operator is no longer expensive. In example 3A we needed to count to 1000, but if this could be converted to 1024 then the problem would be gone. This is useful for buffer memories!

I was already contemplating using TIM7 to count 1024 ticks for every second. It would be very beneficial to count the seconds that way, as a % 1024 equals a & 0x3FF, the latter being superior in execution time (the compiler would anyway optimize the former into the bitwise operation in the compiled FW). My problem is what to do with macros defined in milliseconds.

Not the prettiest looking but it can work:

#define _MS(a) (((a) * 1024) / 1000)
#define MACRO1_TIME _MS(500)
#define MACRO2_TIME _MS(300)

I personally would not go in this direction, at least not for timing, for 3 reasons:

  1. Underlying algorithms, especially in the HAL, might depend on accurate "ticks" of 1 ms.
  2. This happens only 1000 times a second, which is not very much, so the total impact will be almost nothing. Your counting solution already helps, but even that, if benchmarked against the TFT scan rate, would probably be within the noise margin.
  3. Time must stay accurate: my (modified) TFT is aware of the current time and tells me when a print will finish when I start it. That is not super accurate either, of course, but I would not like the TFT time to be several minutes off after a day or 2. UPDATE: actually the time would stay accurate, only the "ticks" would be affected.

kisslorand commented 1 year ago

HAL is not using TIM7, at least not that I know of. Anyway it was just an idea I was playing with but didn't find a reasonable solution. Not a biggie, I am just sharing my thoughts on the OP subject.

rondlh commented 1 year ago

HAL is not using TIM7, at least not that I know of. Anyway it was just an idea I was playing with but didn't find a reasonable solution. Not a biggie, I am just sharing my thoughts on the OP subject.

I don't know the details, but I'm sure that the HAL uses "ticks" left and right, for initialization, delays and such. If suddenly a tick is not a ms anymore, that could cause issues. Or do you mean you want to use an additional timer for this purpose?

kisslorand commented 1 year ago

I mean only that HAL is not using TIM7 as far as I know, my idea was to play only with TIM7 to configure it to 1024 counts for every second so it would be easy-peasy to check if a second has passed. I am really not aware of any function in HAL using TIM7.

rondlh commented 1 year ago

Case 6: Polling vs. event driven processes

M300 (beep) buzzer.

POLLING METHOD (Current implementation) An M300 command stores the sound data (frequency, duration) in a queue and exits. Then loopProcess polls the buzzer code, and when data is found the TFT plays the sound; this is implemented with interrupts toggling the speaker on and off at a high frequency. When the sound is done it is removed from the queue, and the next sound can be picked up by the next polling round.

EVENT DRIVEN APPROACH (Under development) An M300 command stores the sound info in a queue AND starts the processing of the first sound. When the sound is done, it is removed from the queue AND the next sound is started. This way the sound processing does not need to be in the main loop process. Everything happens in the background with interrupts anyway, so the final interrupt call can start the next sound if any more sounds need to be played.

Note: I have this implemented already, and I use PWM (STM32F2/F4) to generate the sound, not interrupts. Unfortunately some TFTs (e.g. the BTT TFT35 V3.0) made a poor hardware choice and use the same timer (TIM4, channels 1-2) for the speaker and the backlight dimming. This means that you cannot generate sounds and control the TFT backlight at the same time.

This event-driven approach is also used for DMA serial writing and interrupt-based serial writing. No task needs to be added to loopProcess; the arrival of new data to be transmitted (Serial_Put) is the event that triggers the start of the serial send process. There might be more tasks that can be changed this way.
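A host-side sketch of this interrupt chain (all names are invented; the fake "ISR" is called manually where hardware would raise a TX-complete interrupt): `serial_put` queues the data and kicks off the first byte only when the line is idle, and from then on each completion starts the next byte by itself.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TX_BUF_SIZE 64                  /* power of two: cheap index wrap */

static char     tx_buf[TX_BUF_SIZE];
static uint32_t tx_head = 0, tx_tail = 0;
static bool     tx_busy = false;        /* is the UART currently sending? */

static char     tx_sent[TX_BUF_SIZE];   /* stands in for the physical wire */
static uint32_t tx_sent_len = 0;

/* Completion "ISR": runs when one byte has gone out. It either starts the
 * next byte or marks the line idle. No polling task is involved. */
void tx_complete_isr(void)
{
  if (tx_tail == tx_head)               /* queue drained: go idle */
  {
    tx_busy = false;
    return;
  }
  tx_sent[tx_sent_len++] = tx_buf[tx_tail++ & (TX_BUF_SIZE - 1)];
}

/* The event: queue the data, and if the line is idle, kick off the first
 * byte. From then on the interrupt chain keeps itself going. */
void serial_put(const char *s)
{
  while (*s)
    tx_buf[tx_head++ & (TX_BUF_SIZE - 1)] = *s++;

  if (!tx_busy)
  {
    tx_busy = true;
    tx_complete_isr();                  /* start the chain */
  }
}
```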

rondlh commented 1 year ago

I just reviewed the current buzzer code and came across this jewel...

  while (buzzer.count == BUZZER_CACHE_SIZE)
  { 
    loopBuzzer();
  }

Very smart: when the sound queue is full, pause the printing so you can listen to the sounds.
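A non-blocking alternative (a sketch with made-up names, not the real buzzer struct) would refuse the new beep when the queue is full, instead of spinning in loopBuzzer until space appears:

```c
#include <stdbool.h>
#include <stdint.h>

#define BUZZER_CACHE_SIZE 8             /* power of two: '%' compiles to AND */

typedef struct { uint16_t freq, duration; } Sound;

static Sound    sound_queue[BUZZER_CACHE_SIZE];
static uint32_t q_head = 0, q_tail = 0;

/* Non-blocking enqueue: report failure instead of busy-waiting until the
 * player frees a slot (which stalls the whole main loop). */
bool buzzer_try_add(uint16_t freq, uint16_t duration)
{
  if (q_head - q_tail == BUZZER_CACHE_SIZE)
    return false;                       /* full: drop (or count) the beep */

  sound_queue[q_head++ % BUZZER_CACHE_SIZE] = (Sound){ freq, duration };
  return true;
}
```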

github-actions[bot] commented 4 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.