bigtreetech / BIGTREETECH-TouchScreenFirmware

support TFT35 V1.0/V1.1/V1.2/V2.0/V3.0, TFT28, TFT24 V1.1, TFT43, TFT50, TFT70
GNU General Public License v3.0
1.32k stars 1.65k forks source link

[FR] Improved serial data processing (smarter reading, faster writing) #2835

Closed rondlh closed 1 year ago

rondlh commented 1 year ago

1. Serial reading The current FW uses DMA for reading data from the serial ports, which is done in a very efficient way, the process is started and will run automatically in the background. Interrupts are generated when data is available, this happens when the serial line goes idle. This is overall a very efficient approach, but has a small drawback. The received data only gets processed once the serial line goes idle, under most circumstances this is fine because the received messages are short and fragmented ("ok\n"). But if a long continues message is received, like the response to M43 (pins information), then a TFT buffer overrun can occur and a part of the received message is lost. This issue can be solved by not waiting for the serial idle interrupt, but immediately starting to process the data that is being received. Based on the improvements in #2824 (improvements in Serial_Get) it is quite obvious how to implement this. I will provide some sample code soon.

The advantages:

2. Serial writing Practically the TFT only needs to receive a little bit of data ("ok\n" messages), but needs to write a lot of data to the motherboard (the gcode commands, about 40 characters per command). The current serial write implementation is very slow, the software actually needs to wait after each byte until it's physically send over the serial line. This slows down the TFT response time, especially for lower baud rates and when the workload is high anyway. A buffered DMA based would significantly help to increase the TFT responsiveness.

Currently I have a DMA based serial write solution under test, which works in this way:

The advantages:

A disadvantage is that this code is hardware dependent (DMA setup). I will post the STM32F2_4 sample code here soon for review and discussion.

I have some questions:

  1. Is it acceptable to provide this solution for one hardware platform only?

  2. Who can help test/implement a STM32F10x implementation? BTT TFT24 V1.1 BTT TFT28 V1.0 BTT TFT35 1.0/1.1/1.2/2.0 BTT GD TFT24 V1.1 BTT GD TFT35 V2.0 MKS TFT28 V3.0/4.0 MKS TFT28 NEW GENIUS MKS TFT32 V1.3/1.4 MKS TFT32L V3

  3. Who can help test/implement a gd32f20x implementation? BTT GD TFT35 V3.0 BTT GD TFT35 E3 V3.0 BTT_GD TFT35 B1 V3.0 BTT GD TFT43 V3.0 BTT GD TFT43 V3.0 BTT GD TFT50 V3.0

rondlh commented 1 year ago

I updated the code a bit, see below. Please make sure to #define IDLE_INTERRUPT if you use the idle interrupt, or just remove the "#ifdef IDLE_INTERRUPT" and "#endif"

QUESTIONS: There is a blocking line in Serial_Get: // wait until space becomes available, blocking! while ( ( (dmaL1DataTX[port].wIndex + 1) % dmaL1DataTX[port].cacheSize) == dmaL1DataTX[port].rIndex) { } This line makes the TFT wait for buffer space if it's not available. I wonder if I should put any process in there that could be done safely during the waiting time. adding loopProcess would be a recursive call I guess, but perhaps Serial_Read can continue. Any suggestions?

void Serial_Put(uint8_t port, const char *s) // send a zero terminated string to uart port
{
  while (*s)
  { // copy serial characters to buffer until end of string '/0'

    // setup TX DMA, if no '\n' yet, but buffer is full AND DMA is not in progress already (waiting for Transfer Complete interrupt)
    if ((((dmaL1DataTX[port].wIndex + 1) % dmaL1DataTX[port].cacheSize) == dmaL1DataTX[port].rIndex) && (((Serial[port].uart->CR1 & (1<<6)) == 0)))
      Serial_Send_TX(port);

    // wait until space becomes available, blocking!
    while ( ( (dmaL1DataTX[port].wIndex + 1) % dmaL1DataTX[port].cacheSize) == dmaL1DataTX[port].rIndex) { }

    dmaL1DataTX[port].cache[dmaL1DataTX[port].wIndex] = *s; // copy character to cache
    dmaL1DataTX[port].wIndex = (dmaL1DataTX[port].wIndex + 1) % dmaL1DataTX[port].cacheSize; // update wIndex

    if ((*s == '\n') && (((Serial[port].uart->CR1 & (1<<6)) == 0)))
      Serial_Send_TX(port); // start DMA process if command is complete and DMA is not in progress already
    s++; // let the compiler optimize this, no need to do it manually!
  }
}

And the ISR looks like shown below: Note the #ifdef is for testing only, so I can compare:

  1. Using or disabling the Serial idle interrupt
  2. test unbuffered writing
  3. interrupt based writing
  4. DMA based writing.
void USART_IRQHandler(uint8_t port) // ISR, serial interrupt handler
{
#ifdef IDLE_INTERRUPT
  // handle serial idle interrupt
  if ((Serial[port].uart->SR & (1<<4)) != 0) // RX: check for serial Idle interrupt
  {
  //  Serial[port].uart->DR; // done in the guard above already
    Serial[port].uart->DR;
    dmaL1DataRX[port].wIndex = dmaL1DataRX[port].cacheSize - Serial[port].dma_streamRX->NDTR;
  }
#endif

#ifdef INTERRUPT_WRITE // interrupt based serial writing
  if ((Serial[port].uart->SR & (1<<7)) != 0) // check for TXE interrupt
  {
    if (dmaL1DataTX[port].rIndex == dmaL1DataTX[port].wIndex) // no more data?
      Serial[port].uart->CR1 &= ~(1<<7); // clear TXE Interrupt bit  
    else
    { 
      Serial[port].uart->DR = (uint8_t)dmaL1DataTX[port].cache[dmaL1DataTX[port].rIndex];      // write next available character
      dmaL1DataTX[port].rIndex = (dmaL1DataTX[port].rIndex + 1) % dmaL1DataTX[port].cacheSize; // Increase read index
    }
  }
#endif

#ifdef DMA_WRITE // DMA based serial writing
  if ((Serial[port].uart->SR & (1<<6)) != 0) // TX: check for Transfer Complete (TC)
  {
    Serial[port].uart->SR &= ~(1<<6); // clear Transfer Complete (TC) bit 

    // NOTE: the ISR is sometimes called while DMA is still active, so check NDTR status!
    if (Serial[port].dma_streamTX->NDTR == 0) // sending is complete
    {
      dmaL1DataTX[port].rIndex = (dmaL1DataTX[port].rIndex + Serial[port].txBytes[port]) % dmaL1DataTX[port].cacheSize;
      Serial[port].txBytes[port] = 0;

      if (dmaL1DataTX[port].wIndex != dmaL1DataTX[port].rIndex) { // is more data available?
        Serial_Send_TX(port); // continue sending data
      }
      else { // removed causes double line transfers
       // dmaL1DataTX[port].rIndex = dmaL1DataTX[port].wIndex = 0; // buffer is empty, start from the start to keep transfers in 1 DMA cycle if possible
        Serial[port].uart->CR1 &= ~(1<<6);  // disable Transfer Complete (TC) interrupt, nothing more to do
      }
    }
//     else: more data is coming, wait for next Transfer Complete (TC) interrupt
  }
#endif 
}

And the DMA send code:

void Serial_Send_TX(uint8_t port) 
{ 
  // setup DMA transfer, and wait for serial Transfer Complete (TX) interrupt in ISR
  if (dmaL1DataTX[port].wIndex >= dmaL1DataTX[port].rIndex)
    Serial[port].txBytes[port] = dmaL1DataTX[port].wIndex - dmaL1DataTX[port].rIndex;
  else
    Serial[port].txBytes[port] = dmaL1DataTX[port].cacheSize - dmaL1DataTX[port].rIndex; // split transfer into 2 parts

  Serial[port].dma_streamTX->M0AR = (uint32_t)(&dmaL1DataTX[port].cache[dmaL1DataTX[port].rIndex]); // start address TX data
  Serial_DMAClearInterruptFlagsTX(port); // clear DMA TX interrupt flags
  Serial[port].uart->CR1 |= (1<<6);      // enable Transfer Complete (TX) serial Interrupt
  Serial[port].dma_streamTX->NDTR = Serial[port].txBytes[port]; // no. bytes to transfer
  Serial[port].dma_streamTX->CR |= 1<<0; // enable TX DMA
}
digant73 commented 1 year ago

And the ISR looks like shown below: Note the #ifdef is for testing only, so I can compare:

  1. Using or disabling the Serial idle interrupt
  2. test unbuffered writing
  3. interrupt based writing
  4. DMA based writing.

What should I use preferably? #define DMA_WRITE?

digant73 commented 1 year ago

@rondlh In void Serial_ClearData(uint8_t port) you forgot to initialize flag, as you know used by Serial_Get() in dmaL1DataRX

Use this (similar to the current implementation):

void Serial_ClearData(uint8_t port)
{
  dmaL1DataRX[port].wIndex = dmaL1DataRX[port].rIndex = dmaL1DataRX[port].flag = dmaL1DataRX[port].cacheSize = 0;

  if (dmaL1DataRX[port].cache != NULL)
  {
    free(dmaL1DataRX[port].cache);
    dmaL1DataRX[port].cache = NULL;
  }

  dmaL1DataTX[port].wIndex = dmaL1DataTX[port].rIndex = dmaL1DataTX[port].flag = dmaL1DataTX[port].cacheSize = 0;

  if (dmaL1DataTX[port].cache != NULL)
  {
    free(dmaL1DataTX[port].cache);
    dmaL1DataTX[port].cache = NULL;
  }
}

Anyway, it seems none of the options for USART_IRQHandler() (e.g. #define IDLE_INTERRUPT) is working on my BTT TFT35. The TFT freezes at boot on BTT logo. Attached the Serial.h/c I'm using.

NOTE: I would rename your Get_wIndex() function to uint16_t Serial_GetWritingIndex(uint8_t port) just to maintain the naming convention used for all other functions in Serial API and SerialConnection API. Make it inline if you prefer. Also remember that Serial_PutChar() is used in RRFSendCmd.c.

Serial.zip

rondlh commented 1 year ago

What should I use preferably? #define DMA_WRITE?

Yes.

Good catch with the RX flag, TX doesn't use a flag, but I added it anyway.

Renamed: Get_wIndex to Serial_GetWritingIndex

Also remember that Serial_PutChar() is used in RRFSendCmd.c.

Correct, I rolled the PutChar into Serial_Put so there it is not needed anymore. The function still exists of course, but I need to update you on the latest version.

No issues in your serial files, all write modes should work.

Here my serial files: STM32F2_F4-Serial.zip

I see what's wrong, please define IDLE_INTERRUPT so the section in the ISR is enabled.

digant73 commented 1 year ago

ok, I will test it tonight. In Serial.c I would also move Serial_Send_TX before Serial_Put so you don't need to define it in Serial.h (it is only invoked internally) in Serial.c

rondlh commented 1 year ago

In Serial.c I would also move Serial_Send_TX before Serial_Put so you don't need to define it in Serial.h (it is only invoked internally) in Serial.c

OK, good idea. Done in serial.c and updated serial.h (Question, do you think it is difficult to implement sending the gcode with CRC check? I have some concerns about the serial data communication after my recent tests).

digant73 commented 1 year ago

OK, good idea. Done in serial.c and updated serial.h (Question, do you think it is difficult to implement sending the gcode with CRC check? I have some concerns about the serial data communication after my recent tests).

Do you mean to provide CRC such as "N1 G28*46\n"

rondlh commented 1 year ago

OK, good idea. Done in serial.c and updated serial.h (Question, do you think it is difficult to implement sending the gcode with CRC check? I have some concerns about the serial data communication after my recent tests).

Do you mean to provide CRC such as "N1 G28*46\n"

Yes, there is more to it... like a gcode resend I believe.

digant73 commented 1 year ago

to provide CRC you could do something like that:

static uint32_t line_number = 0;
char cmd_ptr[100];

stripChecksum(cmd);                     // e.g. "G28\n" -> "G28"
sprintf(cmd_ptr, "N%d %s", line_number++, cmd);                     // e.g. "G28" -> "N2 G28"
sprintf(&cmd_ptr[strlen(cmd_ptr)], "*%d\n", getChecksum(cmd_ptr));  // e.g. "N2 G28" -> "N2 G28*56\n"

where cmd is your plain gcode (e.g. G28\n), stripChecksum() and getChecksum() are already available in \TFT\src\User\my_misc.c

digant73 commented 1 year ago
void Serial_Send_TX(uint8_t port)
{
  // setup DMA transfer, and wait for serial Transfer Complete (TX) interrupt in ISR
  if (dmaL1DataTX[port].wIndex >= dmaL1DataTX[port].rIndex)
    Serial[port].txBytes[port] = dmaL1DataTX[port].wIndex - dmaL1DataTX[port].rIndex;
  else
    Serial[port].txBytes[port] = dmaL1DataTX[port].cacheSize - dmaL1DataTX[port].rIndex;  // split transfer into 2 parts

are you sure the above count is correct in case the transfer requires 2 parts? shouldn't it be (cacheSize - rIndex) + wIndex. Similar to what it is done in Serial_Get()

rondlh commented 1 year ago

to provide CRC you could do something like that:

Great, thanks, I tagged it onto SendCmd and that works already. For the DMA writing I've been logging the gcode commands in Marlin via the USB serial port, then comparing the data to the source gcode. I found some data corruption (sometimes 1 character is missing), but this turned out NOT to be related to the DMA wring, it always happens in my setup (BTT TFT35V3 + MKS Monster 8V1). At 250K baud it happens every 6-10MBytes, at 115200 baud it happens much less, but I still found some rare cases. Usually the data corruption doesn't create invalid gcode, so Marlin does not complain. It could be an artifact of how I test, I'm not sure. With this CRC code I can bring Marlin back to the normal behavior and see if any issues are reported. I worry I'm not the only one with this issue... I will investigate further.

digant73 commented 1 year ago

to provide CRC you could do something like that:

Great, thanks, I tagged it onto SendCmd and that works already. For the DMA writing I've been logging the gcode commands in Marlin via the USB serial port, then comparing the data to the source gcode. I found some data corruption (sometimes 1 character is missing), but this turned out NOT to be related to the DMA wring, it always happens in my setup (BTT TFT35V3 + MKS Monster 8V1). At 250K baud it happens every 6-10MBytes, at 115200 baud it happens much less, but I still found some rare cases. Usually the data corruption doesn't create invalid gcode, so Marlin does not complain. It could be an artifact of how I test, I'm not sure. With this CRC code I can bring Marlin back to the normal behavior and see if any issues are reported. I worry I'm not the only one with this issue... I will investigate further.

see my previous post about a possible bug managing the TX buffer

rondlh commented 1 year ago

are you sure the above count is correct in case the transfer requires 2 parts? shouldn't it be (cacheSize - rIndex) + wIndex.

The code has been tested a lot... should not have big mistakes. This is the transfer size of the first part that runs into the end of the buffer. So cacheSize - rIndex. There will be 2 DMA transfers. txBytes[port] keeps track of how many bytes have been send already.

Is DMA running?

digant73 commented 1 year ago

I'm at office

rondlh commented 1 year ago

About the CRC... It seems not very difficult. Send the Gcode like you indicated. Reset line number count with M110 N

Process error messages: Error:Line Number is not Last Line Number+1, Last Line: 123 Error: checksum mismatch, Last Line: 123 Error: No Checksum with line number, Last Line: 123 And resend request: Resend: 123

This would require the print to continue at line 123 (or 124?) This would probably require to keep an index of the last line number and the index into the file, so the lines can be resend: Line 120 --> file index 1000 Line 121 --> file index 1040 Line 122 --> file index 1082 Line 123 --> file index 1090 Line 124 --> file index 1130 Line 125 --> file index 1160 Line 126 --> file index 1170 Line 127 --> file index 1200 So printing must resume at line 123, file index 1090.

I think this is a useful feature, and perhaps the most reliable way to communicate. Programs like Pronterface and Octoprint use this strategy so they can recover from communication errors. Someone said: "You don't achieve reliable communication by eliminating errors - you achieve it by detecting and recovering from errors." That is a good point.

digant73 commented 1 year ago

tested the different solutions. I would say the most responsive solution on my BTT TFT35V3 is currently provided by:

#define IDLE_INTERRUPT
#define INTERRUPT_WRITE

With that, I'm able to forward even the entire M43 output (more than 10K ) to Wifi adapter. While with DMA and (obviously) default write I'm, most of the time, not able to get the proper output. The interrupt based write seems to me the balanced implementation. Just copying the message on TX buffer and starting the transfer giving the control to loopProcess() just to also allow the reading of other messages to possibly forward on another port (e.g. wifi).

If you implemented also the version for STM32F10X I could also test it on another TFT

rondlh commented 1 year ago

tested the different solutions. I would say the most responsive solution on my BTT TFT35V3 is currently provided by:

#define IDLE_INTERRUPT
#define INTERRUPT_WRITE

With that, I'm able to forward even the entire M43 output (more than 10K ) to Wifi adapter. While with DMA and (obviously) default write I'm, most of the time, not able to get the proper output. The interrupt based write seems to me the balanced implementation. Just copying the message on TX buffer and starting the transfer giving the control to loopProcess() just to also allow the reading of other messages to possibly forward on another port (e.g. wifi)

Interesting, which Serial_Get do you use?

digant73 commented 1 year ago

the last one with your changes (memcpy etc...) but all in one (no extra inline function). I used that:

uint16_t Serial_Get(SERIAL_PORT_INDEX portIndex, char * buf, uint16_t bufSize)
{
  // wIndex: update L1 cache's writing index (dynamically changed (by L1 cache's interrupt handler) variables/attributes)
  //         and make a static access (32 bit) to it to speedup performance on this function
  //
  uint32_t wIndex = dmaL1DataRX[portIndex].wIndex = Serial_GetWritingIndex(portIndex);  // get the latest wIndex
  uint32_t flag = dmaL1DataRX[portIndex].flag;                                          // get the current flag position

  if (flag == wIndex)  // if no data to read from L1 cache, nothing to do
    return 0;

  uint32_t cacheSize = dmaL1DataRX[portIndex].cacheSize;

  while (dmaL1DataRX[portIndex].cache[flag] != '\n' && flag != wIndex)  // check presence of "\n" in available data
  {
    flag = (flag + 1) % cacheSize;
  }

  if (flag == wIndex)  // if "\n" was not found (message incomplete), update flag and exit
  {
    dmaL1DataRX[portIndex].flag = flag;  // update queue's custom flag with flag (also equal to wIndex)

    return 0;
  }

  // rIndex: L1 cache's reading index (not dynamically changed (by L1 cache's interrupt handler) variables/attributes)
  //
  DMA_CIRCULAR_BUFFER * dmaL1Data_ptr = &dmaL1DataRX[portIndex];
  char * cache = dmaL1Data_ptr->cache;
  uint32_t rIndex = dmaL1Data_ptr->rIndex;

  while (cache[rIndex] == ' ' && rIndex != flag)  // remove leading empty space, if any
  {
    rIndex = (rIndex + 1) % cacheSize;
  }

  // msgSize: message size (after updating rIndex removing leading empty spaces). Last +1 is for the terminating null character '\0'
  uint32_t msgSize = (cacheSize + flag - rIndex) % cacheSize + 2;

  // if buf size is not enough to store the data plus the terminating null character "\0", skip the data copy
  //
  // NOTE: the following check should never be matched if buf has a proper size and there is no reading error.
  //       If so, the check could be commented out just to improve performance. Just keep it to make the code more robust
  //
  if (bufSize < msgSize)
    return 0;

  if (rIndex <= flag)  // data is one chunk only, from rIndex to flag
  {
    memcpy(buf, &cache[rIndex], msgSize - 1);
    buf += msgSize - 1;
  }
  else  // data at end and beginning of cache
  {
    memcpy(buf, &cache[rIndex], cacheSize - rIndex);
    buf += cacheSize - rIndex;

    memcpy(buf, cache, flag + 1);
    buf += flag + 1;
  }

  *buf = '\0';  // add end character

  // update queue's custom flag and reading index with next index
  dmaL1Data_ptr->flag = dmaL1Data_ptr->rIndex = (flag + 1) % cacheSize;

  return msgSize;  // return the number of bytes stored in buf
}
rondlh commented 1 year ago

With that Serial_Get it means that the Idle interrupt doesn't play much of a role anymore, because of: uint32_t wIndex = dmaL1DataRX[portIndex].wIndex = Serial_GetWritingIndex(portIndex);

The scanrate of Serial_Get is much faster than the byte receive rate of the serial line.

I'm surprised to see that you see a difference. Anyway, the Idle interrupt doesn't hurt anything.

I would expect that DMA writing would be best, but the interrupt is also fast and can loop around through the buffer without a new 2nd DMA process. I found that the DMA almost guarantees that the bytes are sent without "pause" between the bytes, only once in a while (half our or so) I found a pause, so I still had to handle that case. I would suggest to leave the code in place, and just select the one you think is best by using the #define

BTW: I could also program the reading to be interrupt based and check for '\n' and set a message counter or so to indicate the status. But a risk is that if you cannot keep up with the data reading then data could be lost. When using DMA you don't have that problem.

digant73 commented 1 year ago

I know the Idle interrupt is no more useful to update wIndex but I need it to avoid the freeze at boot.

I suggest to make the same simple test I made, Simply send an M43 from laptop to printer's wifi adapter and verify what you receive.

rondlh commented 1 year ago

I know the Idle interrupt is no more useful to update wIndex but I need it to avoid the freeze at boot.

I suggest to make the same simple test I made, Simply send an M43 from laptop to printer's wifi adapter and verify what you receive.

Freeze solution: uart.c, disable the idle interrupt, so you don't need it in the ISR anymore

#ifdef IDLE_INTERRUPT
    USART_ITConfig(uart[port], usart_it, ENABLE); 
    USART_ClearITPendingBit(uart[port], usart_it);
#endif

Yes, I can receive the M43 output, both with interrupt and with DMA... tried it a long time ago already. Now the TFT is much faster so it should work even better. I have a scan rate of 150K currently :D (It was 35K before). I even tried it while printing, that also works. With only the updated Serial_Get the scanrate is just below 60K/s.

Unrelated question: Do you know which function call takes the longest within the TFT? The initial draw of a menu?

digant73 commented 1 year ago

150K with only the new Serial_Get (or equivalent to the one I provided above) or you made also other changes (such as the ones proposed on the new thread you opened)? What TFT are you using?

rondlh commented 1 year ago

150K with only the new Serial_Get (or equivalent to the one I provided above) or you made also other changes (such as the ones proposed on the new thread you opened)? What TFT are you using?

With loopProcess changes... BTT TFT35 V3.0 I can show you how I test... quite easy with your monitoring menu...

digant73 commented 1 year ago

ok, a common benchmark can help

digant73 commented 1 year ago

Unrelated question: Do you know which function call takes the longest within the TFT? The initial draw of a menu?

hmm no, there a lot of functions invoked by loopProcess also making different things according to TFT status or configurations

rondlh commented 1 year ago
void parseACK(void)
{
  while (infoHost.counter++ && newDataAvailable(SERIAL_PORT) && (ack_len = Serial_Get(SERIAL_PORT, ack_cache, ACK_CACHE_SIZE)) != 0)  // if some data have been retrieved
  {
void updatePrintTime(void)
{
  if (infoPrinting.printing && !infoPrinting.paused)
  {
    infoPrinting.elapsedTime++;

    if (infoPrinting.remainingTime > 0 && !heatHasWaiting())
      infoPrinting.remainingTime--;
  }
  infoHost.result = infoHost.counter; // IRON, debugging
  infoHost.counter = 1;
}

I cheat a bit... initialize counter to 1 :D

rondlh commented 1 year ago

Unrelated question: Do you know which function call takes the longest within the TFT? The initial draw of a menu?

hmm no, there a lot of functions invoked by loopProcess also making different things according to TFT status or configurations

I understand, but is there any task/process that can keep the MCU busy for more than let's say 10ms? The loopProcess is very fast on average, but there might be some small hickups, like when you just change to another menu. I want to know how long that takes. I expect no problems actually, just checking.

UPDATE: I just did a quick benchmark on the PrintMenu, it takes 112ms to draw the menu, which is quite long. With RAPID_PRINTING_COMM enabled it's not a problem, because loopBackEnd will be inserted between the drawing of each item with this macro:

#define RAPID_PRINTING_COMM() if (isPrinting() == true && infoSettings.serial_always_on != 1) {loopBackEnd();}

I think this should be:

#define RAPID_PRINTING_COMM() if (isPrinting() == true && infoSettings.serial_always_on == DISABLED) {loopBackEnd();}

The consistency in the code is a bit poor at places, just search for "serial_always_on =="

digant73 commented 1 year ago

do you have the issue with lost char with all writing methods or only with one of them in particular?

this inline uint16_t Serial_GetWritingIndex(uint8_t port) in Serial.c does not make the function inline. You should define it as static inline in Serial.h. The same for newDataAvailable(SERIAL_PORT), if you defined it in the same way.

Do you think it is possible to implement at least the interrupt based serial writing also for STM32F10x and/or GD32F20x? it will be even enough (with the changes already done in Serial_Get) to clearly improve the TFT reading (fixing the issue with messages exceeding the RX buffer size) and transmission performance (bottleneck not decoupling RX from TX) . I could test the STM32F10x version

rondlh commented 1 year ago

do you have the issue with lost char with all writing methods or only with one of them in particular?

  1. YES, at first I thought it was DMA related, and I wasted a lot of time investigating it, then I found the problem also occurs with the standard serial (interrupt write also). So it seems it's just a serial communication issue. I expect it's more common than we think.

this inline uint16_t Serial_GetWritingIndex(uint8_t port) in Serial.c does not make the function inline. You should define it as static inline in Serial.h. The same for newDataAvailable(SERIAL_PORT), if you defined it in the same way.

  1. OK, interesting... struggling a bit with it, getting error messages. I basically just want a copy of the function where it is used, not a function call. Kind of like a #define I guess, but with a #define you can run into scope issues very easily. I found this info: A “static” function is one whose scope is only the current source file. It cannot be called from code compiled in any other source file. An “inline” function is one that is expanded in-place wherever it is mentioned within the current source file and is actually not really a function at all.
  1. I guess newDataAvailable should be Serial_NewDataAvailable, following your guidelines...

  2. And please don't forget to update your Serial_PutChar... I made some changes. (check the serial.c I send to you before)

BTW: I would recommend to make these often used function and parameters uint32_T, it's faster. It gave a 9% performance gain for Serial_Get, only by changing the types (7x).

Do you think it is possible to implement at least the interrupt based serial writing also for STM32F10x and/or GD32F20x? it will be even enough (with the changes already done in Serial_Get) to clearly improve the TFT reading and transmission performance. I could test the STM32F10x version

  1. I could try to implement the code for the other HALs too, but I don't have the hardware. I can send you the serial files for F10x code, it will probably take several tries to get it right. DMA or Interrupt, or both? I could buy a BTT TFT35 V3.0.1, benchmarking it would be quite interesting.
digant73 commented 1 year ago

for 2) You get a compile error because you should also define Serial data in Serial.h. For that reasons I always prefer to avoid inline functions (they make the code less readable and more difficult to maintain). I would honestly avoid to define both Serial_GetWritingIndex and Serial_NewDataAvailable (this last one is not even small code so if called in different places it will waste flash space) as inline. for 3) same as 2) for 5) yes I would start with TX interrupt based writing (it should be more easy to implement) and I could test it on STM32F10. I would add DMA support even later. Just start with interrupt version and test it. IMHO it is just a big improvement (it is enough to decouple RX, loopProcess and TX)

rondlh commented 1 year ago

for 2) You get a compile error because you should also define Serial data in Serial.h. For that reasons I always prefer to avoid inline functions (they make the code less readable and more difficult to maintain). I would honestly avoid to define both Serial_GetWritingIndex and Serial_NewDataAvailable (this last one is not even small code so if called in different places it will waste flash space) as inline. for 3) same as 2) for 5) yes I would start with TX interrupt based writing (it should be more easy to implement) and I could test it on STM32F10. I would add DMA support even later. Just start with interrupt version and test it. IMHO it is just a big improvement (it is enough to decouple RX, loopProcess and TX)

I just tested it. Inline at Serial_GetWritingIndex and Serial_NewDataAvailable does NOTHING at all. Firmware checksums are identical. That means that the inlinedirective doesn't do anything (here) or it's even inlined without it.

Just do what you think is best, I can benchmark things (speed/filesize) and let you know if I find any worthwhile improvements.

digant73 commented 1 year ago

I think the all in one Serial_Get I posted yesterday is the best one (compact and readable). I will also make some benchmark with your NewDataAvailable and other changes. IMHO, if possible from your side, I would try to implement the TX interrupt based writing for GD and STM32F10 version. I would also test the STM32F10 version in deep. After that we could open a PR implementing your open FR #2835 (new Serial_Get and new TX interrupt based writing). With both (and also the Advanced OK feature) the TFT received an important improvement with very few code.

rondlh commented 1 year ago

I think the all in one Serial_Get I posted yesterday is the best one (compact and readable). I will also make some benchmark with your NewDataAvailable and other changes. IMHO, if possible from your side, I would try to implement the TX interrupt based writing for GD and STM32F10 version. I would also test the STM32F10 version in deep. After that we could open a PR implementing your open FR (new Serial_Get and new TX interrupt based writing). With both (and also the Advanced OK feature) the TFT received an important improvement with very few code.

I would like to have the ADVANCED_OK code #2824 merged asap, it's stable and a big step forwards. Further steps all fine with me.

You mean the code in the Serial.c you posted? That one looks fine to me. I think the whole file is the same as mine except for the Serial_PutChar(outdated!) and some minor details (order of functions, inline directive). I will work on the other platforms... I bought a GD32 based TFT, should arrive this Friday.

digant73 commented 1 year ago

I think the all in one Serial_Get I posted yesterday is the best one (compact and readable). I will also make some benchmark with your NewDataAvailable and other changes. IMHO, if possible from your side, I would try to implement the TX interrupt based writing for GD and STM32F10 version. I would also test the STM32F10 version in deep. After that we could open a PR implementing your open FR (new Serial_Get and new TX interrupt based writing). With both (and also the Advanced OK feature) the TFT received an important improvement with very few code.

I would like to have the ADVANCED_OK code #2824 merged asap, it's stable and a big step forwards. Further steps all fine with me.

I hope it will be merged soon too. The code to handle advanced ok is very small (about 250 byte without the extra monitoring menu). The PR includes many other minor cleanup/improvements/bugfixes.

You mean the code in the Serial.c you posted? That one looks fine to me. I think the whole file is the same as mine except for the Serial_PutChar(outdated!) and some minor details (order of functions, inline directive). I will work on the other platforms... I bought a GD32 based TFT, should arrive this Friday.

Serial.c is basically your implementation. I was referring more to Serial_Get function (AIO I posted yesterday). It includes the newDataAvailable. By the way I will make some test with/without other changes (there is time to test here and there, no problem). Of course your Serial.c requires a cleanup (e.g. avoid to enable IDLE interrupt due it is no more needed etc...). We first need to implement the most easy interrupt based writing method for all the 3 chips. IMHO that will be enough to open the PR while in the meanwhile working/testing the DMA solution. I would make possible in Serial.c to select the method (using #ifdef). But there is time for making this cleanup once we have the implementation for the 3 chips. We can ask the user having GD to make some testing.

rondlh commented 1 year ago

Sorry, but I'm still confused about which Serial_Get you mean, the one 3 pages above here is fine, it's the same as in Serial.c I believe. If you mean another one then could you please post it again?

BTW: The monitoring menu is really nice, that's where I show the benchmark results (I added a few lines). I also show which serial write is used (no buffering, interrupt, DMA). Very useful and quick to access from anywhere, even when printing.

The gcode checksum function really helps me to test if the communication is working well. Marlin will report any issue immediately. Comparing gcode logfile is a bit painful because a lot of regular expression filtering is needed. I have tested the interrupt and DMA serial send code on STM32F2_F4 for a long time. I think it is working fine, but please test it.

digant73 commented 1 year ago

Sorry, but I'm still confused about which Serial_Get you mean, the one 3 pages above here is fine, it's the same as in Serial.c I believe. If you mean another one then could you please post it again?

BTW: The monitoring menu is really nice, that's where I show the benchmark results (I added a few lines). I also show which serial write is used (no buffering, interrupt, DMA). Very useful and quick to access from anywhere, even when printing.

The gcode checksum function really helps me to test if the communication is working well. Marlin will report any issue immediately. Comparing gcode logfile is a bit painful because a lot of regular expression filtering is needed. I have tested the interrupt and DMA serial send code on STM32F2_F4 for a long time. I think it is working fine, but please test it.

I'm referring to this (the Serial_Get I'm currently using):

uint16_t Serial_Get(SERIAL_PORT_INDEX portIndex, char * buf, uint16_t bufSize)
{
  // wIndex: update L1 cache's writing index (dynamically changed (by L1 cache's interrupt handler) variables/attributes)
  //         and make a static access (32 bit) to it to speedup performance on this function
  //
  uint32_t wIndex = dmaL1DataRX[portIndex].wIndex = Serial_GetWritingIndex(portIndex);  // get the latest wIndex
  uint32_t flag = dmaL1DataRX[portIndex].flag;                                          // get the current flag position

  if (flag == wIndex)  // if no data to read from L1 cache, nothing to do
    return 0;

  uint32_t cacheSize = dmaL1DataRX[portIndex].cacheSize;

  while (dmaL1DataRX[portIndex].cache[flag] != '\n' && flag != wIndex)  // check presence of "\n" in available data
  {
    flag = (flag + 1) % cacheSize;
  }

  if (flag == wIndex)  // if "\n" was not found (message incomplete), update flag and exit
  {
    dmaL1DataRX[portIndex].flag = flag;  // update queue's custom flag with flag (also equal to wIndex)

    return 0;
  }

  // rIndex: L1 cache's reading index (not dynamically changed (by L1 cache's interrupt handler) variables/attributes)
  //
  DMA_CIRCULAR_BUFFER * dmaL1Data_ptr = &dmaL1DataRX[portIndex];
  char * cache = dmaL1Data_ptr->cache;
  uint32_t rIndex = dmaL1Data_ptr->rIndex;

  while (cache[rIndex] == ' ' && rIndex != flag)  // remove leading empty space, if any
  {
    rIndex = (rIndex + 1) % cacheSize;
  }

  // msgSize: message size (after updating rIndex removing leading empty spaces). Last +1 is for the terminating null character '\0'
  uint32_t msgSize = (cacheSize + flag - rIndex) % cacheSize + 2;

  // if buf size is not enough to store the data plus the terminating null character "\0", skip the data copy
  //
  // NOTE: the following check should never be matched if buf has a proper size and there is no reading error.
  //       If so, the check could be commented out just to improve performance. Just keep it to make the code more robust
  //
  if (bufSize < msgSize)
    return 0;

  if (rIndex <= flag)  // data is one chunk only, from rIndex to flag
  {
    memcpy(buf, &cache[rIndex], msgSize - 1);
    buf += msgSize - 1;
  }
  else  // data at end and beginning of cache
  {
    memcpy(buf, &cache[rIndex], cacheSize - rIndex);
    buf += cacheSize - rIndex;

    memcpy(buf, cache, flag + 1);
    buf += flag + 1;
  }

  *buf = '\0';  // add end character

  // update queue's custom flag and reading index with next index
  dmaL1Data_ptr->flag = dmaL1Data_ptr->rIndex = (flag + 1) % cacheSize;

  return msgSize;  // return the number of bytes stored in buf
}

Your Serial_newDataAvailable fucntion is inside it and not externally called just to establish if Serial_Get has to be called or not. I would make things easy (one call with three params instead of two functions. Remember that at the end Serial_newDataAvailable was not even inline so you are currently calling a real function with one param plus another function with 3 params when a new message is available). As reported in your other thread, the priorities on loopProcess is already the best way to increase a lot the scan rate for backend most important tasks (e.g. Serial_Get()). I'm currently not using the boost provided by the priorities in loopProcess and the goal to reach both reading and writing of even long messages (e.g M43 output) is already reached while it is not possible at all with the current BTT fw even if you apply priorities on loopProcess..

EDIT: in my code I also increased the TX buffer size to 256 for each serial port. M43 output provides messages of more than 200 bytes, so a buffer of 256 byte (1K total for 4 serial ports) is enough to avoid any blocking period on your Serial.c. It should be enough even in case ADVANCED_OK is enabled (e.g. having up to 8 free TX slots, so the possibility that the TFT is requested to buffer 8 messages in a very short period. With an average size of gcode of 30 bytes a buffer of 256 bytes is enough to hold 8 gcodes without any need to wait (so also not blocking Serial_Put and thus also loopProcess) for free space). Another of my current PRs is reducing memory usage by more than 2.2K so the overall 1K I assigned to TX buffers is still less than 50% of freed memory

rondlh commented 1 year ago

Your Serial_newDataAvailable fucntion is inside it and not externally called just to establish if Serial_Get has to be called or not. I would make things easy (one call with three params instead of two functions. Remember that at the end Serial_newDataAvailable was not even inline so you are currently calling a real function with one param plus another function with 3 params when a new message is available).

Benchmark: External Serial_newDataAvailable: 151K scans/s Internal Serial_newDataAvailable: 144K scans/s (7% slower)

As reported in your other thread, the priorities on loopProcess is already the best way to increase a lot the scan rate for backend most important tasks (e.g. Serial_Get()). I'm currently not using the boost provided by the priorities in loopProcess and the goal to reach both reading and writing of even long messages (e.g M43 output) is already reached while it is not possible at all with the current BTT fw even if you apply priorities on loopProcess..

Please note that what I write in #2836 is mostly theoretical, I just started using these changes a few days ago. Any feedback, good or bad is welcome. As you know, the main reason why M43 doesn't work with the current master is the Idle interrupt approach. The TFT would only become aware of the message after the buffer overrun has already occurred. The new Serial_Get mostly solves this issue, and also buffered writing will be another step in the right direction. In this situation Send and Receive are racing against each other at the same speed. RX is always about 1 line ahead.

EDIT: in my code I also increased the TX buffer size to 256 for each serial port. M43 output provides messages of more than 200 bytes, so a buffer of 256 byte (1K total for 4 serial ports) is enough to avoid any blocking period on your Serial.c. It should be enough even in case advanced ok is enabled (e.g. having up to 8 free TX slots so the possibility that the TFT will buffer 8 messages in a very short period. With an average size of gcode of 30 bytes a buffer of 256 bytes is enough). Another of my current PRs is reducing memory usage by more than 2.2K so the 1K assigned to TX buffers is still less than 50% of freed memory

I was conservative with allocating the TX buffer, but 256 bytes seems reasonable. Perhaps SERIAL_PORT could have 512 bytes. With the new approach the RX buffer size can be reduced. It's at 4K for the BTT TFT34V3.0 (96K Ram). The smallest amount of RAM on any of the supported boards is 48K, then the RX buffer would be 3K, but I guess with the new approach 1K would easily be enough. (Needs testing). So in total the memory usage does not have to increase anyway.

digant73 commented 1 year ago

Your Serial_newDataAvailable fucntion is inside it and not externally called just to establish if Serial_Get has to be called or not. I would make things easy (one call with three params instead of two functions. Remember that at the end Serial_newDataAvailable was not even inline so you are currently calling a real function with one param plus another function with 3 params when a new message is available).

Benchmark: External Serial_newDataAvailable: 151K scans/s External Serial_newDataAvailable: 144K scans/s (7% slower)

what about if using uint32_t also on function signature.? So:, something like:

**uint32_t** Serial_Get(**uint32_t** portIndex, char * buf, **uint32_t** bufSize);

As reported in your other thread, the priorities on loopProcess is already the best way to increase a lot the scan rate for backend most important tasks (e.g. Serial_Get()). I'm currently not using the boost provided by the priorities in loopProcess and the goal to reach both reading and writing of even long messages (e.g M43 output) is already reached while it is not possible at all with the current BTT fw even if you apply priorities on loopProcess..

Please note that what I write in #2836 is mostly theoretical, I just started using these changes a few days ago. Any feedback, good or bad is welcome. As you know, the main reason why M43 doesn't work with the current master is the Idle interrupt approach. The TFT would only become aware of the message after the buffer overrun has already occurred. The new Serial_Get mostly solves this issue, and also buffered writing will be another step in the right direction. In this situation Send and Receive are racing against each other at the same speed. RX is always about 1 line ahead.

Yes and no. In case you send gcodes from a TFT's serial port (e.g. wifi) the M43 issue (or similar scenarios) was also due to the TX module coupled with RX and loopProcess. The improved Serial_Get() should not be enough even with all the other changes you made to increase scan rate. TX module was the main bottleneck in the TFT (entirely coupled with RX and loopProcess). Rx and loopProcess were already decoupled but RX not well implemented as we know.

EDIT: in my code I also increased the TX buffer size to 256 for each serial port. M43 output provides messages of more than 200 bytes, so a buffer of 256 byte (1K total for 4 serial ports) is enough to avoid any blocking period on your Serial.c. It should be enough even in case advanced ok is enabled (e.g. having up to 8 free TX slots so the possibility that the TFT will buffer 8 messages in a very short period. With an average size of gcode of 30 bytes a buffer of 256 bytes is enough). Another of my current PRs is reducing memory usage by more than 2.2K so the 1K assigned to TX buffers is still less than 50% of freed memory

I was conservative with allocating the TX buffer, but 256 bytes seems reasonable. Perhaps SERIAL_PORT could have 512 bytes. With the new approach the RX buffer size can be reduced. It's at 4K for the BTT TFT34V3.0 (96K Ram). The smallest amount of RAM on any of the supported boards is 48K, then the RX buffer would be 3K, but I guess with the new approach 1K would easily be enough. (Needs testing). So in total the memory usage does not have to increase anyway.

We could even set TX buffer size to 512 for main port without decreasing anything else in particular with the PR freeing more than 2.2K of RAM.

rondlh commented 1 year ago

Additional: Please note that I didn't forget to update the while in parseACK for this benchmark:

void parseACK(void)
{
//  while (infoHost.counter++ && Serial_NewDataAvailable(SERIAL_PORT) && (ack_len = Serial_Get(SERIAL_PORT, ack_cache, ACK_CACHE_SIZE)) != 0)  // if some data have been retrieved
  while (infoHost.counter++ && (ack_len = Serial_Get(SERIAL_PORT, ack_cache, ACK_CACHE_SIZE)) != 0)  // if some data have been retrieved

what about if using uint32_t also on function signature.? So:, something like: **uint32_t** Serial_Get(**uint32_t** portIndex, char * buf, **uint32_t** bufSize);

I partly did that already, because I'm too lazy to change the prototype in the h-file.

So for both I used: uint32_t Serial_Get(SERIAL_PORT_INDEX portIndex, char * buf, uint32_t bufSize)

I didn't change the type of SERIAL_PORT_INDEX, because that would have a very wide spread impact, perhaps impacting the HAL, and sometimes uint8_t instead of SERIAL_PORT_INDEX is used.

I gave it a try anyway... Update: I changed the ports to 32 bit...

  1. Forced SERIAL_PORT_INDEX to be 32 bit
  2. Replaced "(uint8_t port" with "(uint32_t port" globally.
  3. SERIAL_PORT_INFO changed port to 32bit

The scanrate increased by 12% AMAZING!!! Do you remember the 151K scanrate above...it's now at 169K. If you have time... please do some peer review :D You might have different numbers, but you still should see a considerable improvement.

digant73 commented 1 year ago

on my side I have 163K on idle with every things enabled (advanced_ok, file comment, wifi enabled at 500Kb). Disabling file comment and wifi it is 192K on idle and 161K on printing. Disabling file comment (wifi enabled) it is 170K on idle and 145K on printing.

Using:

  // Handle USB communication
  #ifdef USB_FLASH_DRIVE_SUPPORT
    USB_LoopProcess();
  #endif

EDIT: basically the same values (maybe 1k more) in case I use DMA based writing

rondlh commented 1 year ago

EDIT: basically the same values (maybe 1k more) in case I use DMA based writing

Thanks, it's good to know I didn't make a silly mistake somewhere causing these insane scan rates. I didn't expect such a big impact by only changing the port to 32 bits. Good call on the USB communication, I also moved it up now. Reshuffling the processes in BackEnd still needs some work I guess.

I have file comments and Wifi enabled (on port 3). The numbers given are idle numbers. Buffered writing keep the scan rates high during printing. I don't have loopBuzzer in the loopProcess, because I use event triggering, not polling. I use PWM based sound generation, not interrupt based sounds. If possible moving processes out of loopProcess is a good idea. For the frontEnd it doesn't matter so much anymore if we give it low priority. I didn't scan for any potential candidates. Do you have any idea how fast the STM32F1 is compared to the F2?

One issue I see, not sure if it's new, it seems that if the ESP3D sends a message to the TFT serial port before the host is connected, then the connection between the TFT and Host does not get established anymore. Do you understand why this happens?

digant73 commented 1 year ago

EDIT: basically the same values (maybe 1k more) in case I use DMA based writing

Thanks, it's good to know I didn't make a silly mistake somewhere causing these insane scan rates. I didn't expect such a big impact by only changing the port to 32 bits.

I didn't apply the last changes to uint32_t you provided globally. Eventually tell me what you have changed so I will try on my side. Remember that I'm also applying my PR with advanced ok that improved also something here and there.

Good call on the USB communication, I also moved it up now. Reshuffling the processes in BackEnd still needs some work I guess.

Processes sequence on back end and front end functions should be OK (only USB function needed to be moved when using the priority counter).

I have file comments and Wifi enabled (on port 3). The numbers given are idle numbers. Buffered writing keep the scan rates high during printing. I don't have loopBuzzer in the loopProcess, because I use event triggering, not polling. I use PWM based sound generation, not interrupt based sounds. If possible moving processes out of loopProcess is a good idea. For the frontEnd it doesn't matter so much anymore if we give it low priority. I didn't scan for any potential candidates. Do you have any idea how fast the STM32F1 is compared to the F2?

As buffered writing do you mean the old inefficient method or the new interrupt based method? STM32F1 is far slower than F2 so the improvements on RX and TX should provide even more important results

One issue I see, not sure if it's new, it seems that if the ESP3D sends a message to the TFT serial port before the host is connected, then the connection between the TFT and Host does not get established anymore. Do you understand why this happens?

I fixed a similar bug on one of my printers. It seems more a bug in Marlin (it doesn't reply to some gcodes). In my case the bug was present if a SMART runout sensor was connected and enabled in the TFT. The TFT sent a gcode at boot that Marlin didn't reply at all. The bug has been fixed in the PR #2824

digant73 commented 1 year ago

I added the scan rate KPI on the Monitoring menu! I defined the infoMonitoring data struct where other KPI can be eventually. added. As made for other features, I defined macros to handle the KPI (e.g. UPD_SCAN_RATE()) and I put them in the code independently by DEBUG_MONITORING enabled/disabled (if disabled the macros are expanded to empty commands)

EDIT: making some changes on Serial_Put() to speedup the copy

rondlh commented 1 year ago

Thanks, that is very useful input.

I didn't apply the last changes to uint32_t you provided globally. Eventually tell me what you have changed so I will try on my side. Remember that I'm also applying my PR with advanced ok that improved also something here and there.

I guess you mean the change of the serial port from 8 to 32 bit? I just did a quick and dirty hack to see if it even compiles... but it worked the first time: I literally just did what I wrote:

  1. SERIAL_PORT_INDEX is an enumeration, I check the size, it was 8 bit, so it forced it to be 32 bit (SerialConnection.h)
  2. in the SERIAL_PORT_INFO struct I changed port to uint32_t (SerialConnection.h)
  3. I did a replacement of all occurrences of "(uint8_t port" ---> "(uint32_t port". This happens 118 times in total, in18 files. I'm not sure how much each step contributes, but the total gain was definitely worth while.

As buffered writing do you mean the old inefficient method or the new interrupt based method?

With "buffered writing" I mean, interrupt based writing or DMA based writing (they use a buffer). The old way of writting is unbuffered, so there is no need to allocate the TX buffer.

STM32F1 is far slower than F2 so the improvements on RX and TX should provide even more important results

Half the speed? Or even less? I guess F1 will benefit most from speed gains.

I fixed a similar bug on one of my printers. It seems more a bug in Marlin (it doesn't reply to some gcodes). In my case the bug was present if a SMART runout sensor was connected and enabled in the TFT. The TFT sent a gcode at boot that Marlin didn't reply at all. The bug has been fixed in the PR https://github.com/bigtreetech/BIGTREETECH-TouchScreenFirmware/pull/2824

I'm using #2824 as my basis, so I really would like to get it merged. But I still have the connection issue. If I hard reset the TFT then they connect fine. At startup I must make sure that the ESP3D doesn't send the IP address and time too early otherwise the connection issue occurs. I just updated to you latest changes in #2824.

I added a few lines to Monitoring.c

#ifdef INTERRUPT_WRITE
  sprintf(str, "%s   ", "INTERRUPT WRITE");
#else
  #ifdef DMA_WRITE
    sprintf(str, "%s   ", "DMA WRITE");
  #else
    sprintf(str, "%s   ", "STANDARD WRITE");
  #endif
#endif 

Are you aware of any candidate processes that can be changed from "polling" in loopProcess to "event triggered"? loopBuzzer is a good candidate. I can post the code if you think it's useful.

digant73 commented 1 year ago

Thanks, that is very useful input.

I didn't apply the last changes to uint32_t you provided globally. Eventually tell me what you have changed so I will try on my side. Remember that I'm also applying my PR with advanced ok that improved also something here and there.

I guess you mean the change of the serial port from 8 to 32 bit? I just did a quick and dirty hack to see if it even compiles... but it worked the first time: I literally just did what I wrote:

  1. SERIAL_PORT_INDEX is an enumeration, I check the size, it was 8 bit, so it forced it to be 32 bit (SerialConnection.h)
  2. in the SERIAL_PORT_INFO struct I changed port to uint32_t (SerialConnection.h)
  3. I did a replacement of all occurrences of "(uint8_t port" ---> "(uint32_t port". This happens 118 times in total, in18 files. I'm not sure how much each step contributes, but the total gain was definitely worth while.

Ok. I limited the 8 to 32 bit conversion onli on Serial_Get and SERIAL_PORT_INFO data struct but with no evident benefit

As buffered writing do you mean the old inefficient method or the new interrupt based method?

With "buffered writing" I mean, interrupt based writing or DMA based writing (they use a buffer). The old way of writting is unbuffered, so there is no need to allocate the TX buffer.

Ok, as expected the buffered versions were clearly better than the old unbuffered version. I made some changes in Serial_Put just to improve performance as made in Serial_Get. I will test it this night and hopefully share it if providing better results than the current implementation

STM32F1 is far slower than F2 so the improvements on RX and TX should provide even more important results

Half the speed? Or even less? I guess F1 will benefit most from speed gains.

I didn't benchmark the difference but clearly the BTT TFT35V3 is faster for everything. Yes, I think a prototype of TX buffered version should be more useful now than trying to optimize so much other things in TFT. I would leave them at the end.

I fixed a similar bug on one of my printers. It seems more a bug in Marlin (it doesn't reply to some gcodes). In my case the bug was present if a SMART runout sensor was connected and enabled in the TFT. The TFT sent a gcode at boot that Marlin didn't reply at all. The bug has been fixed in the PR #2824

I'm using #2824 as my basis, so I really would like to get it merged. But I still have the connection issue. If I hard reset the TFT then they connect fine. At startup I must make sure that the ESP3D doesn't send the IP address and time too early otherwise the connection issue occurs. I just updated to you latest changes in #2824.

To verify the issue, move in the Monitoring menu and check if tx_count is different than 0. If so, you have the same issue I had with my printer. I had this bug recently after updating Marlin fw. It is a 100% bug on Marlin. Instead of hard reset try to disconnect and reconnect the TFT using the disconnect button on TFT's Connection menu. It worked in my case (the TFT was able to connect to mainboard).

I added a few lines to Monitoring.c

#ifdef INTERRUPT_WRITE
  sprintf(str, "%s   ", "INTERRUPT WRITE");
#else
  #ifdef DMA_WRITE
    sprintf(str, "%s   ", "DMA WRITE");
  #else
    sprintf(str, "%s   ", "STANDARD WRITE");
  #endif
#endif 

Are you aware of any candidate processes that can be changed from "polling" in loopProcess to "event triggered"? loopBuzzer is a good candidate. I can post the code if you think it's useful.

I would avoid to add too much things in this moment. TX version and Priority counters already improved a lot performance far more than necessary to avoid bottlenecks. I would complete the work on TX first so I can also make clean up and possibly create a PR (or eventually you can open it if you prefer).

rondlh commented 1 year ago

Thanks for the input, I will test the connection issue.

I would avoid to add too much things in this moment. TX version and Priority counters already improved a lot performance far more than necessary to avoid bottlenecks. I would complete the work on TX first so I can also make clean up and possibly create a PR (or eventually you can open it if you prefer).

Sure no need to add everything, I'm just investigating. We can see what is useful and what is not, performance should easily be fine with minor changes. It would be nice if you could open the PR, I never did it before. I will give it a try another time.

I bench marked the 32-bit serial port again, and found a much lower number than before, only 2% scan rate increase when using 32-bit ports. In the previous test I must have made a mistake somewhere.

Moving the USB section in loopBackEnd above the counter line brings my scan rate to about 112K. Not bad :D

rondlh commented 1 year ago

I got the display today... I thought. They send me an LVGL 3.5" Capacitive IPS Touch screen, ESP32 based. The demo is very powerful. Unfortunately the BTT FW doesn't run very well on it.

LVGL 35

digant73 commented 1 year ago

Thanks for the input, I will test the connection issue.

I would avoid to add too much things in this moment. TX version and Priority counters already improved a lot performance far more than necessary to avoid bottlenecks. I would complete the work on TX first so I can also make clean up and possibly create a PR (or eventually you can open it if you prefer).

Sure no need to add everything, I'm just investigating. We can see what is useful and what is not, performance should easily be fine with minor changes. It would be nice if you could open the PR, I never did it before. I will give it a try another time.

Ok, no problem. I will make some other test and then I will open a PR

I bench marked the 32-bit serial port again, and found a much lower number than before, only 2% scan rate increase when using 32-bit ports. In the previous test I must have made a mistake somewhere.

Yes I don't think conversion to 32 bit is providing any significative boost.

Moving the USB section in loopBackEnd above the counter line brings my scan rate to about 112K. Not bad :D

rondlh commented 1 year ago

2% speed gain by only changing the serial port to 32-bit is useful is you ask me, especially considering it's a tiny change with no risk. It would be best if there was a consistent serial port type. Then the choice would be easy... The official recommendation is not to use the smallest type your data fits in, but use 32-bit most of the time. Do you have some good tools to do a global replacement in the code? I didn't make the changes manually. Anyway, this is not the most urgent topic right now. Please check #2824... I found a potential issue.

Let me retest the effect of using 32-bit variables on Serial_Get again, to confirm I got it right the first time.

digant73 commented 1 year ago

I made some other optimizations in #2840. With wifi enabled it is now 180K on idle and 152K on printing (with wifi off it is 200K and 165K). Same results for interrupt based and DMA based. But I also made a comparison with the old version (unbuffered) enabling DEFAULT_WRITE in Serial.c and I get the same values. Is there any explanation for that? Compared to old version (unbuffered), I was expecting better results for interrupt and even more for DMA based writing.

EDIT: on STM32F10x I got basically half the performance than STM32F2. With wifi enabled it is now 90K on idle and 74K on printing (with wifi off it is 99K and 80K). Even in this case same result as for old version (DMA version is not yet implemented)